apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Confusing behavior of no-wait exclude when a process is not reporting to the cluster #5871

Open brownleej opened 3 years ago

brownleej commented 3 years ago

When running the exclude no_wait command against a process that is not reporting to the database, the CLI prints a message of the form: "WARNING: Missing from cluster! Be sure that you excluded the correct processes before removing them from the cluster!" It prints the same message whether the address is completely unknown to the database or belongs to a process whose data has not been fully re-replicated. As a result, we cannot use the output of exclude no_wait to determine whether re-replication for that process has completed, which makes it difficult to decide whether it is safe to permanently destroy resources associated with a temporarily unavailable process. By comparison, the blocking form of the exclude command blocks while a process is in this state, until the data is replicated. I think we should change this behavior to give a clearer signal for processes that are missing but still have data, and to bring the no-wait exclude more in line with the blocking exclude.
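To make the requested distinction concrete, here is a minimal sketch of the three cases a caller would want to tell apart. Everything in it is hypothetical: the `ExcludeStatus` enum, the `classify` function, and the two input sets are illustrative names, not part of fdbcli or any FoundationDB API, and how those sets would be obtained is left open.

```python
from enum import Enum


class ExcludeStatus(Enum):
    """Hypothetical states a clearer exclude signal could report."""
    REPORTING = "reporting"                  # process is visible to the cluster
    MISSING_WITH_DATA = "missing_with_data"  # not reporting, data not yet re-replicated
    MISSING_NO_DATA = "missing_no_data"      # not reporting, no data associated


def classify(address, reporting, holding_data):
    """Classify one address.

    `reporting` and `holding_data` are assumed inputs: sets of address
    strings for processes that currently report to the cluster and for
    processes the cluster still associates with data, respectively.
    """
    if address in reporting:
        return ExcludeStatus.REPORTING
    if address in holding_data:
        return ExcludeStatus.MISSING_WITH_DATA
    return ExcludeStatus.MISSING_NO_DATA
```

Today the last two cases produce the same warning; only `MISSING_NO_DATA` would indicate that the resources can be destroyed safely.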

jzhou77 commented 3 years ago

What's your preferred way to get the signal out? Is it some text or error code from the command line?

sfc-gh-clin commented 3 years ago

I already added a special key range (\xff\xff/management/in_progress_exclusion/, \xff\xff/management/in_progress_exclusion0) that reports which processes are still being excluded, meaning their data replication has not finished. It would be trivial to add a new fdbcli command for this, such as

excludeInProgress

to print any processes whose exclusion has not finished yet. Would that be helpful here?
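As a rough sketch of how a client might consume that key range, the helper below extracts addresses from the keys returned by a range read. It assumes that each key is the range prefix followed by the process address, which should be checked against the special-keys documentation. The function name `addresses_in_progress` is hypothetical. The helper takes plain (key, value) pairs so it can be exercised without a running cluster.

```python
# Boundaries of the special key range quoted above.
IN_PROGRESS_BEGIN = b"\xff\xff/management/in_progress_exclusion/"
IN_PROGRESS_END = b"\xff\xff/management/in_progress_exclusion0"


def addresses_in_progress(key_values):
    """Return the addresses whose exclusion has not finished.

    `key_values` is any iterable of (key, value) byte pairs read from the
    range above. Assumption: the address is the key suffix after the prefix.
    """
    addresses = []
    for key, _value in key_values:
        if key.startswith(IN_PROGRESS_BEGIN):
            addresses.append(key[len(IN_PROGRESS_BEGIN):].decode("ascii"))
    return addresses


# With the Python bindings, the pairs would presumably come from a range
# read inside a transaction, along the lines of:
#     tr.get_range(IN_PROGRESS_BEGIN, IN_PROGRESS_END)
# This is left as a comment because it needs a 7.0+ cluster to run.
```

An empty result would mean that no exclusion is currently waiting on data movement.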

johscheuer commented 2 years ago

Will that special key range only be available in 7.0? I only see it in the release-7.0 branch, not in release-6.3. Would it make more sense to read it directly from the database instead of adding a new fdbcli command?

sfc-gh-clin commented 2 years ago


Yeah, it's only available in 7.0, and yes, we can read it directly. Adding a command would just make it easier to use (and could print more help text) if someone cannot remember the key range.
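If the range is read directly, a client could approximate the blocking exclude by polling it. The sketch below is one possible shape for that, not an existing API: `wait_until_not_in_progress` and its parameters are hypothetical, and `read_in_progress` is an assumed callable that returns the set of addresses currently listed in the in-progress exclusion range. Note that an address absent from that range was not necessarily excluded in the first place, so a caller would still need to confirm the exclusion separately.

```python
import time


def wait_until_not_in_progress(address, read_in_progress,
                               poll_seconds=5.0, timeout_seconds=600.0,
                               sleep=time.sleep, clock=time.monotonic):
    """Poll until `address` is no longer listed as in progress.

    Returns True once the address is absent from the set returned by
    `read_in_progress()`, or False if the timeout expires first.
    `sleep` and `clock` are injectable so the loop can be tested.
    """
    deadline = clock() + timeout_seconds
    while True:
        if address not in read_in_progress():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
```

A dedicated fdbcli command would mostly save users from writing this kind of wrapper themselves.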