Netflix / Priam

Co-Process for backup/recovery, Token Management, and Centralized Configuration management for Cassandra.
Apache License 2.0

Priam should order stop and start #753

Open hashbrowncipher opened 5 years ago

hashbrowncipher commented 5 years ago

Describe the bug: It is possible for a sequence of start and stop API calls to leave Cassandra in a state not matching the last call made.

To Reproduce:

  1. Call /cassadmin/stop (don't wait for it to return).
  2. Call /cassadmin/start.

Observed behavior: Cassandra is left down.

Expected behavior: Either of:

  a) Cassandra stops but immediately starts again.
  b) The pending stop is cancelled.

Version: Priam 3.1.63

arunagrawal84 commented 5 years ago

We could have 2 ways to solve this problem:

  1. Put a queue in front of the Cassandra start/stop operations and dequeue them in order.
  2. Hold a lock and throw an error when operations are issued simultaneously.

We took the second approach for the case where an operator issues multiple cluster management tasks (flush, compaction, snapshot, etc.). The expectation is that the operator or script waits for one operation to finish before executing another operation of the same type; e.g., the operator cannot run two compactions at once, but one flush and one compaction are OK.

In this context, start and stop are technically not "similar operations" but "dependent operations". I would rather throw an exception than enqueue operations, because with a queue the operator would need to know which operation came "first". I like the simplicity of the second request just failing with a message like: something is running, try later!
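A minimal sketch of the fail-fast lock idea, assuming a hypothetical `LifecycleGuard` class (the class and method names are illustrative and not taken from Priam's code). A second start/stop request that arrives while one is in flight is rejected rather than queued:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical guard for start/stop requests; not part of Priam's actual API. */
final class LifecycleGuard {
    // True while a start or stop operation is in flight.
    private final AtomicBoolean busy = new AtomicBoolean(false);

    /**
     * Runs the operation if no other start/stop is running.
     * Otherwise fails immediately instead of waiting or enqueueing.
     */
    void runExclusive(String name, Runnable operation) {
        if (!busy.compareAndSet(false, true)) {
            throw new IllegalStateException(
                "Cannot " + name + ": another start/stop is running. Try later!");
        }
        try {
            operation.run();
        } finally {
            // Release the guard even if the operation throws.
            busy.set(false);
        }
    }
}
```

With this shape, a REST handler would translate the `IllegalStateException` into an error response, and the caller decides whether to retry.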

Thoughts?

hashbrowncipher commented 5 years ago

I think that there should be a flag for the desired state (which already exists, iirc), and a thread to bring the state of the world into harmony with the desired state. Whenever the thread finishes an operation (be it stop or start), it should check "has the flag changed?" and if so, begin the loop again.
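A rough sketch of this desired-state loop, assuming a hypothetical `DesiredStateReconciler` class whose `startCassandra` and `stopCassandra` callbacks stand in for whatever Priam actually does to start and stop the process (none of these names come from Priam's code). API calls only record the desired state; a single worker thread re-checks the flag after every operation:

```java
/** Hypothetical sketch; class and method names are illustrative, not Priam's API. */
final class DesiredStateReconciler {
    private final Object lock = new Object();
    private final Runnable startCassandra; // assumed blocking "start" action
    private final Runnable stopCassandra;  // assumed blocking "stop" action
    private boolean desiredUp;             // flag set by API calls
    private boolean actualUp;              // state reached by the last finished operation
    private boolean working;               // true while the worker loop is active

    DesiredStateReconciler(boolean initiallyUp, Runnable startCassandra, Runnable stopCassandra) {
        this.desiredUp = initiallyUp;
        this.actualUp = initiallyUp;
        this.startCassandra = startCassandra;
        this.stopCassandra = stopCassandra;
    }

    /** Records the desired state and returns; at most one worker runs at a time. */
    void setDesired(boolean up) {
        synchronized (lock) {
            desiredUp = up;
            if (working) {
                return; // the running worker will see the new flag when it loops
            }
            working = true;
        }
        Thread worker = new Thread(this::reconcile, "cassandra-reconciler");
        worker.setDaemon(true);
        worker.start();
    }

    /** Loops until the actual state matches the desired state. */
    private void reconcile() {
        while (true) {
            boolean target;
            synchronized (lock) {
                if (actualUp == desiredUp) {
                    working = false;
                    lock.notifyAll();
                    return;
                }
                target = desiredUp;
            }
            // Run the slow operation outside the lock so API calls never block.
            if (target) {
                startCassandra.run();
            } else {
                stopCassandra.run();
            }
            synchronized (lock) {
                actualUp = target;
            }
        }
    }

    /** Blocks until the worker has nothing left to do. */
    void awaitSettled() throws InterruptedException {
        synchronized (lock) {
            while (working) {
                lock.wait();
            }
        }
    }

    boolean isUp() {
        synchronized (lock) {
            return actualUp;
        }
    }
}
```

In the scenario from this issue, a start request that arrives while a stop is still running only flips the flag; once the stop finishes, the worker sees the mismatch and starts Cassandra again, which corresponds to expected behavior (a).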