Netflix / mantis

A platform that makes it easy for developers to build realtime, cost-effective, operations-focused applications
Apache License 2.0
1.42k stars 202 forks source link

Adding DynamoDB high availability features #675

Closed mcowgill-stripe closed 1 month ago

mcowgill-stripe commented 5 months ago

Context

The main purpose of this change is to enable the ability for external consumers to leverage high availability features without needing ZooKeeper and to be able to do that with configuration instead of needing to modify source code or re-implement logic that is available here today.

The first part of this is creating an extension to use DynamoDB for leader election and leader monitoring. We had submitted prior work from @crioux-stripe and myself to add a KV Store for DynamoDB. That started to divide the extension for a "store" project. Looking at what we'd have to add to separate the store from the leader election/monitoring it makes more sense to create an extension for just DynamoDB and include the store, leader election, and leader monitoring. In theory the only difference is the size of the dependencies and any of the three bring with it roughly the same size. However, if the maintainers have different opinions I'm happy to pivot.

Looking at the code today and projects/docs. The idea seems to be that mantis-control-plane-core is a common dependency between the server and client. This does have exceptions, but seems like it's the intent. With that in mind I designed to have the DynamoDB extension only takea dependency on core and I deprecated and moved things into core to enable that. If that's not the goal/intent of the design today, I'm happy to move all that back.

In order to finish this goal of providing a way to inject dependencies for DynamoDB there are additional changes that need to happen in MasterMain.java for mantis server and HighAvailabilityServices inside of the client configurations. The latter one is more blocking for our internal use case today, since it's harder to unwind the different client paths than it is to re-implement MasterMain.

In order to make sure I don't go down paths that the maintainers won't accept, I'm opening the PR with the initial DynamoDB code for leader election and leader monitoring. The minimal deprecations and refactoring needed to manage the dependency flow. I am only commenting near the code blocks to add injection. Should I continue in this PR for those sections or should I try to finish and merge this and return to refactor the design for those two code paths (MasterMain and HighAvailabilityServicesUtil).

Checklist

Andyz26 commented 5 months ago

@mcowgill-stripe overll lgtm. I see you still have some todo/comments so ping me when you think this is ready for merge.

github-actions[bot] commented 5 months ago

Test Results

533 tests  +13   527 :white_check_mark: +13   7m 52s :stopwatch: -15s 139 suites + 4     6 :zzz: ± 0  139 files   + 4     0 :x: ± 0 

Results for commit 8744c4ef. ± Comparison against base commit d773089b.

This pull request removes 1 and adds 14 tests. Note that renamed tests count towards both. ``` io.mantisrx.server.master.store.KeyValueStoreTest ‑ testUpsertOrdered ``` ``` io.mantisrx.extensions.dynamodb.DynamoDBClientSingletonTest ‑ highAvailabilityServices io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ becomesCurrentLeader io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ respectsExistingLock io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ restartsOnFailures io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ shutdownAsLeader io.mantisrx.extensions.dynamodb.DynamoDBLeaderElectorTest ‑ shutdownNotAsLeader io.mantisrx.extensions.dynamodb.DynamoDBMasterMonitorTest ‑ getCurrentLeader io.mantisrx.extensions.dynamodb.DynamoDBMasterMonitorTest ‑ runShutdown io.mantisrx.extensions.dynamodb.DynamoDBStoreTest ‑ testInsertAndDelete io.mantisrx.extensions.dynamodb.DynamoDBStoreTest ‑ testInsertAndDeleteMoreThan25 … ```

:recycle: This comment has been updated with latest results.

mcowgill-stripe commented 5 months ago

@Andyz26 - I've broken something in container tests it appears. It's not obvious to me what it is yet. I am trying to slowly walk things backwards until I can make this test pass again. If you have a suggestion that would probably accelerate me. If not I'll keep trying to bisect the change.

Regarding the TODO's this PR had reached a size where I wanted to pause and check in. We could merge these changes and open a new PR to complete those TODOs (assuming I fix the test issue above). Or I could finish proposing the change to enable injection for the high availability services.

Do you have a preference on the continuing this PR or trying to reach a merge point and separating into a new PR?

mcowgill-stripe commented 5 months ago

I am not sure if I ran into a class erasure issue with the naming or not. After finding that this commit passed locally, I chose to rename the KeyValueStore to IKeyValueStore to reduce that possibility.

Andyz26 commented 5 months ago

@Andyz26 - I've broken something in container tests it appears. It's not obvious to me what it is yet. I am trying to slowly walk things backwards until I can make this test pass again. If you have a suggestion that would probably accelerate me. If not I'll keep trying to bisect the change.

Regarding the TODO's this PR had reached a size where I wanted to pause and check in. We could merge these changes and open a new PR to complete those TODOs (assuming I fix the test issue above). Or I could finish proposing the change to enable injection for the high availability services.

Do you have a preference on the continuing this PR or trying to reach a merge point and separating into a new PR?

IMO, as long as nothing is broken, it's up to you whether you prefer one PR or multiple. For test errors, the test container can be debugged if you put a breakpoint or something and examine the error in your local Docker app, too.

crioux-stripe commented 5 months ago

Kicking the CI workflow here to see where we're at.

We debugged the failing container test last week. It was an unfortunate issue of our linter bullying Mike into making a variable final that shouldn't have been. Meeting with Kevin here at Stripe shortly to hand-off merging this. It should be a good (though large) onboarding task for Mantis infrastructure.

mcowgill-stripe commented 4 months ago

@Andyz26 - I've removed the class based injection and reverted to a simple string. Locally, I have the build passing. Are there other places I need to make changes?

Andyz26 commented 4 months ago

@Andyz26 - I've removed the class based injection and reverted to a simple string. Locally, I have the build passing. Are there other places I need to make changes?

hi @mcowgill-stripe did you get a chance to look at the discussion thread from @sundargates. He made some proposals regarding these interfaces.

crioux-stripe commented 4 months ago

@Andyz26 @sundargates How would we feel about merging this, and we can have someone take an immediate task to do the interface refactor? This is the last blocker for us going to production before the end of Q2. We can work on the prod rollout and the refactor at the same time.

Andyz26 commented 4 months ago

@Andyz26 @sundargates How would we feel about merging this, and we can have someone take an immediate task to do the interface refactor? This is the last blocker for us going to production before the end of Q2. We can work on the prod rollout and the refactor at the same time.

sure that's fine. Let me try a snapshot build to do some validation internally.

crioux-stripe commented 4 months ago

The snapshot error is interesting:

https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

Andyz26 commented 4 months ago

The snapshot error is interesting:

https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

this seems to be only happening on forked repo's branch. I've added both of you to the collaborator list so you can try push ing the branch in the main repo and snapshot build from there.

kmg-stripe commented 4 months ago

The snapshot error is interesting: https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

this seems to be only happening on forked repo's branch. I've added both of you to the collaborator list so you can try push ing the branch in the main repo and snapshot build from there.

@Andyz26 - @mcowgill-stripe and @crioux-stripe are out for the next few weeks. to ensure we make progress on this, i am stepping in to help out here. could you please add me to the collaborator list? thanks!

Andyz26 commented 4 months ago

The snapshot error is interesting: https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

this seems to be only happening on forked repo's branch. I've added both of you to the collaborator list so you can try push ing the branch in the main repo and snapshot build from there.

@Andyz26 - @mcowgill-stripe and @crioux-stripe are out for the next few weeks. to ensure we make progress on this, i am stepping in to help out here. could you please add me to the collaborator list? thanks!

@kmg-stripe could you ask cody to send me an email to verify your github id? Sorry just to be cautious as the write permission is pretty powerful.

kmg-stripe commented 4 months ago

The snapshot error is interesting: https://github.com/Netflix/mantis/blob/master/build.gradle#L163

[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.
[Fatal Error] :1:1: Content is not allowed in prolog.

FAILURE: Build completed with 36 failures.

1: Task failed with an exception.
-----------
* Where:
Build file '/home/runner/work/mantis/mantis/build.gradle' line: 163

* What went wrong:
Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.
> org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

* Try:
> Run with --debug option to get more log output.

* Exception is:
org.gradle.api.tasks.TaskExecutionException: Execution failed for task ':mantis-discovery-proto:printReleasedArtifact'.

Looking into this.

this seems to be only happening on forked repo's branch. I've added both of you to the collaborator list so you can try push ing the branch in the main repo and snapshot build from there.

@Andyz26 - @mcowgill-stripe and @crioux-stripe are out for the next few weeks. to ensure we make progress on this, i am stepping in to help out here. could you please add me to the collaborator list? thanks!

@kmg-stripe could you ask cody to send me an email to verify your github id? Sorry just to be cautious as the write permission is pretty powerful.

@Andyz26 understood. i just sent cody a message. he is OOO for a few weeks, so hoping he happens to see it while he is out.

crioux-stripe commented 1 month ago

Closing, as @kmg-stripe merged in a version of @mcowgill-stripe 's work in https://github.com/Netflix/mantis/pull/682