FC-1425: Fix/raft startup missing ledger

zonotope commented 2 years ago

This patch adds a :v2-migrate startup command that incorporates the existing :reindex command as well as a new process to rewrite raft log files to replace all instances of :new-db entries with :new-ledger, and all entries of :initialized-db entries with :initialized-ledger. This migration will allow ledgers created before we converted the new-db/initialized-db state machine events to new-ledger/initialized-ledger to start up successfully with the latest Fluree ledger.

I spent a good amount of time trying to automatically detect when this migration should be run so we can give a more helpful error message to the user, but any solution I could come up with involved validating the raft logs at each start up, which would be too slow, or would hold up the ledger group startup process.

I also tried to hook into our existing upgrade process for data format changes, but this also did not work because the raft group starts in a different thread and, by the time the logs were migrated, the raft group already encountered the error and shut down.

The only changes relevant to FC-1425 are in the fluree.db.server and fluree.db.ledger.upgrade.raft-db-ledger namespaces. Everything else is cleanup refactors that I did while doing my initial experiments and are not essential. I left those changes in since I had already done them, but I'm happy to back them out and put them in a different PR if anyone thinks that's better.

bplatz commented 2 years ago

I still wonder if we should support both the 'db' and 'ledger' commands here. It seems not only easier (although the work is done now), but a better user experience. With all new commands coming in with the 'ledger', it means future updates don't need to support the old 'db' command names, and there would also be no need to upgrade.

Part of the issue I'm concerned about is that the user must run this upgrade on every server in a raft group. That could require a lot of work, and it also requires additional down time - it cannot be done while running.

At minimum if we detect an upgrade is needed, we should immediately exit and notify the user they must use :v2-upgrade, they need to first shut down all servers, and then run it on every server. I didn't see where we would shut down if needed and notify the user of this (I might have missed it somewhere though). I think if this stays, we should at least add this message.

bplatz commented 2 years ago

Sorry, didn't mean to close!

zonotope commented 2 years ago

I still wonder if we should support both the 'db' and 'ledger' commands here. It seems not only easier (although the work is done now), but a better user experience. With all new commands coming in with the 'ledger', it means future updates don't need to support the old 'db' command names, and there would also be no need to upgrade.

Part of the issue I'm concerned about is that the user must run this upgrade on every server in a raft group. That could require a lot of work, and it also requires additional down time - it cannot be done while running.

yeah, that's kind of what i was getting at on the call. i had initially tried to do this on startup using the existing upgrade process, but that didn't work because we have to update the raft logs in place just as the servers are reading them in to rebuild the state. i couldn't figure out how to do it without shutting everything down first.

At minimum if we detect an upgrade is needed, we should immediately exit and notify the user they must use :v2-upgrade, they need to first shut down all servers, and then run it on every server. I didn't see where we would shut down if needed and notify the user of this (I might have missed it somewhere though). I think if this stays, we should at least add this message.

i also couldn't figure out a good way to detect an upgrade. i think we'd have to either scan all the raft files before any startup, which i think would be unacceptably slow for large databases, especially since the case of needing a migration would be much less frequent than the normal case of not needing one.

We could propagate errors from the raft event loop up through the tx-group object to the top level server process where we could then catch any error and shut down if necessary. That may be worthwhile to do in general as now there isn't any way for the server to know when the event loop has an error, but I now think there still might be more compelling arguments for just adding back the :new-db/:initialized-db events and permissively accepting both the *-db and *-ledger variants in the state machine.

bplatz commented 2 years ago

just adding back the :new-db/:initialized-db events and permissively accepting both the -db and -ledger variants in the state machine

I think so. And provided all new commands are using *-ledger we can drop the *-db with a high degree of safety in the next major version.

Re: catching event loop errors, I think this is something we should definitely get in - this has bitten us in the past and I expect will again. I think this should be something we consider a terminal error... and if there are 'safe' errors in the event loop they can always be caught and logged vs bubbling up. Might warrant a discussion though to go through possible scenarios to ensure this behavior is what makes sense.

zonotope commented 2 years ago

ok, i've added back the -db events alongside the -ledger ones and removed the raft log migration. the ledger should start up seamlessly whether -db or -ledger events are in the logs.

I think so. And provided all new commands are using *-ledger we can drop the *-db with a high degree of safety in the next major version.

this still doesn't rewrite the log files, so we'd have to leave these events in forever unless we have a separate migration to re-write the logs in the background as well, at least for ledgers that stick around that long.

Re: catching event loop errors, I think this is something we should definitely get in - this has bitten us in the past and I expect will again. I think this should be something we consider a terminal error... and if there are 'safe' errors in the event loop they can always be caught and logged vs bubbling up. Might warrant a discussion though to go through possible scenarios to ensure this behavior is what makes sense.

i'll write up a separate ticket for this.

cap10morgan commented 2 years ago

this still doesn't rewrite the log files, so we'd have to leave these events in forever unless we have a separate migration to re-write the logs in the background as well, at least for ledgers that stick around that long.

Are we sure it wouldn't be better to let 2.0 be the clean break for this stuff? Especially since the upgrade code has already been written.

jakep36 commented 2 years ago

Users upgrading from 1 to 2 still need to run :reindex while the ledger servers are down, correct? If so, I wonder if re-writing the raft files is significant in that case. Our Nexus ledgers are probably among the few out there that are post-TSPO changes, but pre 2.0 db -> ledger changes.

zonotope commented 2 years ago

Users upgrading from 1 to 2 still need to run :reindex while the ledger servers are down, correct? If so, I wonder if re-writing the raft files is significant in that case. Our Nexus ledgers are probably among the few out there that are post-TSPO changes, but pre 2.0 db -> ledger changes.

I think the issue is that the reindex process only has to be run on the leader, while the raft log migration needs to be run on all the servers because we can't rely on the raft group to communicate the changes to the historical logs. This makes me realize that we also need to rethink the process of combining these migrations into one v2-upgrade command because these different migrations need to be run in different contexts.

jakep36 commented 2 years ago

I think the issue is that the reindex process only has to be run on the leader

I didn't realize that could be run on just the leader. Do you have to bring down the leader to do it? If so, won't another node get elected leader, and how does that resolve? (sorry if this is tangential to the main question, just didn't know where else to ask)

zonotope commented 2 years ago

I think the issue is that the reindex process only has to be run on the leader

I didn't realize that could be run on just the leader. Do you have to bring down the leader to do it? If so, won't another node get elected leader, and how does that resolve? (sorry if this is tangential to the main question, just didn't know where else to ask)

Actually, I think I was mistaken there. I thought the leader made the index files available to the rest of the raft group alongside the block files, but it looks like it's just the block files themselves and each server maintains their own index from those. If that's the case, we do have to run the reindex process on all the servers too.

bplatz commented 2 years ago

A few comments re: prior comments.

Raft logs are not permanent, they are temporary... like hours temporary. If Fluree itself is the only one issuing the respective commands/messages, by issuing *-ledger starting now it means that we can loose support for *-db safely in any future version.

The current code works, but is only part of the solution as it would have to be run on all servers while all were down. So we either have more work to do (at minimum with some good documentation and customer communication), or if we simply support both commands for now then drop support later as per my above suggestion.

@jakep36 As it stands right now I believe it still needs to be run on every server. It could be run over raft but that's not how it was implemented as far as I know. To do it over raft we'd have a new leader, at startup, check if they are running and old version then issue some sort of :maintenance command across the network so the other servers did not yet accept requests, and then it could reindex and broadcast the new files. I think that would be a lot of work... debatable in my opinion if it is worth it for hopefully a one-time thing. I also think with JSON-LD we'll really be encouraging new folks to not use the old formats - and if so we'll need some sort of rebuild process anyhow so perhaps reindexing can even go away if we are only allowing "re-creating" a ledger in the new formats. I'd prefer not to mess with what is there until we get more clarity on that situation.

zonotope commented 2 years ago

Raft logs are not permanent, they are temporary... like hours temporary. If Fluree itself is the only one issuing the respective commands/messages, by issuing *-ledger starting now it means that we can loose support for *-db safely in any future version.

Thanks for clearing this up @bplatz. I think this settles things. I have already backed out the migration and added support for the -db events alongside the -ledger ones, and I'll add a TODO to remove them in a later version. I will also merge this branch early tomorrow if there are no more objections.

fluree / ledger

FC-1425: Fix/raft startup missing ledger #176