UMT: archive node runner notes re: hard fork documentation/process improvements

jrwashburn commented 6 months ago

Preliminary Checks

[X] This issue is not a duplicate. Before opening a new issue, please search existing issues: https://github.com/MinaProtocol/mina/issues
[X] This issue is not a question, feature request, RFC, or anything other than a bug report. Please post those things in GitHub Discussions: https://github.com/MinaProtocol/mina/discussions

Summary

Here are notes on where we can improve the documentation, process, etc., for the next test. I tried to include specific references and succeeded sometimes. I'm sorry they are not as organized or concise as I would like and do not always have helpful referents. I thought it better to get it all down somewhere before I lose/forget.

It would be helpful to arrange things in one main document, perhaps with links that can be updated in near real-time to point to builds, etc., as they are released. It was quite confusing and even now trying to update these notes, I'm not sure which document I'm referring to because it seems like there were so many different documents based on when/where information was released.

More clarity is needed on which ledger JSON file to use, with links to specific files or directories (e.g. mainnet.json vs. a GitHub link vs. /var/lib/config_xyz.json, etc.). Note that files under paths like /var/lib/config_x are overwritten across builds, so it is important to direct the operator to copy the file, since it is needed across builds. https://discord.com/channels/484437221055922177/1204059560684552253/1210002566599934012 I got bounced between a few different sources and am still not sure what the best answer should be. I ended up using /var/lib/coda/config_2025a732.json, but then had to downgrade at some point to get it back because I didn't keep a copy. https://discord.com/channels/484437221055922177/1204059560684552253/1210304380683816981 https://discord.com/channels/484437221055922177/1204059560684552253/1210221864157188166 https://discord.com/channels/484437221055922177/1204059560684552253/1210154910629363733
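
One way to avoid losing the file on upgrade would be to copy it somewhere package installs do not touch before moving to a new build. A minimal sketch: the function name and the backup directory are hypothetical, and the config path in the usage line is just the one mentioned above.

```shell
# backup_config: hypothetical helper that copies a ledger config file to a
# directory outside the package-managed paths, so an upgrade cannot remove it.
backup_config() {
  local src="$1" dest_dir="$2"
  if [ ! -f "$src" ]; then
    echo "config file not found: $src" >&2
    return 1
  fi
  # -p keeps the original timestamps, which helps tell builds apart later.
  mkdir -p "$dest_dir" && cp -p "$src" "$dest_dir/"
}
```

For example, `backup_config /var/lib/coda/config_2025a732.json "$HOME/mina-config-backup"` could be run before installing the next build.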

Incorrect parameter for the node: --mainnet-blocks-bucket should be --blocks-bucket. https://discord.com/channels/484437221055922177/1204059560684552253/1212168290877706280

In the node startup instructions: replace --log-json true with --log-json in the mina-archive command (edited: replace rather than remove), and remove --internal-tracing and --file-log-rotations 500 from the node command.

On the archive node instructions, I would expect to also include this for the daemon for trustless archive upgrade participants: --upload-blocks-to-gcloud true and env vars for GCLOUD_KEYFILE, NETWORK_NAME, and GCLOUD_BLOCK_UPLOAD_BUCKET
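
As a rough illustration of what that section of the instructions might contain, here is a sketch of the environment setup. All three values are placeholders (assumptions), and the daemon line is left as a comment because the rest of the daemon flags depend on the operator's setup.

```shell
# Placeholder values (assumptions); substitute your own key file, network name, and bucket.
export GCLOUD_KEYFILE="/path/to/keyfile.json"
export NETWORK_NAME="your-network-name"
export GCLOUD_BLOCK_UPLOAD_BUCKET="your-bucket-name"

# Then add the upload flag to the existing daemon command, for example:
#   mina daemon --upload-blocks-to-gcloud true <your other daemon flags>
```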

Document which port rosetta needs to point to on the node assuming defaults (graphql port) -- this should have been obvious to me but for some reason I didn't read the parameter name and was left wondering what port to provide.

Rosetta requires an undocumented environment variable, MINA_ROSETTA_MAX_DB_POOL_SIZE, and the resulting error is not obvious.
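
Until the variable is documented, a preflight check in the startup script could make the failure obvious. A minimal sketch: `require_env` is a hypothetical helper, not part of any Mina tooling.

```shell
# require_env: hypothetical preflight helper that fails with a clear message
# if any named environment variable is unset, empty, or not exported.
require_env() {
  local name missing=0
  for name in "$@"; do
    # printenv only sees exported variables, which is what a child process gets.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `require_env MINA_ROSETTA_MAX_DB_POOL_SIZE` before launching Rosetta would stop the script early with a readable message. The same helper could cover the GCLOUD_* variables for the daemon.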

Directions posted need to explain the parameters for the replayer:
mina-replayer --archive-uri {db_connection_string} --input-file reference_replayer_input.json --output-file reference_replayer_output.json --checkpoint-interval 100
What is --input-file and how is it created?

Replayer does not actually create --output-file when specified. https://github.com/MinaProtocol/mina/issues/15260

The command on this page (https://docs2-git-major-upgrade-minadocs.vercel.app/berkeley-upgrade/migrating-archive-database-to-berkeley) is invalid (extra closing quote):
jq '.ledger.accounts' mainnet.json | jq '{genesis_ledger: {accounts: .}}' > replayer_input_config.json"
Also, what is the source for mainnet.json?
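
For reference, a corrected form of that command might look like the sketch below. The wrapper function name is hypothetical. It folds the two jq calls into one filter, which should produce the same output for a single JSON input, and it assumes jq is installed.

```shell
# make_replayer_input: hypothetical wrapper around the documented jq step,
# with the stray closing quote removed. Wraps .ledger.accounts from the
# ledger file in {genesis_ledger: {accounts: ...}}.
make_replayer_input() {
  local ledger_json="$1" out_json="$2"
  jq '{genesis_ledger: {accounts: .ledger.accounts}}' "$ledger_json" > "$out_json"
}
```

Usage would be `make_replayer_input mainnet.json replayer_input_config.json`, once the source of mainnet.json is settled.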

Clarify migration / replayer incremental runs https://discord.com/channels/484437221055922177/1204059560684552253/1212508873089351701

Install dependencies on jq, etc. https://github.com/MinaProtocol/mina/issues/15257

Version conflict of mina-replayer: need the version with --migration-mode. https://discord.com/channels/484437221055922177/1204059560684552253/1212845720374214808

On https://docs2-git-archivemigration-minadocs.vercel.app/berkeley-upgrade/migrating-archive-database-to-berkeley the mina-replayer --output-file option is referred to as --output-config: https://docs2-git-archivemigration-minadocs.vercel.app/berkeley-upgrade/migrating-archive-database-to-berkeley#how-to-verify-a-successful-migration

Knowing when to transition from stage 2 to stage 3 of migration -- e.g. when to add --fork-state-hash was not very obvious. I kept running stage 2 expecting it to finish. (Reading ahead finally identified the solution for me.) https://discord.com/channels/484437221055922177/1204059560684552253/1212879528955875390
Stage/Phase language may be confusing; perhaps Stage/Step or Phase/Step would be clearer? Numbering in the examples is off under "6. Stage 2: remainder migration" (it refers to 5.a instead of 6.a, which made it difficult to reference).

Loading genesis block file on archive can take hours on a remote database https://github.com/MinaProtocol/mina/issues/15207

Combine zkapp_tables.sql with create_schema.sql https://discord.com/channels/484437221055922177/1204059560684552253/1212805656638398504 Also note that there can be schema confusion because most release notes include a reference to an archive schema so we had several different "archive schema" links to consider during the migration.

Mainnet issue - will be a problem for mainnet replayer https://github.com/MinaProtocol/mina/issues/15211

node stuck in catchup https://github.com/MinaProtocol/mina/issues/15206

Separate install package for mina-berkeley-migration: https://discord.com/channels/484437221055922177/1204059560684552253/1212477934980440156 Error seen: trying to overwrite '/usr/local/bin/mina-archive', which is also in package mina-archive 1.0.1umt-stop-slot-992168e

Important to start archive before the node when bringing up new forked network -- if not, archive can miss the first block and it is not stored in GCS. https://github.com/MinaProtocol/mina/issues/15261
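
A startup script could enforce that ordering by waiting until the archive process accepts connections before launching the node. A minimal sketch, assuming bash (it relies on bash's /dev/tcp redirection); `wait_for_port` is a hypothetical helper, and the host and port in the usage note are assumptions about the operator's setup.

```shell
# wait_for_port: hypothetical helper that polls until HOST:PORT accepts TCP
# connections, or gives up after TIMEOUT seconds (default 60).
wait_for_port() {
  local host="$1" port="$2" timeout="${3:-60}" waited=0
  # The subshell opens the connection and closes it again on exit.
  until (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
    waited=$((waited + 1))
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out waiting for $host:$port" >&2
      return 1
    fi
    sleep 1
  done
}
```

For example, `wait_for_port 127.0.0.1 3086 120 && <start the node>` would hold the node back until the archive is listening, assuming the archive was configured to listen on port 3086.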

Steps to Resolve this Issue

n/a

mrmr1993 commented 5 months ago

I've tried to break this down into sub-issues; hopefully these are representative.

[ ] Process: 1 doc with all the info in it
[ ] Process: Explicit ledger file links (GitHub?)
[x] Documentation: incorrect parameter --mainnet-blocks-bucket -> --blocks-bucket
[ ] Documentation: --log-json true -> --log-json
[ ] Documentation: remove --internal-tracing and --file-log-rotations 500
[ ] Documentation: --upload-blocks-to-gcloud true and env vars for GCLOUD_KEYFILE, NETWORK_NAME, and GCLOUD_BLOCK_UPLOAD_BUCKET
[ ] Documentation: rosetta node defaults
[ ] Documentation: required rosetta variable MINA_ROSETTA_MAX_DB_POOL_SIZE
[x] Documentation: what is --input-file for mina-replayer
[ ] Investigate/fix: Replayer doesn't create --output-file: #15260
[x] Documentation: fix invalid jq command in docs: https://docs2-git-major-upgrade-minadocs.vercel.app/berkeley-upgrade/migrating-archive-database-to-berkeley
[x] Documentation: where does mainnet.json come from in https://docs2-git-major-upgrade-minadocs.vercel.app/berkeley-upgrade/migrating-archive-database-to-berkeley jq command
[x] Documentation: incorrect flag --output-config -> --output-file (https://docs2-git-archivemigration-minadocs.vercel.app/berkeley-upgrade/migrating-archive-database-to-berkeley#how-to-verify-a-successful-migration)
[x] Documentation: when to use --fork-state-hash (stage 2 -> stage 3) (https://discord.com/channels/484437221055922177/1204059560684552253/1212879528955875390)
[ ] Documentation: stage/phase/step terminology is potentially unclear
[x] Performance fix: Loading genesis block on archive can take hours (https://github.com/MinaProtocol/mina/issues/15207)
[ ] Documentation: need to always share zkapp_tables.sql and create_schema.sql together (with Github links)
[ ] Bug: OOM in hosted postgres #15211
[ ] Bug: stuck in catchup against UMT #15206
[ ] Bug: parallel installs clobber the same file (https://discord.com/channels/484437221055922177/1204059560684552253/1212477934980440156). (Probably document not to do that?)
[ ] Documentation: bring up archive nodes before new network starts #15261.

jrwashburn commented 5 months ago

This is much better! Sorry I didn't spend more time to simplify it initially.

One comment re:

Bug: parallel installs clobber the same file

I think the problem is that we needed different components from different installs at the same time, e.g. we needed to stay on the pre-fork archiver but run tooling that wasn't in that build. (Or something like that.) So I don't think it's enough to just document not to do that; something needs to change to separate the applications in the builds.

jrwashburn commented 5 months ago

[ ] Documentation: bring up archive nodes before new network starts (UMT first block after hard fork is not stored to GCS and may be missed if archive is not started first, #15261).

I think the preferred solution to this would be that the block would be sent to the precomputed blocks storage. There are probably many reasons why that is a better idea, and I think that after k blocks this will be lost otherwise?
