cosmos / iavl

Merkleized IAVL+ Tree implementation in Go
Apache License 2.0
423 stars 264 forks source link

Restoring state-sync snapshot takes a long time #698

Open webmaster128 opened 1 year ago

webmaster128 commented 1 year ago

I recently started to play with using state-sync to create a fresh node in the Nois testnet. There we observed that applying the last chunk of the snapshot takes approximately 1 hour and requires something between 13 and 14 GB of memory. As many community members consider this unexpectedly long, I though it's worth debugging what's going on here.

Network

Snapshot

System and utalization

Logs

The following logs show how the snapshot is downloaded easily, the fist 17 chunks are applied within 1 minute each and the last chunk takes 62 minutes to apply.

Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="�Bh�!/\x16/��T�b�d}o\x06��oO�ٛ\f\x1b(n���" height=3404000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="�ܸ\vU\x15���>Vx��n�ʜ|���\x1a\x18�����V�F" height=3402000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash=" K�֍�T�\x11�u���\x1bUch�������w�6N+�\v" height=3400000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="���w!�7�֗�e��bDBd'�=\x1aJ�F��\b�4�T" height=3398000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="\x01ƹ\\'��\x17+�f�X�i�Y�,�S,p:3~Ҧ�g\x1b�" height=3396000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="��,�\x02��\aj�\t\x7f�����7��G4\x16�m�01ʖȟ" height=3236000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="q�QI�}\x04_D�����S�{�I�\x19�dA� *�-g\n�" height=3234000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="����{\x02\"��6sm\x01�p\x05~҂�f+\x00O4�&\x04��c�" height=2508000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="O�(�t��B��FI\f*3��J��v�\x01\x1dIz8��\x1e��" height=3400500 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="\x18�p�����~�f �@�q��mב�\x17��ֈ�[�I�" height=3397500 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="��\a�Df\x03c\x0f����ԑ�\x11\x0e)\x1d>�I���J\x1c�\x1c�h" height=2914000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF Discovered new snapshot format=1 hash="=[aO�+ʤ�$���MA����!-:\b�D��� ػ�" height=2904000 module=statesync
Jan 22 19:47:57 syncer noisd[7744]: 7:47PM INF VerifyHeader hash=38608248F22723878BC0C396849491FADCDA9FD620E10CA47DCA93BC2E007046 height=3404001 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF VerifyHeader hash=A0A5D1A7BF2AF5B58E6E8414A0830E1DD105E33E7EF471452DE8058DA9EC535A height=3404002 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Offering snapshot to ABCI app format=1 hash="�Bh�!/\x16/��T�b�d}o\x06��oO�ٛ\f\x1b(n���" height=3404000 module=statesync
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Snapshot accepted, restoring format=1 hash="�Bh�!/\x16/��T�b�d}o\x06��oO�ٛ\f\x1b(n���" height=3404000 module=statesync
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Fetching snapshot chunk chunk=2 format=1 height=3404000 module=statesync total=18
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Fetching snapshot chunk chunk=0 format=1 height=3404000 module=statesync total=18
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Fetching snapshot chunk chunk=1 format=1 height=3404000 module=statesync total=18
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Fetching snapshot chunk chunk=3 format=1 height=3404000 module=statesync total=18
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF VerifyHeader hash=1152E87653548B640646D894483B3177A923A7B91BD687F0BA585DBCED8A1E83 height=3404000 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Header has already been verified hash=38608248F22723878BC0C396849491FADCDA9FD620E10CA47DCA93BC2E007046 height=3404001 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Header has already been verified hash=A0A5D1A7BF2AF5B58E6E8414A0830E1DD105E33E7EF471452DE8058DA9EC535A height=3404002 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Header has already been verified hash=38608248F22723878BC0C396849491FADCDA9FD620E10CA47DCA93BC2E007046 height=3404001 module=light
Jan 22 19:47:58 syncer noisd[7744]: 7:47PM INF Header has already been verified hash=1152E87653548B640646D894483B3177A923A7B91BD687F0BA585DBCED8A1E83 height=3404000 module=light
Jan 22 19:48:00 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=4 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:02 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=5 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:02 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=0 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:04 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=6 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:04 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=1 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:04 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=2 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:06 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=7 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:06 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=3 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:08 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=8 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:08 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=4 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:10 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=9 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:10 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=5 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:12 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=10 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:13 syncer noisd[7744]: 7:48PM INF Applied snapshot chunk to ABCI app chunk=6 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:14 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=11 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:16 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=12 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:18 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=13 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:20 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=14 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:22 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=15 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:24 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=16 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:25 syncer noisd[7744]: 7:48PM INF Fetching snapshot chunk chunk=17 format=1 height=3404000 module=statesync total=18
Jan 22 19:48:55 syncer noisd[7744]: 7:48PM INF Discovered new snapshot format=1 hash="�b%�\x02t����\\\x18�۾��D�\x12-���h���t���" height=3300000 module=statesync
Jan 22 19:49:23 syncer noisd[7744]: 7:49PM INF Applied snapshot chunk to ABCI app chunk=7 format=1 height=3404000 module=statesync total=18
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="\x1c\n�w�E2.T�5�ٽX\x0f`�����iV��ҶZ=\x06\n" height=3348000 module=statesync
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="��\fd�H\x17(��3�5\x04ME�|\x1b%�à�\x04-G��\x1cǫ" height=3270000 module=statesync
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="�u���.S\t��>\x04�pk����D]��Q\x14W\b�G�M�" height=3268000 module=statesync
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="�\f\x1c\x1c���W̪Kf}�^�����0�5�!��n\x13�d\x11" height=3266000 module=statesync
Jan 22 19:49:27 syncer noisd[7744]: 7:49PM INF Discovered new snapshot format=1 hash="��\x14��ݡ\x10�\x1fǺ@_\x02l��y^�\x17\x1d,\x14��Q�'�k" height=3264000 module=statesync
Jan 22 19:49:32 syncer noisd[7744]: 7:49PM INF Applied snapshot chunk to ABCI app chunk=8 format=1 height=3404000 module=statesync total=18
Jan 22 19:49:34 syncer noisd[7744]: 7:49PM INF Applied snapshot chunk to ABCI app chunk=9 format=1 height=3404000 module=statesync total=18
Jan 22 19:49:59 syncer noisd[7744]: 7:49PM INF Applied snapshot chunk to ABCI app chunk=10 format=1 height=3404000 module=statesync total=18
Jan 22 19:50:33 syncer noisd[7744]: 7:50PM INF Applied snapshot chunk to ABCI app chunk=11 format=1 height=3404000 module=statesync total=18
Jan 22 19:51:12 syncer noisd[7744]: 7:51PM INF Applied snapshot chunk to ABCI app chunk=12 format=1 height=3404000 module=statesync total=18
Jan 22 19:51:50 syncer noisd[7744]: 7:51PM INF Applied snapshot chunk to ABCI app chunk=13 format=1 height=3404000 module=statesync total=18
Jan 22 19:52:47 syncer noisd[7744]: 7:52PM INF Applied snapshot chunk to ABCI app chunk=14 format=1 height=3404000 module=statesync total=18
Jan 22 19:53:40 syncer noisd[7744]: 7:53PM INF Applied snapshot chunk to ABCI app chunk=15 format=1 height=3404000 module=statesync total=18
Jan 22 19:54:26 syncer noisd[7744]: 7:54PM INF Applied snapshot chunk to ABCI app chunk=16 format=1 height=3404000 module=statesync total=18
Jan 22 20:08:27 syncer noisd[7744]: 8:08PM INF Discovered new snapshot format=1 hash="\"�\r-�d�\b�>�)������\r\x04\x11��\a�\x11�N��ȡ" height=3392000 module=statesync
Jan 22 20:08:27 syncer noisd[7744]: 8:08PM INF Discovered new snapshot format=1 hash="��Uޔ�s��-\x03���\x1bV\x1a��㢵!�T\x05L���\"\x7f" height=3388000 module=statesync
Jan 22 20:08:27 syncer noisd[7744]: 8:08PM INF Discovered new snapshot format=1 hash="HF�}\x13ˉ���uh=�����J�v��҂�۰\a!@Q" height=3384000 module=statesync
Jan 22 20:56:09 syncer noisd[7744]: 8:56PM INF Applied snapshot chunk to ABCI app chunk=17 format=1 height=3404000 module=statesync total=18
Jan 22 20:56:09 syncer noisd[7744]: 8:56PM INF Verified ABCI app appHash="�\\�oEا(b�\af���fԅ��\x1e�����Y�m���" height=3404000 module=statesync
Jan 22 20:56:09 syncer noisd[7744]: 8:56PM INF Snapshot restored format=1 hash="�Bh�!/\x16/��T�b�d}o\x06��oO�ٛ\f\x1b(n���" height=3404000 module=statesync

Data/wasm directory

Profiling

Some initial profiling data provided by @jhernandezb shows a lot of compating activity. Maybe he can elaborate more on the profiling of things.

Screenshot_2023-01-21_at_10 27 14_AM Screenshot_2023-01-21_at_10 27 39_AM

Other notes

My gut feeling here is that at a certain number of database elements, things become slow. But I am far from a Go profiling expert or database expert and don't know how to debug this further. For us it is not a big deal right now and we can try to use less KV elements for the app. But maybe this is a good chance to prevent other mainnets running into problems.

PolkachuIntern commented 1 year ago

Our state-sync scripts run every Sunday, so this issue is rather timely. Nois testnet state-sync took a bit over an hour today. In contrast, Stargaze mainnet state-sync took about 10 minutes. I am not knowledgable enough to articulate the reasons behind it. As Simon mentioned, Evmos and Sei state-sync is also slow and at times unreliable.

kaisbaccour commented 1 year ago

Would it make sense to make the inventory of the number of application.db KV entries for all these networks and some other very busy networks to say if the rule stands or no?

jhernandezb commented 1 year ago

Adding profiles: cpu.pb.gz mem.pb.gz

Binary can be built using: https://github.com/noislabs/full-node/tree/main

One thing to mention is that stargaze still has fast node disabled by default (will be changed to enabled in our next upgrade).

Seems that fast node is contributing to more cpu cycles and garbage collection

tac0turtle commented 1 year ago

there are a couple issue in iavl to assist with this. Here is one pr that aims to increase the speed. https://github.com/cosmos/iavl/pull/664.

This issue could be to how large your current state is? ill transfer this issue to iavl so it can be focused on where the actual problem may lie. Lots of the compacting is being cleaned up as we speak.

tac0turtle commented 1 year ago

with iavlv1 this should be significantly faster. Would you want to test with this version?