Cardinal-Cryptography / aleph-node-issues

Issue tracker for aleph-node related problems.

Node crashing after upgrade to 13.3 #15

Closed k3vmcd closed 2 months ago

k3vmcd commented 2 months ago

Did you read the documentation and guides?

Is there an existing issue?

Description of the problem

Since upgrading to 13.3, my validator node has been missing sessions, crashing frequently, and locking up to the point that it won't even accept an SSH connection (e.g., it crashed during session 85418 and missed the next 5 sessions until I hard-reset the VM). Between reboots, the node validates successfully. Occasionally it recovers on its own: it disconnected and missed session 85487, self-recovered in session 85491, and then ran fine without intervention until it missed blocks in session 85557, crashed, and missed session 85561.

In the logs I see a lot of failed verifications, after which the node bans all its peers and disconnects from the network. This appears to be what eventually causes the node to crash.

2024-04-21 01:47:55.135  WARN tokio-runtime-worker sync: 💔 Verification failed for block 0xf0abcd769c4190c1290fb5ad87f19cec790bbb4d4afc0b83bb22a537b5b7c09a: "Header 0xf0abcd769c4190c1290fb5ad87f19cec790bbb4d4afc0b83bb22a537b5b7c09a rejected>
2024-04-21 01:47:55.136  WARN tokio-runtime-worker sync: 💔 Error importing block 0x92faffba1a1f22f277ba71708637bb4d37ba9244783abdc3f6ea94418073f764: block has an unknown parent
2024-04-21 01:47:55.136  WARN tokio-runtime-worker sync: 💔 Error importing block 0xc78164fab4bc54804b1d32a22a3ea994ad3f8ed6d3f163401b0877c27444e92f: block has an unknown parent

2024-04-21 01:48:30.905 WARN tokio-runtime-worker aleph-block-sync: Error handling request from PeerId("12D3KooWFi2pXEHAsujrfJ7E5ypAn8cgaKEdwnMgjyb8EJy1aznK"): forest error: block is too new.

0x991549ffa8edb1f55d4f79d52d50538d418429bb5897dee5c6c01a809ccaa2c1 received from (12D3KooWJhNF2g88D3zD6CL9fQNdpYQ92rrQ4kWomLHXfQ34tVoq): "Header 0x9915>
2024-04-21 01:48:52.604  WARN tokio-runtime-worker sync: 💔 Verification failed for block 0x991549ffa8edb1f55d4f79d52d50538d418429bb5897dee5c6c01a809ccaa2c1 received from (12D3KooWEF1Eo7uFZWdqFsTPP7CehpRt5NeXFwCe3157qpoU5aqd): "Header 0x9915>
2024-04-21 01:48:53.070  WARN tokio-runtime-worker peerset: Report 12D3KooWQZBB3BCWGVFRdJnDiFfJdemV7DdQb1i5URsApTkfSp5k: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.
2024-04-21 01:48:53.115  WARN tokio-runtime-worker peerset: Report 12D3KooWD1Y4vtBnJBBGXgdSYCco1zq8eUJwSPBiQbhjCefLPRuV: -2147483648 to -2147483648. Reason: Genesis mismatch. Banned, disconnecting.

During crashes where I have to hard-reset, the logs are unreadable due to the error:

error from daemon in stream: Error grabbing logs: invalid character '\x00' in string literal

This appears to happen when the Docker log is hard-interrupted (it is reported separately in Docker's issue tracker), and I had to manually remove the offending line from the log. This may indicate the node process was still hung when I hard-reset the VM.

In the attached log, session 85486 shows the errors that lead to disconnecting from all peers as the node enters the active session 85487. Session 85557, by contrast, simply stops with nothing noticeable in the log; Grafana shows a sudden disconnect or crash during the active session, from which the node doesn't recover, and it misses session 85561 until I hard-reboot. On restart it immediately bans a peer for a Genesis mismatch, then emits a huge block of "Pinned block cache limit reached. Evicting value" messages before it starts participating in validation again.

Blocks created in active sessions:

Session - Blocks Created - % Reward
85596 - 18 - 100%
85592 - 18 - 100%
85588 - 18 - 100%
85584 - 18 - 100%
85580 - 18 - 100%
85576 - 18 - 100%
85572 - 18 - 100%
85568 - 18 - 100%
85564 - 13 - 72.2%
85561 - 0 - 0%
85557 - 14 - 77.8%
85553 - 18 - 100%
85549 - 18 - 100%
85545 - 18 - 100%
85541 - 18 - 100%
85537 - 18 - 100%
85533 - 18 - 100%
85529 - 18 - 100%
85526 - 18 - 100%
85522 - 18 - 100%
85518 - 18 - 100%
85514 - 18 - 100%
85510 - 18 - 100%
85506 - 18 - 100%
85502 - 18 - 100%
85499 - 18 - 100%
85495 - 18 - 100%
85491 - 14 - 77.8%
85487 - 0 - 0%
85483 - 18 - 100%
85479 - 18 - 100%
85475 - 18 - 100%
85471 - 18 - 100%
85467 - 18 - 100%
85463 - 18 - 100%
85460 - 18 - 100%
85456 - 18 - 100%
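The % Reward column appears to track blocks created as a fraction of 18 expected blocks per session. A quick sketch checking that relationship (the 18-block expectation is inferred from the data above, not confirmed anywhere in this thread):

```python
# Assumption (inferred from the session table, not official): the reward
# percentage is blocks created divided by 18 expected blocks per session.
EXPECTED_BLOCKS = 18

def reward_pct(blocks_created: int) -> float:
    """Return the reward percentage for a session, rounded to one decimal."""
    return round(100 * blocks_created / EXPECTED_BLOCKS, 1)

print(reward_pct(18))  # 100.0
print(reward_pct(14))  # 77.8
print(reward_pct(13))  # 72.2
```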

(image attached)

a0log04212024-3.log

Information on your setup.

  1. Mainnet
  2. Version 13.3
  3. Aleph-node-runner
  4. Validator
  5. Flags: -n --ip --mainnet
  6. OS: Debian 12 (bookworm)
  7. VPS 8 vCPU Cores, 24 GB RAM, 1.6 TB SSD

Steps to reproduce

It is unknown how to reproduce this specifically; the issue occurs at varying intervals after the node has been running for several hours.

Did you attach relevant logs?

kostekIV commented 2 months ago

Hi, it looks like a potential system clock issue. Could you run, for example, the timedatectl command and show the output?

k3vmcd commented 2 months ago

> Hi, it looks like a potential system clock issue. Could you run, for example, the timedatectl command and show the output?

Output of timedatectl is:

Local time: Mon 2024-04-22 15:39:06 CEST
Universal time: Mon 2024-04-22 13:39:06 UTC
RTC time: Mon 2024-04-22 13:39:06
Time zone: Europe/Berlin (CEST, +0200)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
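For anyone wanting to check this programmatically, here is a minimal sketch that parses timedatectl-style output and flags an unsynchronized clock. The sample string is the output pasted above; in a real script you would capture it with subprocess instead:

```python
# Minimal sketch: parse `timedatectl` output and flag when the system clock
# is not NTP-synchronized. The sample below is the output pasted above; in
# practice, capture it via subprocess.run(["timedatectl"], capture_output=True).
sample = """\
Local time: Mon 2024-04-22 15:39:06 CEST
Universal time: Mon 2024-04-22 13:39:06 UTC
RTC time: Mon 2024-04-22 13:39:06
Time zone: Europe/Berlin (CEST, +0200)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
"""

def parse_timedatectl(text: str) -> dict:
    """Turn `Key: value` lines into a dict (split on the first colon only)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

fields = parse_timedatectl(sample)
synced = (fields.get("System clock synchronized") == "yes"
          and fields.get("NTP service") == "active")
print("clock OK" if synced else "clock NOT synchronized")
```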
kostekIV commented 2 months ago

For now, an issue with the system clock is our best bet. Is the problem still occurring?

k3vmcd commented 2 months ago

Yes, it occurred again during sessions 85607 and 85611. More logs are appended to the file. Searching for the start of session 85605 shows the progression: the node suddenly has a failed verification, starts rejecting blocks due to unknown parents, and eventually bans all peers again. After about two hours it self-recovered.

a0log04222024.log.zip

ggawryal commented 2 months ago

If you take a look at block 76938114, it was produced at 2024-04-21 01:48:03.004, whereas your node's logs show it attempting the import at 2024-04-21 01:47:55.140 (about 8 seconds earlier):

 WARN tokio-runtime-worker sync: 💔 Error importing block 0x3ef6a0a8fc58d329a2ec380797c6038970899acdeffca30bfe4342206664a341: block has an unknown parent    

This most likely does indicate an issue with the system clock in your VM (at least 8 s fast at the time of that warning). Could you also share the log from the timesyncd service?

journalctl --unit=systemd-timesyncd.service | tail -n 100 > journalctl-timesync.log
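To make the timing comparison concrete, here is a small stdlib-only sketch computing the gap between the two timestamps quoted in the comment above:

```python
from datetime import datetime

# Timestamps quoted in the comment above: when block 76938114 was produced
# on-chain, and when this node's log shows the import attempt.
FMT = "%Y-%m-%d %H:%M:%S.%f"
produced = datetime.strptime("2024-04-21 01:48:03.004", FMT)
imported = datetime.strptime("2024-04-21 01:47:55.140", FMT)

# A positive gap means the node tried to import the block *before* its
# production time, i.e. the node's local clock was running fast.
gap = (produced - imported).total_seconds()
print(f"node clock was at least {gap:.3f}s fast")  # -> 7.864s
```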
k3vmcd commented 2 months ago

timesyncd log

journalctl-timesync.log

ggawryal commented 2 months ago

Thank you. Unfortunately I can't find anything interesting in this log (time was synced only once), but from the node logs you've attached, as we said, it is very likely a system-time problem: the system clock drifts from time to time, eventually ending up at least 2 seconds fast, which causes errors like the ones you mentioned:

2024-04-21 01:47:55.135  WARN tokio-runtime-worker sync: 💔 Verification failed for block 0xf0abcd769c4190c1290fb5ad87f19cec790bbb4d4afc0b83bb22a537b5b7c09a: "Header 0xf0abcd769c4190c1290fb5ad87f19cec790bbb4d4afc0b83bb22a537b5b7c09a rejected: too far in the future"

This could be either a system or a hardware problem, but since there are strong indications that it is not a problem with the node protocol implementation itself, we can't assist you further with the investigation.
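For illustration, the rejection in the log above is the kind of result you would expect from a check like the following. This is a hypothetical sketch, not aleph-node's actual implementation; the function name and the 1-second tolerance are assumptions:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of a "too far in the future" header check of the kind
# behind the warning above. The name `accept_header` and the 1s tolerance
# are illustrative assumptions, not aleph-node's real code.
TOLERANCE = timedelta(seconds=1)

def accept_header(block_time: datetime, local_now: datetime) -> bool:
    """Reject headers whose timestamp is ahead of the local clock."""
    return block_time <= local_now + TOLERANCE

now = datetime(2024, 4, 21, 1, 47, 55)
print(accept_header(now + timedelta(seconds=0.5), now))  # within tolerance -> True
print(accept_header(now + timedelta(seconds=8), now))    # 8s ahead -> False
```

This also shows why a clock running a few seconds fast is symmetric trouble: the node rejects honest peers' fresh blocks as "too far in the future," while its own blocks look early to everyone else.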

k3vmcd commented 2 months ago

Understood. The node has been running fine since the latest log post.

Should I be using a time-sync service other than the Debian default, timesyncd?

ggawryal commented 2 months ago

Good to hear that.

Generally, there are no particular recommendations or requirements for that. The only requirement is that the clock must not drift by more than 1 s; otherwise the node may have problems producing and importing blocks.

My personal opinion is that timesyncd with a trusted NTP server should be sufficient, and if the system clock needs to synchronize often, I'd first make sure there are no hardware problems. Only then would I consider moving to a more precise alternative to the SNTP client implemented by timesyncd, such as chrony. This is, however, only my opinion, and it may differ from the opinions of other validators and blockchain users.