Hi, it looks like a potential system clock issue. Could you run, for example, the timedatectl command and show the output?
Output of timedatectl is:
Local time: Mon 2024-04-22 15:39:06 CEST
Universal time: Mon 2024-04-22 13:39:06 UTC
RTC time: Mon 2024-04-22 13:39:06
Time zone: Europe/Berlin (CEST, +0200)
System clock synchronized: yes
NTP service: active
RTC in local TZ: no
For now, an issue with the system clock is our best bet. Is the problem still occurring?
Yes, it occurred again during sessions 85607 and 85611. More logs have been appended to the file. Searching for the start of session 85605 shows the progression: it suddenly has a failed verification, starts rejecting blocks due to unknown parents, and eventually bans all peers again. After about two hours it recovered on its own.
If you take a look at block 76938114, it was produced at 2024-04-21 01:48:03.004, whereas your node's logs show an attempt to import it at 2024-04-21 01:47:55.140 (about 8 seconds earlier):
WARN tokio-runtime-worker sync: 💔 Error importing block 0x3ef6a0a8fc58d329a2ec380797c6038970899acdeffca30bfe4342206664a341: block has an unknown parent
This most likely indicates some issue with the system clock in your VM (it was at least 8 seconds fast at the time of that warning). Could you also share the log from the timesyncd service?
journalctl --unit=systemd-timesyncd.service | tail -n 100 > journalctl-timesync.log
timesyncd log
Thank you. Unfortunately, I can't find anything interesting in this log (time was synced only once), but from the node logs you've attached it is, as we've said, very likely that there is some problem with the system time: the system clock drifts from time to time, eventually ending up at least 2 seconds fast, which causes errors like the ones you've mentioned:
2024-04-21 01:47:55.135 WARN tokio-runtime-worker sync: 💔 Verification failed for block 0xf0abcd769c4190c1290fb5ad87f19cec790bbb4d4afc0b83bb22a537b5b7c09a: "Header 0xf0abcd769c4190c1290fb5ad87f19cec790bbb4d4afc0b83bb22a537b5b7c09a rejected: too far in the future"
This could be either a system or a hardware problem, but since there are strong indications that it is not a problem with the node protocol implementation itself, we can't assist you further with the investigation.
Understood. The node has been running fine since the latest log post.
Should I be using some time sync service other than the Debian default timesyncd?
Good to hear that.
Generally, there are no particular recommendations or requirements on that. The only requirement is that the clock should not drift by more than 1 s; otherwise the node may have problems producing and importing blocks.
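As a quick sanity check (a minimal sketch, assuming systemd 239 or newer, where this subcommand is available), the current offset reported by systemd-timesyncd can be inspected with:
timedatectl timesync-status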
My personal opinion is that timesyncd with a trusted NTP server set should be sufficient. If the system clock needs to synchronize often, I'd first make sure there are no hardware problems; only then would I consider moving to more precise alternatives to the SNTP client implemented by timesyncd, such as chrony. This is, however, only my opinion, and it may differ from the opinions of other validators and blockchain users.
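For illustration only, a minimal sketch of pointing timesyncd at a trusted NTP server (pool.ntp.org below is just a placeholder, not a recommendation): in /etc/systemd/timesyncd.conf, under the [Time] section, set
NTP=pool.ntp.org
and then restart the service:
sudo systemctl restart systemd-timesyncd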
Did you read the documentation and guides?
Is there an existing issue?
Description of the problem
Since upgrading to 13.3 my validator node is missing sessions, crashing frequently, and locking up such that it won't even open an SSH connection (e.g., it crashed during session 85418 and missed the next 5 sessions until I hard-reset the VM). In between reboots, the node validates successfully. Occasionally it will self-recover: for example, it disconnected and missed during session 85487, self-recovered in session 85491, and then ran fine without intervention until it missed blocks in session 85557, crashed, and missed session 85561.
In the logs I see a lot of failed verifications, and then the node bans all its peers and disconnects from the network. This seems to be the cause of the node eventually crashing.
2024-04-21 01:48:30.905 WARN tokio-runtime-worker aleph-block-sync: Error handling request from PeerId("12D3KooWFi2pXEHAsujrfJ7E5ypAn8cgaKEdwnMgjyb8EJy1aznK"): forest error: block is too new.
During crashes where I have to hard-reset, the logs are unreadable with the error:
error from daemon in stream: Error grabbing logs: invalid character '\x00' in string literal
This appears to happen when the docker log is hard-interrupted, as it is reported separately in docker issues, and I had to manually remove the offending line from the log. This may indicate the node process is still hung when I hard-reset the VM. In the attached log, session 85486 is where it shows errors that lead to disconnecting from all peers as it enters the active session 85487. Then, session 85557 just suddenly stops without anything noticeable in the log, but Grafana shows the sudden disconnect or crash during the active session; it doesn't recover and misses session 85561 until I hard-reboot, at which point it immediately bans a peer upon restart for a genesis mismatch and then logs a huge block of "Pinned block cache limit reached. Evicting value" messages before it starts participating in validation again.
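For reference, a sketch of how the NUL bytes could be stripped before viewing the log (this assumes the default json-file logging driver; the container name a0-node is hypothetical):
LOG_PATH=$(docker inspect --format '{{.LogPath}}' a0-node)
sudo cat "$LOG_PATH" | tr -d '\000' > /tmp/a0-node-clean.log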
Blocks created in active sessions (Session - Blocks Created - % Reward):
85596 - 18 - 100%
85592 - 18 - 100%
85588 - 18 - 100%
85584 - 18 - 100%
85580 - 18 - 100%
85576 - 18 - 100%
85572 - 18 - 100%
85568 - 18 - 100%
85564 - 13 - 72.2%
85561 - 0 - 0%
85557 - 14 - 77.8%
85553 - 18 - 100%
85549 - 18 - 100%
85545 - 18 - 100%
85541 - 18 - 100%
85537 - 18 - 100%
85533 - 18 - 100%
85529 - 18 - 100%
85526 - 18 - 100%
85522 - 18 - 100%
85518 - 18 - 100%
85514 - 18 - 100%
85510 - 18 - 100%
85506 - 18 - 100%
85502 - 18 - 100%
85499 - 18 - 100%
85495 - 18 - 100%
85491 - 14 - 77.8%
85487 - 0 - 0%
85483 - 18 - 100%
85479 - 18 - 100%
85475 - 18 - 100%
85471 - 18 - 100%
85467 - 18 - 100%
85463 - 18 - 100%
85460 - 18 - 100%
85456 - 18 - 100%
a0log04212024-3.log
Information on your setup.
Steps to reproduce
Unknown how to reproduce specifically. The issue is encountered at various intervals after the node has been running for several hours.
Did you attach relevant logs?