Node stop working / consensus error: Import failed

Cardinal-Cryptography / aleph-node-issues

Issue tracker for aleph-node related problems.

2 stars 0 forks source link

Node stop working / consensus error: Import failed #8

Closed tspring80 closed 7 months ago

tspring80 commented 1 year ago

Did you read the documentation and guides?

[X] I have inspected the documentation.

Is there an existing issue?

[X] I have searched the existing issues.

Description of the problem

Today my main node had problems with synchronization. I had to restart Docker. The same thing happened with my backup node 2 days before. The nodes have no commonality and are running in different countries. There are some discord messages from other node owners from missing blocks. Is there a problem in the network?

Attached you find the log files. Best regards, Tom | CCS

Information on your setup.

1) both 2) latest 3) alpeh-node-runner (validator) 4) common flags 5) Ubuntu 22.04.2 6) VPX

Steps to reproduce

2023-09-28_log.txt

Did you attach relevant logs?

[X] I have attached logs (if relevant).

kostekIV commented 1 year ago

Hi Tom, thanks for bringing this up. Could you provide us with more logs from before the restart? Ideally even before node stopped getting new blocks from the network. Thanks

tspring80 commented 1 year ago

Hi Kostek, here is the full 24h log. Best regards, Tom logs230928.zip

ggawryal commented 12 months ago

HI Tom, we can't tell the direct cause of the problem from logs, but we have experienced similar issues previously under unpleasant network conditions, as they were when the node stopped synchronizing. We are during the process of improving the block sync mechanism, and in the incoming release 12 we are adding a large component that should resolve the issue.

ggawryal commented 10 months ago

Hi, have the problems appeared again since back then? If not, I'd close the issue.

StakingBridge commented 10 months ago

Hello, we suffered a similar issue in the past on testnet, since the last update didn´t happened again and today we detected our backup server being stuck on mainnet.

Running from source code
Ubuntu 22.04.3 LTS
Threadripper 3970X, 128 GB RAM, 2 TB NVME SAMSUNG 980 PRO, 1GB SYMMETRICAL

The disk is brand new, also the RAM.

When you restart the node, is able to catchup 20-30 blocks if you check on realtime azero telemetry, but then stop syncing again.
The only solution is to clean the DB and download a new snapshot.
We can see the line "consensus error: Import failed: Database" but the disk have no errors and as we explained is brand new, changed because we had this issue on September, the same for the RAM.

Log: azerolog.txt.zip

ggawryal commented 7 months ago

Hello @StakingBridge, sorry for the late reply. From the log file you attached, it looks like the reason of sync stall in your case is that the database got corrupted, and cleaning the DB was the only solution indeed. The corruption must have happened prior to these logs, so I can't tell when exactly and why it got corrupted (maybe there was some power outage or something). Has this situation occurred again, and could you provide some more logs?

StakingBridge commented 7 months ago

Hello Gregorz Gawryal.For your information the issue was a faulty RAM module causing DB corruptions, is fixed, probably our fault because we forgot to close the issue on Github.Thank you.R.MarinEl 13 feb 2024, a las 10:58, Grzegorz Gawryał @.***> escribió: Hello @StakingBridge, sorry for the late reply. From the log file you attached, it looks like the reason of sync stall in your case is that the database got corrupted, and cleaning the DB was the only solution indeed. The corruption must have happened prior to these logs, so I can't tell when exactly and why it got corrupted (maybe there was some power outage or something). Has this situation occurred again, and could you provide some more logs?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>