NethermindEth / juno

Starknet client implementation.
https://juno.nethermind.io
Apache License 2.0
371 stars 157 forks

OOM Crashes on Juno Pod After Restart During Heavy Load #1832

Open wojciechos opened 2 months ago

wojciechos commented 2 months ago

Increased traffic targeting the starknet_call method pushed CPU usage on our k8s pod to 100%, causing request failures and block sync issues. Every subsequent restart of the pod ended in an OOM kill shortly after startup. After we replaced the database with a fresh one, the pod started and synced properly without any OOM issues, which suggests that the old DB may have been corrupted.

k8s Logs:

terminated
Reason: OOMKilled - exit code: 137
Started at: 2024-04-19T15:14:04+05:30
Finished at: 2024-04-19T15:14:51+05:30

Possible Causes:

//UPDATE - 06.05.2024

The pod was unable to keep up with syncing, and requests failed because it reached its CPU limit.
Actions taken: Added more pods and restarted the pod, with no improvement.
Resolution: Removing and replacing the DB resolved the issue.
Next steps: Prioritize investigating and fixing the underlying cause.

06-05-2024-incident.pdf