deso-protocol / run

Run your own DeSo node
https://docs.deso.org

Node suddenly shows "502 Bad Gateway" error message #75

Closed: ConfidenceYobo closed this issue 3 years ago

ConfidenceYobo commented 3 years ago

Everything works as normal, but sometimes the node suddenly shows a "502 Bad Gateway" error and everything stops working until I restart it. Sometimes I also need to resync the node to get everything working again.

tijno commented 3 years ago

Check the logs for issues like "too many open files", which is the most common cause of the backend crashing and producing the 502 Bad Gateway error.

Also check whether you may be running out of memory.
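(A minimal, purely illustrative Go sketch of what that open-file limit is from a process's point of view -- how to read it and raise the soft limit toward the hard limit. This is not the backend's real startup code; in practice the limit is set outside the process by your shell, systemd unit, or Docker configuration.)

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Read this process's open-file limit (the thing "too many open files"
	// refers to). Badger keeps many SST/value-log files open, so a low soft
	// limit is a common way for the backend to fall over.
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft limit=%d, hard limit=%d\n", rl.Cur, rl.Max)

	// Illustrative only: raise the soft limit to the hard limit. The real
	// fix is usually applied outside the process (ulimit -n, systemd's
	// LimitNOFILE, or Docker's --ulimit flag).
	rl.Cur = rl.Max
	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		fmt.Println("could not raise soft limit:", err)
	}
}
```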

ConfidenceYobo commented 3 years ago

Thanks for your response. I have checked, and I am not running low on memory: only about 2% of memory is in use, and there is also enough space on disk.

ConfidenceYobo commented 3 years ago

I have checked the log and can't find any "too many open files" error, but I did find this:

Server._handleTransactionBundle: Rejected transaction < TxHash: 4cb5bb4e968c37c98376ceb1c14aac74be1303bc309eddfc343f92ad3a5f42b7, TxnType: LIKE, PubKey: BC1YLiSpY6Ec9NWTNfmziLhSrrdB8dbVx4nspWAgkZgKic3Wxteiynx > from peer [ Remote Address: 34.123.41.111:17000 PeerID=2 ] from mempool: TxErrorDuplicate

tijno commented 3 years ago

Those do happen often as a result of a crash - it may stop TXIndex from keeping up with new blocks. But I've not seen it cause crashes itself.

ConfidenceYobo commented 3 years ago

What are some possible causes of crashes?

tijno commented 3 years ago

What I mentioned above:

running out of memory, or running out of open file handles.

Also: running out of disk space, or the server itself crashing.
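(To make the disk-space case concrete, here is a small illustrative Go check of free space on the volume holding the node's data. The /var/lib/deso path is just a placeholder for wherever your data directory or Docker volume actually lives.)

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	// Placeholder path: point this at your node's actual data directory
	// (or the mount point of its Docker volume).
	const dataDir = "/var/lib/deso"

	var st syscall.Statfs_t
	if err := syscall.Statfs(dataDir, &st); err != nil {
		panic(err)
	}
	// Available blocks * block size, converted to GiB.
	freeGiB := float64(st.Bavail) * float64(st.Bsize) / (1 << 30)
	fmt.Printf("free space under %s: %.1f GiB\n", dataDir, freeGiB)
}
```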

ConfidenceYobo commented 3 years ago

But none of these is the case for me.

tijno commented 3 years ago

I get this sometimes on the admin section of a node - and I have to log out of my bitclout account on the node and log back in for it to go away.

Are you seeing the same?

marnimelrose commented 3 years ago

I had that on mine when it was running.

ConfidenceYobo commented 3 years ago

It happens mostly when I'm not logged in to the bitclout node but am using the API.

tijn commented 3 years ago

@tijno sorry for spamming the conversation again... but I keep getting notified now because of the tagline behind your name: "(BitClout @Tijn)" 🤣

tijno commented 3 years ago

oh man github :) sorry @tijn ill change it

tijn commented 3 years ago

> oh man github :) sorry @tijn ill change it

@tijno Thank you!

tijno commented 3 years ago

all done

ConfidenceYobo commented 3 years ago

Fixed the issue by increasing the server's memory to 64GB.

HPaulson commented 3 years ago

Hey -- wanted to drop a comment here, as this has been happening on 8 nodes under my company's management. All of the machines have 30GB of memory, and we work around the OOMs by simply using Docker's restart flag (I know, not a great option, but it works temporarily). After speaking with @tijno, he runs nodes on a 32GB machine and maxes out at around 60% memory usage. I'll also note that all eight of these nodes have been synced for an extended period of time, and these crashes occur quite randomly. The following OOM occurs:

[2025186.138224] Out of memory: Killed process 215489 (backend) total-vm:266280732kB, anon-rss:30178620kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:108952kB oom_score_adj:0
[2025187.222710] oom_reaper: reaped process 215489 (backend), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The OOMs have all been caused by rejected Duplicate Tx's:

E0831 14:01:03.874597 1 server.go:1311] Server._handleTransactionBundle: Rejected transaction < TxHash: 25452952cf8b3a8adc6f3412a2bcc4b9aa4e7960ec4d3052b8f4f8e1ff42d93c, TxnType: PRIVATE_MESSAGE, PubKey: BC1YLhhrJUg1ms7P3YMQcjGPTVY9Tf8poJ1Xdeqt6AsoJ5g3zNvFz98 > from peer [ Remote Address: 34.123.41.111:17000 PeerID=5 ] from mempool: TxErrorDuplicate

While increasing memory is definitely a solution, and restarting on crash is also... something haha, I see no reason why a node can't run on a 30GB machine. My worry is that there's a potential memory leak, even though that's fairly uncommon in Go... Beyond this, I have little idea why an already-synced node would require more than 30GB -- especially since this is occurring uniformly across all 8 nodes under our management, each time right after a duplicate TX error is produced.

It is, of course, also possible that I'm just missing something. Would really appreciate any suggestions, as simply restarting the process after a crash isn't the best approach, let alone an effective one long-term hahaha
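(For anyone chasing a similar suspicion: below is a minimal, purely illustrative Go sketch of the kind of in-process check that helps tell a Go-heap leak apart from memory pressure outside the heap. It is the stock runtime.ReadMemStats pattern, not the DeSo backend's actual code.)

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// logMemStats periodically prints Go heap statistics. On an already-synced
// node, heap numbers that climb and never come back down suggest a leak (or
// Badger holding on to memory); flat heap numbers alongside kernel OOM kills
// suggest the pressure comes from outside the Go heap (mmap'd files, cgroup
// limits, other processes on the box).
func logMemStats(interval time.Duration) {
	for {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		log.Printf("heapAlloc=%dMiB heapInuse=%dMiB sys=%dMiB numGC=%d",
			m.HeapAlloc>>20, m.HeapInuse>>20, m.Sys>>20, m.NumGC)
		time.Sleep(interval)
	}
}

func main() {
	go logMemStats(30 * time.Second)
	select {} // stand-in for the node's real work
}
```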

maebeam commented 3 years ago

We profile our nodes 24/7 and aren't aware of any memory leaks. Badger is a memory hog and is on its way out.
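(For reference, the usual way to do this kind of continuous profiling in a Go service is the standard library's net/http/pprof handler; a minimal sketch is below. This is just the stock pattern, not necessarily how the DeSo backend wires it up, and the localhost:6060 address is only a conventional example.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose the profiler on localhost only. A heap profile can then be
	// pulled at any time with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// Comparing snapshots taken hours apart is the simplest way to confirm
	// or rule out a leak.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```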

HPaulson commented 3 years ago

Makes sense -- thanks for the reply @maebeam

Glad to see badger go for a number of reasons hahaha