MinaProtocol / mina

Mina is a cryptocurrency protocol with a constant size blockchain, improving scaling while maintaining decentralization and security.
https://minaprotocol.com
Apache License 2.0

1.3.2beta2 being all over the place. #11981

Open · EmrePiconbello opened this issue 1 year ago

EmrePiconbello commented 1 year ago

Preliminary Checks

Description

We run 3 nodes on 1.3.2beta2-6e4c7fc and there are always weird inconsistencies: when we try to get some data, the node just crashes and restarts. We didn't dive much into it.

But the last time was clearly abnormal.

Not sure exactly what went wrong, and it was not clear from the node itself. We tried to export the logs, but that didn't work out; any kind of command just halted in the process instead of giving output.

1) Restart: logs were stating that the node can't find its IP.
2) Many crashes after a while, following the few manual restarts.
3) At some point it stopped giving the "can't find IP" error but started to crash right away, claiming the password is wrong or the key is corrupted.
4) In the process it also gave some entrypoint errors and crashed.
5) On its own, after a very long time and many crashes, it recovered and started working as expected.

Steps to Reproduce

Not sure, but this kind of random state happens after some uptime.

Expected Result

Node to output correct error in logs.

Actual Result

Seemingly random errors with no clear connection between them.

How frequently do you see this issue?

Rarely

What is the impact of this issue on your ability to run a node?

High

Status

Status was not working

Additional information

This was mostly happening to one node at a time, and it would recover after a random time frame. Without any change on our side, nodes were crashing with errors like the ones I mentioned above and didn't recover no matter what we did, because the actual reason, to our knowledge, is not what's written in the logs. We have a centralized logging system; if the logs made it there without a problem, I can pull some from it, but it will take some time.

shimkiv commented 1 year ago

Hey @EmrePiconbello, thanks for reporting this. I hope you realise that without logs we can't do much here, so please attach some if you can.

EmrePiconbello commented 1 year ago

Yes, I know; we are trying to locate the logs for this in case our system captured them. The reason I brought this up is that recent versions have been introducing minor issues like this one, but they were mostly not that significant or common. With this release, we moved all of our nodes to this version to be sure. All of them were crashing on status commands, reporting the key as corrupted or the password as wrong even though it isn't, hitting entrypoint errors, and so on. While it looks stable and okay in general, there is something wrong with it, and I just want to inform the team about that. It started as a very minor, rare occurrence a few versions back (we weren't sure, since we had only one server on the new releases, and it was not that healthy at that period and required a hardware change). I don't think it's either minor or rare at this point. I just want to bring some attention to this, even though we might not be able to produce logs. I will tag you when we have the logs.

EmrePiconbello commented 1 year ago

@shimkiv Just shared logs over the drive, covering 5 days, plus the only crash report created by the node. According to the logs there are many crash records, but none of the corresponding crash reports were actually created. We have over 5 TB of logs, which makes it hard to find things when we need them :) We should probably put resources into some better solution, or drop log collection completely, since these logs don't help us at all.

shimkiv commented 1 year ago

Thank you @EmrePiconbello! Where can we find the Drive link?

EmrePiconbello commented 1 year ago

@shimkiv here it is, shared with logs@o1labs as mentioned in the documentation: https://drive.google.com/drive/folders/1iBUw6hSnnrOsTuCPaka8i8T0z4BtBhvW?usp=sharing

(The link was edited; the one originally shared here pointed to logs for the wrong issue.)

deepthiskumar commented 1 year ago

- Two instances of the libp2p helper receiving SIGTERM (2022-07-23_23-02-02 logs).
- One instance of a crash due to a VRF evaluator process connection-close error (2022-07-23_23-02-02 logs): `{"timestamp":"2022-07-21 04:01:45.514631Z","level":"Error","source":{"module":"Block_producer","location":"File \"src/lib/block_producer/block_producer.ml\", line 455, characters 10-22"},"message":"Error fetching slots from the VRF evaluator : $error. Trying again","metadata":{"error":"((rpc_error (Connection_closed (\"EOF or connection closed\")))\n (connection_description (\"Client connected via TCP\" (localhost ))\n (rpc_tag rpc_parallel_plain_7) (rpc_version 0))"}}`
- Two instances of the daemon crashing after a certain uptime (expected behaviour).
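
For anyone sifting the shared archive for more of these, a minimal sketch of this kind of filter (the `mina.log` path and the message substring are illustrative assumptions based on the entry quoted above, and it assumes the daemon log is one JSON object per line, as that entry suggests):

```python
import json

# Pull error-level entries mentioning the VRF evaluator out of a daemon
# log written as one JSON object per line. File name and substring are
# illustrative assumptions, not the exact paths used in this thread.
LOG_PATH = "mina.log"
NEEDLE = "VRF evaluator"

with open(LOG_PATH, encoding="utf-8") as log_file:
    for raw_line in log_file:
        try:
            entry = json.loads(raw_line)
        except json.JSONDecodeError:
            continue  # skip truncated or corrupted lines
        if entry.get("level") == "Error" and NEEDLE in entry.get("message", ""):
            print(entry.get("timestamp"), entry.get("message"))
            print(json.dumps(entry.get("metadata", {}), indent=2))
```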

mrmr1993 commented 1 year ago

> @shimkiv here it's shared to logs@o1labs as mentioned in documentations https://drive.google.com/drive/folders/1q6y9HnftpeNRjviXcb-DaSVNci9jxGLe?usp=sharing

@EmrePiconbello the logs in that drive folder seem to be from July (running 1.3.1), and I can't see anything in logs@o1labs. Would you mind sharing your logs again?

EmrePiconbello commented 1 year ago

In that case, the automated log collection didn't catch the state I mentioned above. We also spotted some irregularities with the log timestamps: the files were created much later than expected, and some of them are corrupted. I will add these logs to the existing folder in case they capture some of the irregularities. Unfortunately, after some checking, I couldn't find any logs collected while we were having these random, seemingly unrelated crashes.

@mrmr1993 Our logs might be confusing because we tried a lot of different logging options. I think what you saw are old logs left in the packages: at a certain point they overwrite the new ones instead of the old ones, and the old ones stay there forever. Whatever we do, we can't get consistent 24-hour coverage for the logs (the only way we could do that is to capture and store all logs externally on a server, but because of the storage and networking cost multiplied by the number of nodes we are running, that was not feasible). After some point we gave up and changed the settings between versions to see how it goes. As it stands right now, it's some data rather than nothing.

The logs also might be mixed up. https://drive.google.com/drive/folders/1iBUw6hSnnrOsTuCPaka8i8T0z4BtBhvW?usp=sharing should be the link for this issue. There are a lot of issues, but we can't reach any conclusive data and mostly leave things as they are. That link is probably from the issue where we sent a lot of transactions and, until we produced a block ourselves, our transactions weren't included in blocks by the block producers for hours. At least that's my guess from the naming; I should confirm it with my team. Sorry for sharing the wrong link.

EmrePiconbello commented 1 year ago

The first link was about this issue. I remember we were talking about it before, but I couldn't find any details, so I opened a new issue for that one: https://github.com/MinaProtocol/mina/issues/12037

mrmr1993 commented 1 year ago

@EmrePiconbello I'm only seeing crashes/restarts related to the node being unable to reach bot.whatismyipaddress.com. Is this the primary issue, or was there something else?

If so, can you check whether you can reach that address from a shell on those machines?
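
For reference, a minimal sketch of that kind of check, runnable on the machine itself (the hostname is taken from the crash logs; the HTTPS scheme and everything else here are assumptions about the environment, not about what the daemon does internally):

```python
import socket
import urllib.request

# Check the two steps the daemon would need: resolve the external-IP
# lookup host, then fetch it over HTTPS. Hostname taken from the logs;
# the daemon itself may use a different scheme or endpoint.
HOST = "bot.whatismyipaddress.com"

try:
    address = socket.gethostbyname(HOST)
    print(f"DNS ok: {HOST} -> {address}")
except socket.gaierror as err:
    print(f"DNS resolution failed: {err}")
else:
    try:
        with urllib.request.urlopen(f"https://{HOST}", timeout=10) as resp:
            print(f"HTTP {resp.status}, body: {resp.read().decode().strip()}")
    except OSError as err:
        print(f"HTTP request failed: {err}")
```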

EmrePiconbello commented 1 year ago

@mrmr1993 Yes, that was the weirdness of the node. Let me explain what happened in general. First, the node was in sync and performing fine, but the SNARK uptime system was failing. I connected to the server to check it out; the node looked fine, but whatever command I threw at it just hung forever without a response. (I checked the logs, and while I am not 100% sure, I remember seeing an IP/connection issue in them; since the status showed the latest block, I assumed it was about the SNARK uptime system.) Then I sent the status command and the node crashed. That was a status command over the CLI; our monitoring software pulls it over the API every minute. (This has happened a few times in recent releases; I'm not sure about the reason behind it.)
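
For context, the once-a-minute status pull looks roughly like this (a minimal sketch only: the default GraphQL port 3085, the /graphql path and the `syncStatus` / `daemonStatus.blockchainLength` fields are assumptions, and our actual monitoring differs in the details):

```python
import json
import time
import urllib.request

# Poll the daemon's GraphQL endpoint once a minute, roughly mirroring the
# monitoring described above. Port, path and queried fields are
# assumptions for the sake of the sketch.
ENDPOINT = "http://localhost:3085/graphql"
QUERY = {"query": "{ syncStatus daemonStatus { blockchainLength } }"}

while True:
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(QUERY).encode(),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as resp:
            data = json.loads(resp.read())["data"]
            print(data["syncStatus"], data["daemonStatus"]["blockchainLength"])
    except (OSError, KeyError, json.JSONDecodeError) as err:
        # This is where the hangs and crashes described above show up.
        print(f"status poll failed: {err!r}")
    time.sleep(60)
```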

Follow-up: typically after a crash the node works as normal, but this time it kept crashing because it couldn't find the node's IP, as it couldn't reach those websites. We checked, and the other services we run on that server were fine; the server is more than enough for a single Mina node, and we don't run SNARK workers there. Anyway, we were thinking: everything is correct, so what's the issue? Since there was no version change, we didn't think it was flag-related. We tried to keep monitoring, but there was not much to gain. Stopping and starting, or completely removing that container and starting it again with the same configs, fixed the IP issue. The problem didn't stop there; now it was throwing some other random errors. After several crashes and everything, it started to say the key was corrupted or the password wrong, so we assumed the key was actually corrupted, though that was not the case. Even after a full cleanup of everything and fresh key files, it was still crashing with the same error, so we left it. It recovered on its own after some time.

The reason I bring this up is that with every release it gets worse. We didn't mind some entrypoint crashes, crashes on status commands, etc. We have removed nearly all extra flags from our config, and in our last release, which happened today, we also removed the stop time. We keep removing functions from our nodes because with every update or version change it's a headache to troubleshoot, as there is no clear log of these issues either, as Gareth reported here: https://github.com/MinaProtocol/mina/issues/12033

All of this makes our process hard and not viable, whether operating nodes, holding an archive, or whatever else other parties want to do. I am trying to draw some attention to these issues in the hope of an improvement. Also, I assume there is some underlying issue that causes the node to output these random logs. I call them random because, even if for some reason the node couldn't reach the internet while nothing was blocking it, we refreshed the key files and it kept reporting a corrupted key / wrong password, and then started working without an issue on its own, without any change.

shimkiv commented 1 year ago

Related to https://github.com/MinaProtocol/mina/issues/12031 and https://github.com/MinaProtocol/mina/issues/12032

EmrePiconbello commented 1 year ago

It might be related, but I can assure you no IP change happened. All of our nodes are on static IPs, and some of them are under our control since they are on our own infrastructure. (This one is a co-located bare-metal server in Germany; we didn't spot any abnormalities across the various logs and metrics we pull for monitoring.) I don't have the words to describe what happened in this issue, because nothing adds up. It was technically working, as it was synced at the highest block height when we checked via GraphQL, yet it was outputting IP-related issues in the logs. After restarting, it kept crashing from the start with IP-related issues for 5-20 minutes, then kept failing for 3-4 other reasons over 2-3 hours, across various restarts and starts from scratch. In the end, it recovered on its own and worked as if nothing had happened.