Open EmrePiconbello opened 1 year ago
Just checked O(1)'s testworld archive dbs, both have no missing blocks
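For anyone who wants to run a similar check on their own archive database, here is a minimal sketch of a gap check over block heights. It assumes the heights have already been fetched from the database (for example from a `blocks` table with a `height` column, which is an assumption about the schema). The helper name `find_missing_heights` is hypothetical. It only detects missing heights and does not verify that the blocks form a connected chain.

```python
from typing import Iterable, List


def find_missing_heights(heights: Iterable[int]) -> List[int]:
    """Return every height between the min and max seen that is absent."""
    seen = set(heights)  # duplicates (e.g. forks at one height) collapse here
    if not seen:
        return []
    return [h for h in range(min(seen), max(seen) + 1) if h not in seen]


if __name__ == "__main__":
    # Hypothetical input: heights read from the archive database.
    print(find_missing_heights([1, 2, 3, 5, 6, 9]))
```

An empty result means there are no height gaps in the range covered by the database.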
@EmrePiconbello you wrote in #14414:
[...] because of other weird behavior with postgre we move the postgre to remote isolated instance. Since we move it to remote postgre other issue we reported didn't happen.
Am I correct that you are no longer suffering from this issue after your infra changes?
This issue caused us to migrate to a remote Postgres instance, but running Postgres remotely has other issues.
@EmrePiconbello can you provide more information around these other issues? Thanks in advance!
@amc-ie I will try to summarize. The server on which we were running Postgres, the archive process, and the Mina node was using 128 GB of RAM and all of the CPU, or something like that. We were using Docker. At the time, that was the error we saw on the server, which is why I opened this issue.
We didn't investigate further, assuming it was a Docker/Postgres version-related issue triggered by the archive node.
Following that, we just plugged the archive process into our remote test Postgres server, which has been operational since 2018. By doing that we wanted to eliminate Postgres and Docker as variables.
Connecting the archive process to the remote Postgres brought up this issue, which was causing it to miss blocks etc.: https://github.com/MinaProtocol/mina/issues/14415
After some research we also realized that this issue still persists: https://github.com/MinaProtocol/mina/issues/14421. We spotted it a few times years ago, but we never knew it was related to the archive node, since we shut down our archive node within a very short time frame after mainnet.
https://github.com/MinaProtocol/mina/issues/14755 This is some kind of a workaround for the above issue.
At the moment I would consider only this one a problem, unless it is already being worked on: https://github.com/MinaProtocol/mina/issues/14415
For a production-ready archive node, suitable for actual production environment releases, having an archive node that is highly available, auto-scales, and has multiple shards and clusters... is very important, so the archiving process should behave similarly against local and remote Postgres.
Preliminary Checks
Description
The archive process fails with the error below. The problem is that, after this, Postgres starts taking up all the resources on the server.
Steps to Reproduce
1. Run the archive node and wait until the error occurs.
2. Postgres starts consuming all the resources on the server, making the whole server mostly unresponsive.
3. The situation does not resolve without killing Postgres.

While this might be a Postgres-related problem, I have never encountered this kind of issue on the Postgres instances we use for various tests ...
Expected Result
The archive process should crash and recover on its own, or at least not cause Postgres to clog up the server.
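To illustrate what "crash and recover on its own" could look like from the operator side, here is a rough sketch of an external supervisor that restarts a process with exponential backoff. This is not how the archive process itself behaves or is meant to be run. The names `supervise` and `run_command` and all default values are hypothetical, and the command to launch is left to the reader.

```python
import subprocess
import time
from typing import Callable, List


def supervise(
    run_once: Callable[[], int],
    max_restarts: int = 5,      # assumed restart budget
    base_delay: float = 1.0,    # assumed first backoff delay, in seconds
    max_delay: float = 60.0,    # assumed cap on the backoff delay
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `run_once` until it returns 0 or the restart budget is spent.

    Returns the last exit code observed.
    """
    delay = base_delay
    restarts = 0
    code = run_once()
    while code != 0 and restarts < max_restarts:
        sleep(delay)                       # wait before restarting
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
        restarts += 1
        code = run_once()
    return code


def run_command(cmd: List[str]) -> int:
    """Run a command to completion and return its exit code."""
    return subprocess.run(cmd).returncode
```

It could be wired up as `supervise(lambda: run_command(cmd))`, where `cmd` is whatever command starts the archive process in your setup. A supervisor like this only covers the restart half of the expectation; it does nothing about Postgres consuming the server's resources.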
Actual Result
No matter what specs we threw at it, this issue happened after some time, and then Postgres took all of the server's resources.
How frequently do you see this issue?
Frequently
What is the impact of this issue on your ability to run a node?
High
Status
Additional information
No response