Closed. rpl-ffl closed this 2 weeks ago.
It looks to me that this PR piles up a lot of workarounds - like sending the interrupt signal to the container, while container.Stop() is already doing this. If there is a strong reason for such workarounds, it should be properly documented and proven by tests that it must be done this way. Otherwise, I would prefer to live (for some time) with a system that sometimes does not shut down correctly. We generate new data all the time anyway, so it is not a big issue, right?
This was required for the consensus pipeline. Norma is used to generate events.db as an input to the consensus stage. To export events, we require a graceful shutdown of the database. Without these changes, we sporadically end up with a dirty DB, which prevents the events export and fails the pipeline. Example: https://scala.fantom.network/job/Consensus/job/Sonic-event-collector/40/console.
I understand that for our purposes right now it's not a big issue, but the workaround was meant to be the fastest way to ensure we don't end up with a dirty DB.
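For context, here is a minimal sketch of the kind of shutdown sequence being discussed, assuming a recent version of the Docker SDK for Go. ContainerKill and ContainerStop are real SDK calls; the container ID, timeout, and helper name are placeholders, not values from this PR.

```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// gracefulStop first sends SIGINT explicitly (as discussed above) and then
// calls ContainerStop, which itself sends the container's stop signal and
// falls back to SIGKILL once the timeout expires.
func gracefulStop(ctx context.Context, cli *client.Client, id string) error {
	// Only the process with PID 1 inside the container receives this signal.
	if err := cli.ContainerKill(ctx, id, "SIGINT"); err != nil {
		return err
	}
	// Give the node time to flush its database before the hard kill.
	timeout := 60 // seconds
	return cli.ContainerStop(ctx, id, container.StopOptions{Timeout: &timeout})
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	if err := gracefulStop(context.Background(), cli, "sonic-node"); err != nil {
		log.Fatal(err)
	}
}
```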
Thanks for the explanation.
I did some background reading, and it looks to me that Sonic is not shut down gracefully because it is executed as a sub-process of a shell. See here: https://docs.docker.com/reference/cli/docker/container/kill/
That means that no matter how the signal sending is reworked here, it will not solve the problem. It must be done differently.
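To illustrate the point about the shell sub-process, here is a sketch of the difference using the Docker SDK for Go. The image name, binary name, and flags are placeholders; this is not the configuration Norma actually uses.

```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Shell form: /bin/sh becomes PID 1 and receives SIGINT/SIGTERM, while the
	// node runs as its child and is only hit by the final SIGKILL.
	shellForm := &container.Config{
		Image: "sonic-image",                               // placeholder image name
		Cmd:   []string{"/bin/sh", "-c", "sonicd --flags"}, // placeholder command
	}

	// Exec form: the node binary itself is PID 1, receives the signal directly,
	// and can shut down its database cleanly.
	execForm := &container.Config{
		Image: "sonic-image",
		Cmd:   []string{"sonicd", "--flags"},
	}

	for _, cfg := range []*container.Config{shellForm, execForm} {
		resp, err := cli.ContainerCreate(ctx, cfg, nil, nil, nil, "")
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("created container %s", resp.ID)
	}
}
```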
I was going for tini -g: tini makes sure the PID is 1, and the -g makes sure the child processes get the signal.
"With the -g option, Tini kills the child process group, so that every process in the group gets the signal. This corresponds more closely to what happens when you do ctrl-C etc. in a terminal: The signal is sent to the foreground process group." https://github.com/krallin/tini
So I can't argue whether this change addressing SIGINT is the best way to implement it. I just know that we no longer had a sporadic dirty DB in the 20 runs after, and that's what the consensus pipeline needs right now.
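As a rough sketch of the tini -g approach when the container is created through the Docker SDK for Go: tini runs as PID 1 and, with -g, forwards signals to the whole child process group, so a node launched via a shell still receives SIGINT. The image, binary, and flag names below are placeholders, not taken from this PR, and tini must already be present in the image.

```go
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// tini becomes PID 1; -g forwards received signals to the child process
	// group, so the shell-wrapped node also gets SIGINT on stop.
	cfg := &container.Config{
		Image:      "sonic-image",                               // placeholder image name
		Entrypoint: []string{"tini", "-g", "--"},                // tini must exist in the image
		Cmd:        []string{"/bin/sh", "-c", "sonicd --flags"}, // placeholder command
	}

	resp, err := cli.ContainerCreate(context.Background(), cfg, nil, nil, nil, "")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("created container %s", resp.ID)
}
```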
This change, in its current form, is no longer being introduced to main; its implementation will be revisited before we attempt this again in main.
Minimum changes required to bring main up to rapol/hotfix/consensus-pipeline:
- NumberOfValidator to MandatoryNumberOfValidator
- TotalNumberOfValidator, and propagate that to pre-node-start injection (leaving this as 100)
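Purely as an illustration of the parameter split described above, a hypothetical shape it could take; the actual Norma types, field names, and injection mechanism may differ.

```go
// Package sketch is a hypothetical illustration, not Norma's real config code.
package sketch

// ValidatorConfig is a hypothetical holder for the two parameters named above.
type ValidatorConfig struct {
	// Formerly NumberOfValidator: validators that must be present from the start.
	MandatoryNumberOfValidator int
	// Overall validator count, injected before node start (kept at 100 here).
	TotalNumberOfValidator int
}

// PreNodeStartValues is an illustrative stand-in for the pre-node-start
// injection step that TotalNumberOfValidator is propagated to.
func PreNodeStartValues(cfg ValidatorConfig) map[string]int {
	return map[string]int{
		"mandatoryValidators": cfg.MandatoryNumberOfValidator,
		"totalValidators":     cfg.TotalNumberOfValidator, // e.g. 100
	}
}
```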