Fantom-foundation / Norma

Integration project of Carmen and Tosca
GNU Lesser General Public License v3.0

Add"Sanity Check" scenario as starting point #256

Closed · rpl-ffl closed this 2 weeks ago

rpl-ffl commented 3 weeks ago

Minimum changes required to bring main up to rapol/hotfix/consensus-pipeline

rpl-ffl commented 3 weeks ago

> it looks to me that this PR piles up a lot of workarounds - like sending the interrupt signal to the container, while container.Stop() is already doing this. If there is a strong reason for such workarounds, it should be properly documented and proven by tests that it must be done this way. Otherwise, I would prefer to live (for some time) with a system that sometimes does not shut down correctly. We generate new data all the time, right, i.e., it is not a big issue?

This was required for the consensus pipeline. Norma is used to generate events.db as an input to the consensus stage. To export the events, we need a graceful shutdown of the database. Without these changes, we sporadically end up with a dirty DB, which prevents the event export and makes the pipeline fail. Example: https://scala.fantom.network/job/Consensus/job/Sonic-event-collector/40/console.

I understand that for our purposes right now it's not a big issue, but the workaround was meant to be the fastest way to ensure we don't end up with a dirty DB.
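
For context, the shutdown sequence in question is roughly the following. This is only a sketch against a recent Docker Go SDK (github.com/docker/docker/client), not the actual diff of this PR; the helper name, container ID handling, and timeouts are placeholders:

```go
package docker // illustrative placement, not Norma's actual package layout

import (
	"context"
	"time"

	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/client"
)

// stopGracefully (hypothetical helper) asks the node for a clean shutdown by
// sending SIGINT to the container's main process, waits for it to exit, and
// only falls back to Docker's regular stop (SIGTERM, then SIGKILL) on timeout.
func stopGracefully(ctx context.Context, cli *client.Client, id string) error {
	// Give the node a chance to flush and close its database cleanly.
	if err := cli.ContainerKill(ctx, id, "SIGINT"); err != nil {
		return err
	}

	waitCtx, cancel := context.WithTimeout(ctx, 60*time.Second)
	defer cancel()

	statusCh, errCh := cli.ContainerWait(waitCtx, id, container.WaitConditionNotRunning)
	select {
	case <-statusCh:
		return nil // exited on its own; events.db should be clean
	case <-errCh:
		// wait failed or timed out: fall back to the normal stop path
	}

	timeout := 30 // seconds before Docker escalates to SIGKILL
	return cli.ContainerStop(ctx, id, container.StopOptions{Timeout: &timeout})
}
```

Whether something like this belongs in Norma at all, or rather in the container's entrypoint, is what the rest of this thread is about.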

kjezek commented 3 weeks ago


Thanks for the explanation.

I did some background reading, and it looks to me that Sonic is not shut down gracefully because it is executed as a sub-process of a shell: the signal is delivered only to the container's main process (PID 1), i.e. the shell, and never reaches Sonic itself. See here: https://docs.docker.com/reference/cli/docker/container/kill/

That means that no matter how the signal sending is reworked here, it will not solve the problem. It must be done differently.
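
To make the PID 1 point concrete: docker kill and docker stop only signal the container's main process, so whether the node binary ever sees the signal depends on how the container command is set up. A sketch in terms of the Docker Go SDK's container config (image name, binary name, and start script are placeholders, not Norma's actual setup):

```go
package docker // illustrative only

import (
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/api/types/strslice"
)

// Shell-style startup: /bin/sh is PID 1 and the node is its child process.
// docker kill/stop signal the shell, which typically does not forward the
// signal, so the node is eventually killed hard and its DB stays dirty.
var shellStyle = &container.Config{
	Image: "sonic:latest", // placeholder image name
	Cmd:   strslice.StrSlice{"/bin/sh", "-c", "./run_node.sh"},
}

// Exec-style startup: the node binary itself is PID 1 and receives the
// SIGINT/SIGTERM directly, so it can flush and close its database.
var execStyle = &container.Config{
	Image:      "sonic:latest",              // placeholder image name
	Entrypoint: strslice.StrSlice{"sonicd"}, // placeholder binary name
}
```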

rpl-ffl commented 3 weeks ago


I was going for tini -g: tini makes sure PID 1 is tini itself, and -g makes sure the child process group gets the signal.

> With the -g option, Tini kills the child process group, so that every process in the group gets the signal. This corresponds more closely to what happens when you do ctrl-C etc. in a terminal: The signal is sent to the foreground process group.

(from https://github.com/krallin/tini)

So, I can't argue whether this change around SIGINT is the best way to implement it. I just know that we no longer had a sporadic dirty DB in the 20 runs after it, and that's what the consensus pipeline needs right now.
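
For reference, the tini -g setup boils down to making tini PID 1 and letting it re-send the signal to the whole child process group. In terms of the same container config as above (again only a sketch; image and start script are placeholders, and tini has to be present in the image):

```go
package docker // illustrative only

import (
	"github.com/docker/docker/api/types/container"
	"github.com/docker/docker/api/types/strslice"
)

// tini runs as PID 1; with -g it forwards SIGINT/SIGTERM to the whole child
// process group, so the node gets the signal even when it is started
// indirectly through a shell or wrapper script.
var withTini = &container.Config{
	Image:      "sonic:latest", // placeholder image name
	Entrypoint: strslice.StrSlice{"tini", "-g", "--"},
	Cmd:        strslice.StrSlice{"/bin/sh", "-c", "./run_node.sh"}, // placeholder start script
}
```

Note that Docker's own --init flag also injects tini as PID 1, but without -g, so it only signals the direct child rather than the whole process group discussed here.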

This change, in its current form, is no longer being introduced to main; its implementation will be revisited before we attempt this again in main.