FOGProject / fos

FOG Operating System

Don't wait indefinitely on network errors #57

Closed WhiteAls closed 1 year ago

WhiteAls commented 1 year ago

If we have problems with DHCP or can't reach the FOG web server, the network script waits for user input indefinitely. In unattended scenarios (a cron-induced capture, for example) this locks the system until the user notices the problem.

In my case, the problem is often resolved by a simple reboot (seems like a spanning tree problem, but it is strange that a simple reboot fixes the problem).

I understand that the user should be able to analyze the problem by reading the on-screen log, but there is no need to wait for input indefinitely. Instead, we can use read -t 60 or less, as we will hit another 60-second timeout in the next script.
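A minimal sketch of the idea, assuming the script runs under bash (where read accepts -t). The function name pause_with_timeout, the messages, and the default of 60 seconds are hypothetical and are not taken from the actual FOS network script:

```shell
#!/bin/bash
# Hypothetical helper (not the real FOS code): wait for Enter,
# but give up after a timeout so unattended machines do not hang.
pause_with_timeout() {
    local timeout="${1:-60}"   # seconds to wait; 60 is an assumed default
    local reply
    echo "Press [Enter] to continue (continuing automatically after ${timeout} seconds)"
    # read -t returns non-zero when the timeout expires or stdin reaches EOF
    if read -r -t "$timeout" reply; then
        echo "Continuing after key press"
    else
        echo "No input received, continuing"
    fi
    return 0
}
```

Calling pause_with_timeout 60 after printing the error would keep the message on screen for up to a minute and then let the script move on, instead of blocking on a plain read.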

Sebastian-Roth commented 1 year ago

@WhiteAls I haven't checked the PRs in a while. Sorry for the long delay and thanks for bringing this up!

While I can see why this is very useful in your case, I would like to ask whether others might depend on the indefinite wait here.

If you have a persistent network issue, the machine will just keep rebooting and looping forever. Not great, but also not worse than waiting on the read forever.

The only drawback I can see so far is that people might now miss an error message shown on screen while they are away for just a few moments.

WhiteAls commented 1 year ago

Thank you for your answer! Hmm, yes, perhaps we should have discussed this question on the forum first. I think this is a rather controversial issue, and it is hard to find a single correct solution here. As you can see, I already disagree with your decision 😅

Anyway, let's consider the following scenario. I set up an image capture every Sunday at midnight. For some reason, this process gets stuck and I only discover the problem in the morning, waking up to panicked user calls 😆

Jokes aside, this raises another question: maybe we need a cron task that monitors stuck tasks (for example, a task launched through cron that hasn't completed within an hour) and cancels them, instead of a script that waits indefinitely, or for two minutes or more, so that the user can read the logs? We end up with no image either way.

Or perhaps check whether the capture was triggered by a cron task and skip the indefinite wait just in that case. I can't imagine why you would schedule a capture through a cron task and then monitor its execution in person 🤔

Or maybe just set the timeout to 10 minutes. There are plenty of options. It just depends on how much we want to complicate the system. I personally think that the simple solution is the best solution, but it's not up to me to decide here.

Sebastian-Roth commented 1 year ago

@WhiteAls I like your thinking. Keep it simple (and stupid).

Going to merge this, as I have not come up with any way this could cause trouble for other people.