Problematic operation of udpcast

Pieshka commented 1 year ago

Hello

I've been working with FOG for a while now and I work with people who use FOG, although they don't always know how it works. I've noticed that there are two big problems with how Multicast works, especially for people who don't know how to deal with them.

Desynchronisation between hosts.

Although the computers I restore (school environment) are identically configured, in computer labs with older machines (i3-4170) one of the computers sometimes stops for a long time at "Syncing..." in partclone after restoring the largest partition (the system partition), while the other computers have already passed that step and started restoring the next partitions. The computer that was left behind then never restores the third partition and hangs on a blank partclone screen. The workaround is to increase the timeout, but the same setting applies to both the first partition and the remaining partitions. I would prefer a short timeout for the first partition (e.g. 2 min) and a longer one for the rest (e.g. 3 min), as these computers can vary.

udp-sender processes running in the background.

This problem is related to the previous one. If the multicast fails, either at the very beginning or at the n-th partition, udp-sender processes keep running in the background on the server, and you have to kill them yourself from the shell. This has, in a way, already been described here, but the problem applies not only to closing a session in the FOG panel but also to situations where, due to some error, the session closes itself while the udp-sender process continues to exist. This sometimes causes problems with newly created multicast sessions.

This process has already been running since 14 January.

For me, these are not big problems, but for other computer lab supervisors, who don't understand Linux or FOG, they are pretty huge problems that they don't know how to deal with (they can't kill the udp-sender process themselves because they don't have shell access anyway). It would be good to improve the whole udp-sender calling mechanism in general. I know this is not a job for one afternoon, and I'm not good with PHP myself, so I'm not able to help. But I'm leaving this here to give at least an outline of what would be worth improving in the future.

Sebastian-Roth commented 1 year ago

@Piotr86PL Thanks for bringing this up!

Desynchronisation between hosts.

We had a related discussion on GitHub #521 just a few weeks ago. The static timeout of 10 seconds between partitions is now halfway adjustable in the latest version. Would that suit your needs? Or would it need to be fully adjustable?

udp-sender processes running in the background

Definitely something we should look into and hopefully fix soon. The main player here is the running FOGMulticastManager service. Do you have good ideas on detecting the situation described ("due to some error, the session closes itself")? I guess there are various situations that we need to cover. So let's try to collect some good ideas before we rush into implementing it.

Related topics in the forums discuss the --retries-until-drop option for udp-sender, though I am not sure this would really help us here:

https://forums.fogproject.org/topic/12756/hosts-drop-out-of-multicast-session-on-partition-switch
https://forums.fogproject.org/topic/11673/fog-multicast-sessions-what-happens-when-a-host-in-session-is-powered-off-and-what-happens-when-it-is-powered-back-on
https://forums.fogproject.org/topic/11249/uncompleted-multicast

Pieshka commented 1 year ago

Would that suit your needs?

For me personally, it would be better if I could set the timeout of the first partition and the timeout between partitions independently, because normally I would like the first-partition timeout to be e.g. 2 minutes and the timeout between partitions to be e.g. 40 seconds. Nevertheless, I will adapt to the current way of doing things. Restore problems with the first partition are really rare and mostly due to external factors, so a longer first-partition timeout as the price of a longer timeout between partitions is something I can accept.
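To make the request concrete, here is a minimal sketch of what independent timeouts could look like when the udp-sender command lines are built. It is not how FOG generates its commands; `build_sender_commands`, its parameters, the interface name, the port base and the image paths are all hypothetical. It only assumes the udp-sender options `--interface`, `--portbase`, `--min-receivers`, `--max-wait`, `--nokbd` and `--file`.

```python
import shlex


def build_sender_commands(partition_files, first_wait, between_wait,
                          receivers, interface="eth0", portbase=9000):
    """Return one udp-sender command line per partition image.

    first_wait and between_wait are seconds passed to --max-wait:
    the first partition gets first_wait, every later one between_wait.
    All names and defaults here are illustrative assumptions.
    """
    commands = []
    for index, image in enumerate(partition_files):
        max_wait = first_wait if index == 0 else between_wait
        argv = [
            "udp-sender",
            "--interface", interface,
            "--portbase", str(portbase),
            "--min-receivers", str(receivers),
            "--max-wait", str(max_wait),
            "--nokbd",
            "--file", image,
        ]
        # shlex.join quotes arguments only where the shell needs it.
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    images = ["/images/lab/d1p1.img", "/images/lab/d1p2.img"]
    for line in build_sender_commands(images, 120, 40, receivers=20):
        print(line)
```

With the values from my example (120 and 40 seconds), only the first command carries `--max-wait 120`; all following partitions get `--max-wait 40`.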

Do you have good ideas on detecting the situation described ("due to some error, the session closes itself")? I guess there are various situations that we need to cover. So let's try to collect some good ideas before we rush into implementing it.

Actually, it's hard for me to say what exactly happens when udp-sender is running in the background but there are no tasks in the FOG panel. When I wrote that post I assumed it was because of some bug. Thinking about it now, it could simply be that the computer lab supervisor himself removed the tasks that didn't start from the task list, and when I checked FOG afterwards and saw udp-sender running with no tasks present, I assumed the tasks had ended on their own due to an error. When I get a chance I will ask the supervisors whether they had any multicast problems.

If I were to name specific situations worth detecting, it would definitely be when a computer falls behind because partclone takes too long to sync a disk. On the other hand, when I witnessed such a situation, the computers should in theory have waited 12 seconds for the slower one (I had 2 set in the panel as the timeout). And yet it looked like the rest of the computers immediately started restoring the third partition; they certainly weren't waiting 12 seconds for the one where partclone was still syncing the second partition. It was as if they had all joined the session. Could it be that they all join the session while one is still sitting in partclone? I don't know how the FOS code is structured. Still, regardless of what exactly happened here (I'll investigate this further), it would be good to detect hosts that are stuck for some reason.

This could be done by having each host send a message to the server every time it has finished restoring a partition. After receiving the first such message, the server starts a timer, and if the other hosts do not send a similar message within some time, the server assumes that something has happened to them, automatically aborts the task and sends a message back to the client to reboot. This would solve all the problems where the client, for some reason, gets stuck at some stage. And with such a mechanism in place, the udp-sender invocation could be rebuilt so that the server doesn't launch all partitions up front as it does now, but only launches the next partition when it gets a message that the previous one has been restored everywhere.
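A rough sketch of the timer idea, purely to illustrate the logic. The class `PartitionBarrier` and its methods are hypothetical and exist in neither FOG nor FOS. The clock is passed in so the behaviour can be checked without actually waiting.

```python
import time


class PartitionBarrier:
    """Track which hosts have finished the current partition.

    The timer starts when the first host reports completion. Hosts that
    are still pending after grace_seconds are considered stalled.
    """

    def __init__(self, hosts, grace_seconds, clock=time.monotonic):
        self.pending = set(hosts)
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.first_done_at = None

    def report_done(self, host):
        """Record a 'partition restored' message from a host."""
        if host not in self.pending:
            return  # unknown host or duplicate message: ignore
        self.pending.discard(host)
        if self.first_done_at is None:
            self.first_done_at = self.clock()

    def all_done(self):
        """True once every host has reported; the next partition may start."""
        return not self.pending

    def stalled_hosts(self):
        """Hosts that missed the grace period since the first report."""
        if self.first_done_at is None or not self.pending:
            return set()
        if self.clock() - self.first_done_at >= self.grace_seconds:
            return set(self.pending)
        return set()
```

The server would poll `stalled_hosts()` periodically: a non-empty result means abort the task and tell those clients to reboot, while `all_done()` returning true means the udp-sender for the next partition can be launched.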

The problem with udp-sender running in the background could be fixed by having the server kill udp-sender (if it detects one running) whenever there are no multicast tasks in the queue. And while we're at it, it would be good if terminating one host's task in the panel automatically terminated all related tasks and also killed the udp-sender.
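As an illustration only, the cleanup could look roughly like the sketch below. The function names are hypothetical, the active task count would have to come from the FOG database (not shown), and the `etimes` column assumes the procps `ps` found on Linux.

```python
import os
import signal
import subprocess


def find_udp_senders(ps_output):
    """Parse 'pid etimes args' lines into (pid, age_seconds) pairs."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, age, args = fields
        executable = args.split()[0]
        # Match on the executable name so 'grep udp-sender' is not caught.
        if os.path.basename(executable) == "udp-sender":
            found.append((int(pid), int(age)))
    return found


def orphaned_senders(ps_output, active_multicast_tasks):
    """Senders count as orphaned only when no multicast task is queued."""
    if active_multicast_tasks > 0:
        return []
    return find_udp_senders(ps_output)


def kill_orphans(active_multicast_tasks):
    """Send SIGTERM to every orphaned udp-sender (needs sufficient rights)."""
    ps_output = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "args="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, _age in orphaned_senders(ps_output, active_multicast_tasks):
        os.kill(pid, signal.SIGTERM)
```

The age in seconds is returned as well, so a real implementation could log how long a process had been left running before it was killed.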

I am aware that these are not easy to implement, as they require rebuilding both FOG and FOS, plus there could be backward compatibility issues. For this reason, I encourage other readers to join the discussion and contribute their ideas.

Neustradamus commented 1 year ago

@Sebastian-Roth: Have you seen the last udpcast build?

https://www.udpcast.linux.lu/download/?C=M;O=D

And there is this:

https://github.com/brunston/nudpcast/pulls?q=is%3Apr+is%3Aclosed

Sebastian-Roth commented 1 year ago

Topic delayed: there was too little time to address it for the 1.5.10 release. Will fix it in the next one.

FOGProject / fogproject

Problematic operation of udpcast #536