dmwm / CRABServer

Publisher stop.sh fails if the Publisher process stops for longer than the next cycle #8442

Open novicecpp opened 1 month ago

novicecpp commented 1 month ago

The stop.sh script (or manage.sh stop in the new PyPI image) waits forever if it is run after the "Next cycle" time has passed and no Publisher process is running.

The script parses the timestamp from the last line of logs/log.txt:

2024-05-29 11:13:12,912:INFO:PublisherMaster,296:Next cycle will start at 11:39:01

https://github.com/dmwm/CRABServer/blob/97bf5913bd1900c229d87ab77c803e39a15743e1/src/script/Deployment/Publisher/stop.sh#L19-L23

The delta keeps decreasing as the current time (epoch) increases. With no process running, nothing updates the log line, so the script is stuck forever.

The workaround is to exec into the container and kill the /bin/bash ./stop.sh process so that Deploy_TW can continue.

novicecpp commented 1 month ago

Simply checking whether the Publisher process is still running before checking the time may resolve the issue.
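A minimal sketch of that check, not the actual stop.sh code. The function names `is_running` and `wait_while_running` are hypothetical, and it assumes the script already knows the Publisher PID (how stop.sh finds the process is not shown here).

```shell
#!/bin/bash
# Hypothetical helper: succeeds while a process with the given PID exists
# and can be signalled by the current user.
is_running() {
    local pid="$1"
    [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null
}

# Hypothetical wait loop: returns as soon as the process is gone,
# regardless of what the "Next cycle" log line says.
wait_while_running() {
    local pid="$1" poll_seconds="${2:-10}"
    while is_running "$pid"; do
        sleep "$poll_seconds"
    done
}
```

With a check like this in front of the time comparison, a stale "Next cycle" line in logs/log.txt can no longer keep the script waiting once the process is gone.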

belforte commented 1 month ago

That's surely good! But maybe we can also introduce a timeout. Originally the idea was that at times a single Publisher iteration could take hours, and we did not want a timeout + hard kill. Publisher was not designed to be stateless and idempotent, so I opted for "safety". But back then stop was only issued by an operator. Now that we are trying to have something automatic, and have enough experience, we can probably try a timeout of 2h.
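A rough sketch of what a 2h timeout could look like, under the same assumption that the Publisher PID is known. The name `wait_for_exit` and its parameters are hypothetical and not taken from stop.sh. What to do on timeout (hard kill or just report) is left to the caller.

```shell
#!/bin/bash
# Hypothetical: wait for PID to exit, giving up after timeout_seconds.
# Returns 0 if the process exited, 1 if it is still alive at the deadline.
wait_for_exit() {
    local pid="$1" timeout_seconds="${2:-7200}" poll_seconds="${3:-10}"
    local deadline=$(( $(date +%s) + timeout_seconds ))
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1   # timed out, process still running
        fi
        sleep "$poll_seconds"
    done
    return 0
}

# Example use (PUBLISHER_PID is an assumed variable):
#   wait_for_exit "$PUBLISHER_PID" 7200 || echo "Publisher still running after 2h"
```

The default of 7200 seconds matches the 2h suggested above. The poll interval is an arbitrary choice.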