Open novicecpp opened 1 month ago
Simply checking whether the Publisher process is still running before comparing timestamps may resolve the issue.
That's surely good! But maybe we can also introduce a timeout. Originally, the idea was that a single Publisher iteration could at times take hours, and we did not want a timeout plus a hard kill. Publisher was not designed to be stateless and idempotent, so I opted for "safety". But back then, stop was only ever issued by an operator. Now that we are trying to make this automatic and have enough experience, we can probably try a 2h timeout.
The stop.sh script (or manage.sh stop in the new PyPI image) will wait forever if it is run after the "Next cycle" time has passed and no Publisher process is running.
The script parses the timestamp from the last log line of `logs/log.txt`: https://github.com/dmwm/CRABServer/blob/97bf5913bd1900c229d87ab77c803e39a15743e1/src/script/Deployment/Publisher/stop.sh#L19-L23
The `delta` gradually decreases as the current time (epoch) increases. With no process running, nothing updates the last log line, so the script is stuck forever. The workaround is to exec into the container and kill `/bin/bash ./stop.sh` so that Deploy_TW can continue.
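
The two suggestions in this thread (check for a running process first, and give up after a timeout) could be combined roughly as in the sketch below. This is not the actual stop.sh. The function names, the `PUBLISHER_PATTERN` value, and the use of `pgrep` are assumptions made for illustration, and `publisher_is_idle` is only a placeholder for the existing log-timestamp check.

```bash
#!/bin/bash
# Sketch only: every name below is hypothetical, not taken from the real stop.sh.

# Assumed pattern for finding Publisher in the process list; adjust as needed.
PUBLISHER_PATTERN="${PUBLISHER_PATTERN:-RunPublisher}"
# 2h timeout, as proposed in the discussion.
STOP_TIMEOUT="${STOP_TIMEOUT:-7200}"
# Seconds to sleep between checks.
POLL_INTERVAL="${POLL_INTERVAL:-10}"

# Succeeds (exit status 0) if a process matching the pattern exists.
# pgrep -f matches against the full command line.
publisher_running() {
    pgrep -f "$PUBLISHER_PATTERN" > /dev/null
}

# Placeholder for the existing check that parses the last line of
# logs/log.txt. It should succeed when Publisher is between cycles.
# Here it always fails, i.e. "never idle".
publisher_is_idle() {
    return 1
}

# Returns 0 when it is safe to proceed (no process, or Publisher idle),
# and 1 when the timeout expired with Publisher still busy.
wait_until_safe_to_stop() {
    local deadline=$(( $(date +%s) + STOP_TIMEOUT ))
    while publisher_running; do
        if publisher_is_idle; then
            return 0
        fi
        if [ "$(date +%s)" -ge "$deadline" ]; then
            echo "Publisher still busy after ${STOP_TIMEOUT}s" >&2
            return 1
        fi
        sleep "$POLL_INTERVAL"
    done
    # Reached only when no matching process exists, so there is
    # nothing that could ever update the log line.
    echo "No Publisher process found; nothing to wait for."
    return 0
}
```

A caller would run `wait_until_safe_to_stop` and then decide what to do on a non-zero return, for example a hard kill after the 2h timeout. Checking the process inside the loop condition also covers the case where Publisher exits while the script is waiting.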