dmwm / CRABServer

15 stars 38 forks source link

add timeout for TW actions #8350

Closed belforte closed 2 weeks ago

belforte commented 5 months ago

with reference to https://cms-talk.web.cern.ch/t/crab-task-queued-on-command-submit-for-a-while/39714 we have no protection against the situation where TW slaves get stuck for many hours.

So far this has been rare, but we can't risk users to keep submitting stalled tasks and make system grind to halt.

Can start with e.g. 1hour timeout and monitor

belforte commented 5 months ago

@novicecpp do you agree to take this once CI/CD is done ? Should not be too much work, but I do not have a specific implementation plan in mind.

belforte commented 5 months ago

related to #7720

novicecpp commented 5 months ago

Yes. I can but not immediately after CI. I need to go back to PyPI to make Publisher work with PyPI image.

belforte commented 5 months ago

PR https://github.com/dmwm/CRABServer/pull/8352 is related to this

novicecpp commented 4 months ago

@belforte do you have any implementation idea in mind?

My idea is to kill worker process entirely if it reach timeout threshold.

There is 2 ways to achieve this:

Acknowledge: Thanks Dario for the brainstorming!

belforte commented 4 months ago

I had no ideas ! firsrt way seems better than the second. But I was hoping that we could act inside the Handler. And keep it simpler.

Something like this maybe https://python.pages.doc.ic.ac.uk/2021/lessons/decorator/03-application/02-timeout.html ? The good thing is that relies on standard python signal.alarm(time) See also https://pypi.org/project/timeout-decorator/ (I just googled things up now).

Anyhow, both of you surely know more than me !

Unrelated, using concurrent.futures is appealing, but I a bit worried about the amount of work. Yet if it allows to fork a process for each new task instead of slaves which persists for days it may be worth.

novicecpp commented 4 months ago

I did not consider signal option because it is not work well with multiprocessing, most of solution I found on the google required some hackish approach (for example).

novicecpp commented 2 weeks ago

To-do:

belforte commented 2 weeks ago

let's move enable and test to a different issue