Closed belforte closed 2 weeks ago
@novicecpp do you agree to take this once CI/CD is done ? Should not be too much work, but I do not have a specific implementation plan in mind.
related to #7720
Yes. I can but not immediately after CI. I need to go back to PyPI to make Publisher work with PyPI image.
PR https://github.com/dmwm/CRABServer/pull/8352 is related to this
@belforte do you have any implementation idea in mind?
My idea is to kill worker process entirely if it reach timeout threshold.
There is 2 ways to achieve this:
max_workers=1
) and use result timeout (similar to join()
in multiprocessing
) as a signal to terminate the child. Simple enough, in exchange for the extra overhead of spawning threads (also, later we can refactor multiprocessing part by replacing it with concurrent.futures at MasterWorker)Acknowledge: Thanks Dario for the brainstorming!
I had no ideas ! firsrt way seems better than the second. But I was hoping that we could act inside the Handler. And keep it simpler.
Something like this maybe https://python.pages.doc.ic.ac.uk/2021/lessons/decorator/03-application/02-timeout.html ? The good thing is that relies on standard python signal.alarm(time) See also https://pypi.org/project/timeout-decorator/ (I just googled things up now).
Anyhow, both of you surely know more than me !
Unrelated, using concurrent.futures
is appealing, but I a bit worried about the amount of work. Yet if it allows to fork a process for each new task instead of slaves which persists for days it may be worth.
I did not consider signal option because it is not work well with multiprocessing
, most of solution I found on the google required some hackish approach (for example).
To-do:
let's move enable and test to a different issue
with reference to https://cms-talk.web.cern.ch/t/crab-task-queued-on-command-submit-for-a-while/39714 we have no protection against the situation where TW slaves get stuck for many hours.
So far this has been rare, but we can't risk users to keep submitting stalled tasks and make system grind to halt.
Can start with e.g. 1hour timeout and monitor