Closed belforte closed 1 month ago
I manually set the task status of [1] to "submitfailed" in the DB
[1]
240517_215114:piperov_crab_SONIC_MiniAOD_CRAB_testPurdue807_1x4900-1000j
240517_214911:piperov_crab_SONIC_MiniAOD_CRAB_testPurdue806_1x4900-100j
I am testing adding this to my task configuration:
config.Debug.extraJDL = ['Requirements = regexp("bell",Machine)==False']
Interestingly, in DagmanSubmitter, just before calling schedd.submit(dagAd,...), the dagAd object contains both of these:
(Pdb) dagAd['Requirements']
true || false
(Pdb) dagAd['Requirements ']
' regexp("bell",Machine)==False'
(Pdb)
The former comes from https://github.com/dmwm/CRABServer/blob/fa964382216c9ecac345927afc8ff6b02eee0cb2/src/python/TaskWorker/Actions/DagmanSubmitter.py#L490 and I have no idea what it is supposed to mean, but since Brian B. put it there in the original version, it must be needed, or at least harmless. Maybe it is a way to say "no requirements from here", since the schedd will anyhow add requirements based on site, memory, cores...
The latter appears to contain an extra space and extra single quotes.
Indeed, if I do print(dagAd) I get:
[...]
CRAB_UserHN = "belforte";
'Requirements ' = " regexp(\"bell\",Machine)==False";
CRAB_UserRole = undefined;
[...]
CRAB_AsyncDest = "T2_CH_CERN";
Requirements = true || false;
CRAB_UserDN = "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=belforte/CN=373708/CN=Stefano Belforte";
[...]
and when I execute the submit line I get
-> clusterId = schedd.submit(dagAd, 1, True, resultAds)
(Pdb) n
terminate called after throwing an instance of 'boost::python::error_already_set'
./start.sh: line 65: 24746 Aborted (core dumped) python3 -m pdb ${COMMAND_DIR}/SequentialWorker.py ${CONFIG} --logDebug
which IMHO suggests that setting Requirements twice triggers the error_already_set in the C++ side of the HTCondor API, which ends in an abort and core dump, not a Python exception. I do not know if there's a way to catch that.
hmm.. not really. The HTCondor documentation clearly states that if multiple Requirements classAds are present in the submission, only the last one is used. So the boost::python::error_already_set must be something else.
I am also quite puzzled that the single quotes made it into the classAd value. The code which handles the extraJDL is here: https://github.com/dmwm/CRABServer/blob/fa964382216c9ecac345927afc8ff6b02eee0cb2/src/python/TaskWorker/Actions/DagmanSubmitter.py#L89-L105
If I remove the extra spaces around the = in the extraJDL, things "work", in the sense that CRAB overrides the user's attempt to redefine Requirements with "our" ones, as by design, and there is no error. This can be achieved by stripping leading/trailing spaces from adName and adVal here:
https://github.com/dmwm/CRABServer/blob/fa964382216c9ecac345927afc8ff6b02eee0cb2/src/python/TaskWorker/Actions/DagmanSubmitter.py#L100
which makes the code a bit more robust.
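A minimal sketch of that sanitization (the parse_extra_jdl helper and its split logic are illustrative only; the real DagmanSubmitter code linked above is structured differently):

```python
def parse_extra_jdl(entry):
    """Split a 'Name = Value' extraJDL entry, stripping surrounding
    whitespace so that 'Requirements = ...' does not yield the bogus
    attribute name 'Requirements ' (note the trailing space)."""
    # partition() splits at the FIRST '=', so '==' inside the value is safe
    adName, _, adVal = entry.partition('=')
    return adName.strip(), adVal.strip()

name, val = parse_extra_jdl('Requirements = regexp("bell",Machine)==False')
# name is the clean 'Requirements', which CRAB can then override as designed
```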
Still I do not know how to possibly catch the boost error :-(
Do we need to resurrect "calling schedd.submit in a forked subprocess" ?
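A sketch of what resurrecting that approach could look like (submit_safely, the placeholder clusterId and the timeout value are all illustrative; this assumes the Linux fork start method, not the actual CRAB code):

```python
import multiprocessing

def _submit_in_child(conn):
    # Hypothetical stand-in: the real code would call
    # schedd.submit(dagAd, 1, True, resultAds) here. If the HTCondor
    # C++ layer aborts, only this child dies; the parent merely sees
    # a non-zero exitcode instead of being killed itself.
    clusterId = 42  # placeholder result
    conn.send(clusterId)
    conn.close()

def submit_safely(timeout=60):
    parent_conn, child_conn = multiprocessing.Pipe()
    p = multiprocessing.Process(target=_submit_in_child, args=(child_conn,))
    p.start()
    p.join(timeout)  # do not wait forever on a hung submit
    if p.is_alive():
        p.terminate()
        p.join()
        raise RuntimeError("schedd.submit timed out")
    if p.exitcode != 0:
        raise RuntimeError(f"schedd.submit crashed, exitcode={p.exitcode}")
    return parent_conn.recv()
```

The point of the design is that an abort/core dump in the C++ bindings becomes an ordinary Python exception in the parent.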
I have no problem with the user's attempt to set Requirements being ignored; documentation clearly states that extraJDL is for experts only. In this case the people who tried to use it did not know how glideinWMS works.
Indeed even the simple config.Debug.extraJDL = ['This = that'] results in boost::python::error_already_set. Same for config.Debug.extraJDL = ['This = "that"'].
Conclusion: spaces around the = sign are what kills us, and they need to be sanitized regardless of the very special use case which started this issue.
As to catching the fact that a slave aborts and the task is left in QUEUED, the only info we have atm are these lines in nohup.out, i.e. the stdout/stderr of the master worker:
[crab3@crab-prod-tw01 TaskManager]$ cat nohup.out
[...usual messages about jwt and OpenSSL..]
terminate called after throwing an instance of 'boost::python::error_already_set'
terminate called after throwing an instance of 'boost::python::error_already_set'
[crab3@crab-prod-tw01 TaskManager]$
We can possibly monitor-for or otherwise discover dead workers, but a restart will not fix things (the QUEUED tasks will simply be processed again), and checking logs to find which task was being processed and what went wrong will take a lot of human time.
One way I can think of now is to change MasterWorker to fork a separate process for each task, up to a maximum number of concurrent ones, like we do in PublisherMaster, instead of the current fixed pool of slaves which get work from a shared queue. That's quite a change, but it may help in other ways too.
Another way is to check if a slave is alive inside the master loop, and record somewhere the current task of each slave. I do not know how to do it atm, but it could be possible: TW slaves are instances of multiprocessing.Process, like the MasterPublisher ones, for which we already use the is_alive method.
Checking on the health of TW slaves is something we have "needed to do" since forever; the urgency comes and goes as the problems which make them die come and go.
@novicecpp @mapellidario I will welcome your suggestions.
!!! Sorry Marco Mambelli if you were spammed.
I had experience with subprocess.Popen, but I never really used multiprocessing, so it took me a while to come up with a proper suggestion.
Long story short: I think that in Worker.py we need to regularly loop over every process in self.pool and call:

- p.join(0.1): join with a timeout, see [1]. It does not block forever; it blocks for at most 0.1 seconds. If the process is still running nothing happens; if the process finished, then p.is_alive() and p.exitcode are updated.
- p.is_alive(): this returns a consistent result only after a join!

[1] https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.join
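A runnable sketch of such a loop (self.pool is mimicked by a plain list here, and _doomed_slave is a stand-in for a slave that crashes; none of these names are the actual Worker.py attributes):

```python
import multiprocessing
import time

def _doomed_slave():
    # simulate a slave that dies abruptly
    raise SystemExit(1)

def reap_pool(pool):
    """Return (pid, exitcode) for every process that is no longer alive."""
    dead = []
    for p in pool:
        p.join(0.1)  # blocks at most 0.1 s; refreshes is_alive()/exitcode
        if not p.is_alive():
            dead.append((p.pid, p.exitcode))
    return dead

pool = [multiprocessing.Process(target=_doomed_slave) for _ in range(2)]
for p in pool:
    p.start()
time.sleep(1)  # give the processes time to die, as noted below
for pid, exitcode in reap_pool(pool):
    print(f"slave pid={pid} died, exitcode={exitcode}")
```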
I think that is_alive can be trusted, but you need to put some delay between these lines to give the process time to die:

```python
os.kill(pkill, signal.SIGKILL)
print("check if a process died")
for p in pool:
    print(f"  died(0)? pid={p.pid}; alive={p.is_alive()}; exitcode={p.exitcode}")
```

I added a 1 sec sleep and it is OK, if I am not mistaken.

All in all, I use is_alive in Publisher and it works even without a join.
Sorry if I was not clear: I want feedback on a plan, not to converge on all implementation details right away. I know how to start processes; I am not sure how exactly to change the Worker.py code, which we have not touched "since ever", without too much risk. And I do not know a good way to tell which task was being worked on when a process crashed.
If the plan is "let's invest time to make sure that we detect dead workers", then I am all in.
If you need help with a plan for "a good way to tell which task was being worked on when a process crashed", or for "I am not sure how exactly to change the Worker.py code which we did not touch 'since ever' w/o too much risk", then sorry, I am not sure how to do it without giving feedback that will be rejected as "implementation details".
in other words: it will be a bit of work and require a lot of testing. If we decide to do it, let's open a new issue and discuss details there.
Adding detection of dead workers, e.g. in https://github.com/dmwm/CRABServer/blob/40a796ffe3d1c7b1b2cddf6a6e74c3b08e2b5a3a/src/python/TaskWorker/Worker.py#L246 by adding an is_alive check, should be easy and, as long as we simply record a message, safe!
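A sketch of such a check (check_slaves and the current_tasks bookkeeping dict are hypothetical names for illustration; the real Worker.py internals differ):

```python
import logging
import multiprocessing

def check_slaves(pool, current_tasks, logger):
    """Log any slave that died, together with the task it was handling.
    current_tasks is a hypothetical dict mapping slave pid -> task name,
    updated whenever a slave picks work from the shared queue."""
    for p in pool:
        p.join(0.05)  # short join so is_alive()/exitcode are reliable
        if not p.is_alive():
            task = current_tasks.get(p.pid, "<unknown>")
            logger.error("slave pid=%s died (exitcode=%s) while working on %s",
                         p.pid, p.exitcode, task)
```

Recording the current task per pid is what would let us find the culprit task without digging through logs by hand.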
What about handling the task that causes the child process to crash?
One way I can think of now is to change MasterWorker to fork a separate process for each task, up to a maximum number of concurrent ones, like we do in PublisherMaster, instead of the current fixed pool of slaves which get work from a shared queue.
I support this idea; it also solves https://github.com/dmwm/CRABServer/issues/8350.
We can divide it into 2 steps: the first is to wrap the work() in Worker.py so that it executes in a grandchild process. If this works well, refactor the code in MasterWorker to spawn a child for each task.
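A sketch of step 1 (work(), run_in_grandchild and their signatures are illustrative, not the actual Worker.py code; assumes the Linux fork start method):

```python
import multiprocessing

def work(task_name):
    # stand-in for the real TaskWorker action; a hard crash here only
    # kills the grandchild, not the slave that spawned it
    return f"processed {task_name}"

def _run(queue, task_name):
    queue.put(work(task_name))

def run_in_grandchild(task_name, timeout=3600):
    """Execute work() in a dedicated process, turning an abort/core
    dump into a bad exitcode and giving us a per-task timeout for free."""
    queue = multiprocessing.Queue()
    p = multiprocessing.Process(target=_run, args=(queue, task_name))
    p.start()
    p.join(timeout)
    if p.is_alive():
        p.terminate()
        p.join()
        raise TimeoutError(f"task {task_name} exceeded {timeout}s")
    if p.exitcode != 0:
        raise RuntimeError(f"task {task_name} crashed, exitcode={p.exitcode}")
    return queue.get()
```

This also addresses the timeout concern mentioned later: the same join(timeout) that detects a crash doubles as a per-task time limit.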
One concern is about performance: it will fork a grandchild for every task, which costs a lot of CPU overhead.
It's also true that TW actions take at least a few seconds, so spawning a new process for every task may not be that much more CPU usage in the grand scheme of things.
We discussed this in the meeting and agreed on Wa's suggestion: we can go straight to forking one child to handle each task, and take care of timeouts at the same time. Will open an ad-hoc issue.
I am closing this one.

See e.g. https://s3.cern.ch/crabcache_prod/piperov/240517_215114%3Apiperov_crab_SONIC_MiniAOD_CRAB_testPurdue807_1x4900-1000j/twlog?AWSAccessKeyId=e57ff634b5334df9819ae3f956de5ca6&Signature=xK9VdrPWqew0wCVP2IQ3l5wPkGM%3D&Expires=1718830467
which ends abruptly: the slave died at that time and the task was left in QUEUED forever https://cmsweb.cern.ch/crabserver/ui/task/240517_215114%3Apiperov_crab_SONIC_MiniAOD_CRAB_testPurdue807_1x4900-1000j

A task stuck in QUEUED is bad, and a crashed slave is worse!
The submission failure (shown in the Task Info tab of the UI) was due to a bad extraJDL argument on the user side, in the user config (as privately reported by @kpedro88).