Closed amaltaro closed 1 year ago
@khurtado Kenyi, can you please prove me wrong on this ticket?
Given that you have been working closely with this, could you please also clarify the definition/difference of:
1) JobExitCode
: is it even set? It is also not documented in cms-htcondor-es
2) ExitCode
: set by cms-htcondor-es?
3) Chirp_WMCore_cmsRun_ExitCode
: this one is under our control and it (is supposed to) says whether all of the cmsRun steps succeeded or not.
Hi @amaltaro,
I can try to answer the first two of your questions.
JobExitCode
exists in some records (see dashboard, for example over the last month the field existed in 5% of completed jobs), but as you correctly pointed out it is not set in cms-htcondor-es
. However, it is used to derive ExitCode
field (see code).ExitCode
is set by cms-htcondor-es
(see this line), but based on the if
statement below it seems that the field can also be set before reaching cms-htcondor-es
and later is renamed to CondorExitCode
. Based on the data about the completed jobs over the last month this CondorExitCode
field existed majority of the arriving information meaning that it is most often set before (see dashboard&_a=(columns:!(CondorExitCode),filters:!(),index:'0f3117a0-d2e6-11ed-bdae-e3cf8c07f2fc',interval:auto,query:(language:lucene,query:'exists:CondorExitCode'),sort:!()))).@amaltaro I can see different scenarios.
JobExitCode
is set by CRAB, not by us.cms-htcondor-es
renames it to CondorExitCode
. Then it proceeds to set its own ExitCode
based on the commonExitCode
function which relies on classads we (CRAB/WMAgent) sent via chirp.Chirp_WMCore_cmsRun_ExitCode: Yes, this is under our control.
From what I see, there is a few things we need to fix.
We should not set ExitCode=0
(via cms-htcondor-es
) even if we have several Chirp_WMCore_cmsRunX_ExitCode=0
unless CondorExitCode
also exists and is also 0.
Regarding CondorExitCode
existing, here is a link&_a=(columns:!(CondorExitCode,Chirp_WMCore_cmsRun_ExitCode,Chirp_WMCore_cmsRun1_ExitCode,Chirp_WMCore_cmsRun2_ExitCode,ExitCode),filters:!(),index:'0f3117a0-d2e6-11ed-bdae-e3cf8c07f2fc',interval:auto,query:(language:lucene,query:'Chirp_WMCore_cmsRun_ExitCode:0%20AND%20Chirp_WMCore_cmsRun2_ExitCode:0%20AND%20!!exists:CondorExitCode'),sort:!())) of jobs that had Chirp_WMCore_cmsRUnX_ExitCode=0
, but CondorExitCode
did not exist. To me, this means the wrapper did not get to finish for some reason, after cmsRun. Looking at the RemoveReason
of those jobs, they show:
Python-initiated action. (by user cmst1)
The system macro SYSTEM_PERIODIC_REMOVE expression '(((NumJobStarts > 9) =?= true) || ((NumShadowStarts > 19) =?= true) || (DiskUsage > ((RequestCpus * 20000000) + 7000000)))' evaluated to TRUE Which means even though we managed to run a lot (or maybe even all) of the cmsRun steps, the job was killed either manually or by the schedd, hence they are removed jobs, not successful jobs.
Regarding CondorExitCode
needing to be 0, here is another link&_a=(columns:!(CondorExitCode,Chirp_WMCore_cmsRun_ExitCode,Chirp_WMCore_cmsRun1_ExitCode,Chirp_WMCore_cmsRun2_ExitCode,ExitCode),filters:!(),index:'0f3117a0-d2e6-11ed-bdae-e3cf8c07f2fc',interval:auto,query:(language:lucene,query:'Chirp_WMCore_cmsRun_ExitCode:0%20AND%20Chirp_WMCore_cmsRun2_ExitCode:%3E0%20AND%20CondorExitCode:%3E0'),sort:!())) of jobs that had ExitCode=0, Chirp_WMCore_cmsRun_ExitCode=0
but CondorExitCode!=0
. These are failed jobs and we should modify commonExitCode()
to report something other than 0 for that case, based on CondorExitCode
. The reason is the full wrapper didn't get to finish for some reason, so the full job per-se was not successful.
Due to such complexity of exit code placement would be more beneficial to setup all individual exit codes in different places and let CMS Monitoring define proper logic of final exit code? For instance, as far as provided examples we have WMCore exit codes (for individual steps), we have CRAB one, we have HTCondor ones, etc. So if all of them are presented in document which we push to CMS Monitoring (or CMS Monitoring read it as classAds) then it seems more easier and natural to define final exit code within CMS Monitoring such that we can place all necessary logic and conditions over there. Is it better solution in this case?
@vkuznet Yes, there is such logic here in cms-htcondor-es
but we have only been using the CRAB/WMCore and ExitCode classads, AFAIK. In my opinion, incorporating the condor classad CondorExitCode
into this logic would most likely solve our issues without creating this post cmsRun step proposed in the issue. @amaltaro what do you think? this would not really fix the Chirp_WMCore_cmsRun_ExitCode
issue but it would translate in a more accurate ExitCode
which is what we need at the end to categorize successful vs failed/removed jobs.
Hi everyone, thank you for your feedback!
Yes, it looks like we have to incorporate CondorExitCode
as part of the logic deciding whether a job succeeded or not - from the monitoring perspective. Thus not relying solely on the ExitCode
or on the many Chirp_WMCore_cmsRunX_ExitCode
attributes.
Let me throw a different idea though. Jobs that exit condor are finally evaluated by the JobAccountant component in the agent. What if we had the ability to send a final and reliable JobExitCode
for each job that we are parsing? This way we would be in sync with how CRAB does this as well. I have to say though, I do not know what the effort it would take and whether we could make an asynchronous update to a given condor job id.
In my opinion, incorporating the condor classad
CondorExitCode
into this logic would most likely solve our issues without creating this post cmsRun step proposed in the issue.
@khurtado @amaltaro CondorExitCode
is already incorporated into the logic of CMS Monitoring ExitCode
because commonExitCode(ad)
uses ExitCode
field which is only later renamed to CondorExitCode
. I think the reason why in the dashboard from @khurtado we have jobs where ExitCode=0
but CondorExitCode!=0
is because the commonExitCode(ad)
function first checks for JobExitCode
, Chirp_CRAB3_Job_ExitCode
and Chirp_WMCore_cmsRun_ExitCode
values and only if these are not set it proceeds to take value from ExitCode
(CondorExitCode
). So in Kenyi's example we simply found a 0
code before even reaching CondorExitCode
. Maybe then the order of these checks should be changed?
If we can't have the same set of exit codes in both production and CRAB, I think the cleanest approach would be to separate them in condor monitoring. Identify if the job is production or crab, and deal with the exit codes accordingly. It would make the code much more readable IMO.
@nikodemas @amaltaro Thanks! So, we basically need to start with the fact that if ad.get("ExitCode")
is None
, we return 50666
; even if the other production/CRAB classads exist. Then, to make it cleaner, we follow @amaltaro suggestion to identify production vs CRAB classads and follow the logic according to each case, in order to find the final exit code (for instance in production, even if the chirp classads are all 0s, we want to report the raw condor ExitCode if this is anything other than 0, because that means stageOut or something else failed after some of all of the cmsRun executables finished).
This issue has been addressed with: https://github.com/dmwm/cms-htcondor-es/pull/212 and https://github.com/dmwm/cms-htcondor-es/pull/210
Thanks Kenyi!
Impact of the bug WMAgent (also in production at the moment)
Describe the bug As pointed out in this discussion: https://github.com/dmwm/cms-htcondor-es/pull/210#issuecomment-1628382134
there is a scenario where a failed job will report
Chirp_WMCore_cmsRun_ExitCode=0
. It can happen to jobs that have completed at least 1 cmsRun step (successfully) and then get suddenly removed.The reason for that is that every time a cmsRun step completes, we chirp the exit code of the step
Chirp_WMCore_cmsRunXXX_ExitCode
and the job exit codeChirp_WMCore_cmsRun_ExitCode
, as can be seen in (even though the job is not yet complete: https://github.com/vkuznet/WMCore/blob/c57e5e5eff4a922217fd8a44da6c16c3627089dc/src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py#L52C9-L52C19To illustrate this problem, imagine that a job with 3 steps is progressing like:
When this job classads get parsed by the HTCondor monitoring script: https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py
it would be marked as successful, given that
Chirp_WMCore_cmsRun_ExitCode
is 0.we seem to have a regression injected by this PR: https://github.com/dmwm/WMCore/pull/11581
where a multi step job that fail will still report
Chirp_WMCore_cmsRun_ExitCode=0
, instead of the actual first non-0 exit code.This assumption above is wrong only if we actually end up with
JobExitCode != 0
, but looking into the classads I cannot even find it for the jobs.How to reproduce it Hard fail (remove) a StepChain job after it succeeded at least cmsRun1.
Expected behavior The solution for this would be to start a job with Chirp_WMCore_cmsRun_ExitCode !=0 and only set it to 0 after all the job steps have successfully executed.
The problem is that each step has its own context - they are independent - so we would likely need to have a post cmsRun steps processing that would parse all of the relevant step reports and decide the exit code for Chirp_WMCore_cmsRun_ExitCode, chirping it one last time.
Additional context and error message Here is an opensearch monit dashboard that I was playing with: https://es-cms.cern.ch/dashboards/goto/e60ade2f5aa7415384270dd437337ccf?security_tenant=global
which BTW does not have any exit code for cmsRun5, cmsRun6 and cmsRun7. I did check the job logs and I can find them being chirp'ed. Anyways, that's a different problem.