Chirp_WMCore_cmsRun_ExitCode can be wrong for removed multi-step jobs

amaltaro commented 1 year ago

Impact of the bug WMAgent (also in production at the moment)

Describe the bug As pointed out in this discussion: https://github.com/dmwm/cms-htcondor-es/pull/210#issuecomment-1628382134

there is a scenario where a failed job will report Chirp_WMCore_cmsRun_ExitCode=0. It can happen to jobs that have completed at least 1 cmsRun step (successfully) and then get suddenly removed.

The reason for that is that every time a cmsRun step completes, we chirp the exit code of the step Chirp_WMCore_cmsRunXXX_ExitCode and the job exit code Chirp_WMCore_cmsRun_ExitCode, as can be seen in (even though the job is not yet complete: https://github.com/vkuznet/WMCore/blob/c57e5e5eff4a922217fd8a44da6c16c3627089dc/src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py#L52C9-L52C19

To illustrate this problem, imagine that a job with 3 steps is progressing like:

# running step 1 - cmsRun1
Chirp_WMCore_cmsRun1_ExitCode=0
Chirp_WMCore_cmsRun_ExitCode=0
# running step 2 - cmsRun2
Chirp_WMCore_cmsRun2_ExitCode=0
Chirp_WMCore_cmsRun_ExitCode=0
# running step 3 - cmsRun3 - but then the batch system kills it or something else condor_rm it
# which means, we don't chirp anything else for this job

When this job classads get parsed by the HTCondor monitoring script: https://github.com/dmwm/cms-htcondor-es/blob/master/src/htcondor_es/convert_to_json.py

it would be marked as successful, given that Chirp_WMCore_cmsRun_ExitCode is 0.

we seem to have a regression injected by this PR: https://github.com/dmwm/WMCore/pull/11581

where a multi step job that fail will still report Chirp_WMCore_cmsRun_ExitCode=0, instead of the actual first non-0 exit code.

This assumption above is wrong only if we actually end up with JobExitCode != 0, but looking into the classads I cannot even find it for the jobs.

How to reproduce it Hard fail (remove) a StepChain job after it succeeded at least cmsRun1.

Expected behavior The solution for this would be to start a job with Chirp_WMCore_cmsRun_ExitCode !=0 and only set it to 0 after all the job steps have successfully executed.

The problem is that each step has its own context - they are independent - so we would likely need to have a post cmsRun steps processing that would parse all of the relevant step reports and decide the exit code for Chirp_WMCore_cmsRun_ExitCode, chirping it one last time.

Additional context and error message Here is an opensearch monit dashboard that I was playing with: https://es-cms.cern.ch/dashboards/goto/e60ade2f5aa7415384270dd437337ccf?security_tenant=global

which BTW does not have any exit code for cmsRun5, cmsRun6 and cmsRun7. I did check the job logs and I can find them being chirp'ed. Anyways, that's a different problem.

amaltaro commented 1 year ago

@khurtado Kenyi, can you please prove me wrong on this ticket?

Given that you have been working closely with this, could you please also clarify the definition/difference of: 1) JobExitCode: is it even set? It is also not documented in cms-htcondor-es 2) ExitCode: set by cms-htcondor-es? 3) Chirp_WMCore_cmsRun_ExitCode: this one is under our control and it (is supposed to) says whether all of the cmsRun steps succeeded or not.

nikodemas commented 1 year ago

Hi @amaltaro,

I can try to answer the first two of your questions.

Looking at our indexes the JobExitCode exists in some records (see dashboard, for example over the last month the field existed in 5% of completed jobs), but as you correctly pointed out it is not set in cms-htcondor-es. However, it is used to derive ExitCode field (see code).
ExitCode is set by cms-htcondor-es (see this line), but based on the if statement below it seems that the field can also be set before reaching cms-htcondor-es and later is renamed to CondorExitCode. Based on the data about the completed jobs over the last month this CondorExitCode field existed majority of the arriving information meaning that it is most often set before (see dashboard&_a=(columns:!(CondorExitCode),filters:!(),index:'0f3117a0-d2e6-11ed-bdae-e3cf8c07f2fc',interval:auto,query:(language:lucene,query:'exists:CondorExitCode'),sort:!()))).

khurtado commented 1 year ago

@amaltaro I can see different scenarios.

JobExitCode is set by CRAB, not by us.
ExitCode is set first by HTCondor. If this exists then cms-htcondor-es renames it to CondorExitCode. Then it proceeds to set its own ExitCode based on the commonExitCode function which relies on classads we (CRAB/WMAgent) sent via chirp.
Chirp_WMCore_cmsRun_ExitCode: Yes, this is under our control.

From what I see, there is a few things we need to fix.

We should not set ExitCode=0 (via cms-htcondor-es) even if we have several Chirp_WMCore_cmsRunX_ExitCode=0 unless CondorExitCode also exists and is also 0.
- Regarding CondorExitCode existing, here is a link&_a=(columns:!(CondorExitCode,Chirp_WMCore_cmsRun_ExitCode,Chirp_WMCore_cmsRun1_ExitCode,Chirp_WMCore_cmsRun2_ExitCode,ExitCode),filters:!(),index:'0f3117a0-d2e6-11ed-bdae-e3cf8c07f2fc',interval:auto,query:(language:lucene,query:'Chirp_WMCore_cmsRun_ExitCode:0%20AND%20Chirp_WMCore_cmsRun2_ExitCode:0%20AND%20!!exists:CondorExitCode'),sort:!())) of jobs that had Chirp_WMCore_cmsRUnX_ExitCode=0, but CondorExitCode did not exist. To me, this means the wrapper did not get to finish for some reason, after cmsRun. Looking at the RemoveReason of those jobs, they show:
  - Python-initiated action. (by user cmst1)
  - The system macro SYSTEM_PERIODIC_REMOVE expression '(((NumJobStarts > 9) =?= true) || ((NumShadowStarts > 19) =?= true) || (DiskUsage > ((RequestCpus * 20000000) + 7000000)))' evaluated to TRUE Which means even though we managed to run a lot (or maybe even all) of the cmsRun steps, the job was killed either manually or by the schedd, hence they are removed jobs, not successful jobs.
  - Regarding CondorExitCode needing to be 0, here is another link&_a=(columns:!(CondorExitCode,Chirp_WMCore_cmsRun_ExitCode,Chirp_WMCore_cmsRun1_ExitCode,Chirp_WMCore_cmsRun2_ExitCode,ExitCode),filters:!(),index:'0f3117a0-d2e6-11ed-bdae-e3cf8c07f2fc',interval:auto,query:(language:lucene,query:'Chirp_WMCore_cmsRun_ExitCode:0%20AND%20Chirp_WMCore_cmsRun2_ExitCode:%3E0%20AND%20CondorExitCode:%3E0'),sort:!())) of jobs that had ExitCode=0, Chirp_WMCore_cmsRun_ExitCode=0 but CondorExitCode!=0. These are failed jobs and we should modify commonExitCode() to report something other than 0 for that case, based on CondorExitCode. The reason is the full wrapper didn't get to finish for some reason, so the full job per-se was not successful.

vkuznet commented 1 year ago

Due to such complexity of exit code placement would be more beneficial to setup all individual exit codes in different places and let CMS Monitoring define proper logic of final exit code? For instance, as far as provided examples we have WMCore exit codes (for individual steps), we have CRAB one, we have HTCondor ones, etc. So if all of them are presented in document which we push to CMS Monitoring (or CMS Monitoring read it as classAds) then it seems more easier and natural to define final exit code within CMS Monitoring such that we can place all necessary logic and conditions over there. Is it better solution in this case?

khurtado commented 1 year ago

@vkuznet Yes, there is such logic here in cms-htcondor-es but we have only been using the CRAB/WMCore and ExitCode classads, AFAIK. In my opinion, incorporating the condor classad CondorExitCode into this logic would most likely solve our issues without creating this post cmsRun step proposed in the issue. @amaltaro what do you think? this would not really fix the Chirp_WMCore_cmsRun_ExitCode issue but it would translate in a more accurate ExitCode which is what we need at the end to categorize successful vs failed/removed jobs.

amaltaro commented 1 year ago

Hi everyone, thank you for your feedback!

Yes, it looks like we have to incorporate CondorExitCode as part of the logic deciding whether a job succeeded or not - from the monitoring perspective. Thus not relying solely on the ExitCode or on the many Chirp_WMCore_cmsRunX_ExitCode attributes.

Let me throw a different idea though. Jobs that exit condor are finally evaluated by the JobAccountant component in the agent. What if we had the ability to send a final and reliable JobExitCode for each job that we are parsing? This way we would be in sync with how CRAB does this as well. I have to say though, I do not know what the effort it would take and whether we could make an asynchronous update to a given condor job id.

nikodemas commented 1 year ago

In my opinion, incorporating the condor classad CondorExitCode into this logic would most likely solve our issues without creating this post cmsRun step proposed in the issue.

@khurtado @amaltaro CondorExitCode is already incorporated into the logic of CMS Monitoring ExitCode because commonExitCode(ad) uses ExitCode field which is only later renamed to CondorExitCode. I think the reason why in the dashboard from @khurtado we have jobs where ExitCode=0 but CondorExitCode!=0 is because the commonExitCode(ad) function first checks for JobExitCode, Chirp_CRAB3_Job_ExitCode and Chirp_WMCore_cmsRun_ExitCode values and only if these are not set it proceeds to take value from ExitCode (CondorExitCode). So in Kenyi's example we simply found a 0 code before even reaching CondorExitCode. Maybe then the order of these checks should be changed?

amaltaro commented 1 year ago

If we can't have the same set of exit codes in both production and CRAB, I think the cleanest approach would be to separate them in condor monitoring. Identify if the job is production or crab, and deal with the exit codes accordingly. It would make the code much more readable IMO.

khurtado commented 1 year ago

@nikodemas @amaltaro Thanks! So, we basically need to start with the fact that if ad.get("ExitCode")is None, we return 50666; even if the other production/CRAB classads exist. Then, to make it cleaner, we follow @amaltaro suggestion to identify production vs CRAB classads and follow the logic according to each case, in order to find the final exit code (for instance in production, even if the chirp classads are all 0s, we want to report the raw condor ExitCode if this is anything other than 0, because that means stageOut or something else failed after some of all of the cmsRun executables finished).

amaltaro commented 1 year ago

This issue has been addressed with: https://github.com/dmwm/cms-htcondor-es/pull/212 and https://github.com/dmwm/cms-htcondor-es/pull/210

Thanks Kenyi!

dmwm / WMCore

Chirp_WMCore_cmsRun_ExitCode can be wrong for removed multi-step jobs #11654