dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

Dashboard showing n/a exit code WMAgent and WMArchivd showing 99303 #8491

Open jenimal opened 6 years ago

jenimal commented 6 years ago

Also related to #8473 and #8264 Dashboard is showing this:

http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=Jobs&p=1&records=25&activemenu=1&usr=&site=T2_US_Florida&submissiontool=wmagent&application=&activity=reprocessing&status=&check=submitted&tier=&date1=2018-02-25+16%3A56&date2=2018-02-26+16%3A56&sortby=appexitcode&scale=linear&bars=20&ce=&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=

WmArchive is showing https://cmsweb.cern.ch/wmarchive/web/performance?metrics[]=jobstate&axes[]=host&axes[]=jobstate&axes[]=site&axes[]=exitCode&start_date=20180226&end_date=20180226&aggDB=aggregated&aggCol=hour&site=T2_US_Florida

When I look for the actual error message in WmStats I see:

NoJobReport (Exit Code: 99303)
Could not find jobReport

========== condor.26350.27.err ==========
Adding end of condor.26350.27.err to error message:

WARNING: Not mounting requested bind point (already mounted in container): /srv
WARNING: Skipping scratch directory mount, target directory does not exist: /var/tmp
WARNING: Container does not have an exec helper script, calling '/srv/.osgvo-user-job-wrapper.sh' directly
ERROR  : Failed to execvp() /srv/.osgvo-user-job-wrapper.sh: Permission denied
ABORT  : Retval = 255


========== condor.26350.27.out ==========
Adding end of condor.26350.27.out to error message:

========== condor.26350.27.log ==========
Adding end of condor.26350.27.log to error message:

    <a n="JobStatus"><i>2</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="ResidentSetSize"><i>0</i></a>
    <a n="Proc"><i>27</i></a>
    <a n="Size"><i>100</i></a>
    <a n="MachineAttrGLIDEIN_CMSSite0"><s>T2_US_Florida</s></a>
</c>
<c>
    <a n="SentBytes"><r>4.180000000000000E+02</r></a>
    <a n="Message"><s>Error from slot1_1@glidein_13121_245885200@c30a-s5.ufhpc: STARTER at 10.13.162.73 failed to send file(s) to &lt;188.184.86.128:4080&gt;: error reading from /scratch/local/17319405/17319405.cms_slurm_job/glide_TwMosw/execute/dir_28895/Report.0.pkl: (errno 2) No such file or directory; SHADOW failed to receive file(s) from &lt;128.227.90.118:33412&gt;</s></a>
    <a n="Proc"><i>27</i></a>
    <a n="Cluster"><i>26350</i></a>
    <a n="EventTime"><s>2018-02-26T17:21:29</s></a>
    <a n="ReceivedBytes"><r>1.195907000000000E+07</r></a>
    <a n="MyType"><s>ShadowExceptionEvent</s></a>
    <a n="Subproc"><i>0</i></a>
    <a n="EventTypeNumber"><i>7</i></a>
</c>
<c>
    <a n="HoldReasonCode"><i>13</i></a>
    <a n="HoldReason"><s>Error from slot1_1@glidein_13121_245885200@c30a-s5.ufhpc: STARTER at 10.13.162.73 failed to send file(s) to &lt;188.184.86.128:4080&gt;: error reading from /scratch/local/17319405/17319405.cms_slurm_job/glide_TwMosw/execute/dir_28895/Report.0.pkl: (errno 2) No such file or directory; SHADOW failed to receive file(s) from &lt;128.227.90.118:33412&gt;</s></a>
    <a n="Proc"><i>27</i></a>
    <a n="Cluster"><i>26350</i></a>
    <a n="EventTime"><s>2018-02-26T17:21:30</s></a>
    <a n="MyType"><s>JobHeldEvent</s></a>
    <a n="HoldReasonSubCode"><i>2</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="EventTypeNumber"><i>12</i></a>
</c>
<c>
    <a n="TriggerEventTypeNumber"><i>12</i></a>
    <a n="DESIRED_Sites"><s>T1_US_FNAL,T2_CH_CERN,T2_US_Caltech,T2_US_Florida,T2_US_MIT,T2_US_Nebraska,T2_US_Purdue,T2_US_UCSD,T2_US_Vanderbilt,T2_US_Wisconsin,T3_US_NERSC,T3_US_OSG</s></a>
    <a n="JobStartDate"><i>1519662079</i></a>
    <a n="ExtDESIRED_Sites"><s>T2_US_Nebraska</s></a>
    <a n="EnteredCurrentStatus"><i>1519662090</i></a>
    <a n="QDate"><i>1519657988</i></a>
    <a n="EventTypeNumber"><i>28</i></a>
    <a n="Subproc"><i>0</i></a>
    <a n="HoldReasonSubCode"><i>2</i></a>
    <a n="MyType"><s>JobAdInformationEvent</s></a>
    <a n="TriggerEventTypeName"><s>ULOG_JOB_HELD</s></a>
    <a n="EventTime"><s>2018-02-26T17:21:30</s></a>
    <a n="Cluster"><i>26350</i></a>
    <a n="MachineAttrGLIDEIN_CMSSite0"><s>T2_US_Florida</s></a>
    <a n="Proc"><i>27</i></a>
    <a n="JobStatus"><i>5</i></a>
    <a n="HoldReasonCode"><i>13</i></a>
    <a n="HoldReason"><s>Error from slot1_1@glidein_13121_245885200@c30a-s5.ufhpc: STARTER at 10.13.162.73 failed to send file(s) to &lt;188.184.86.128:4080&gt;: error reading from /scratch/local/17319405/17319405.cms_slurm_job/glide_TwMosw/execute/dir_28895/Report.0.pkl: (errno 2) No such file or directory; SHADOW failed to receive file(s) from &lt;128.227.90.118:33412&gt;</s></a>
    <a n="WMAgent_JobID"><i>3155896</i></a>
</c>
bbockelm commented 6 years ago

FWIW - this is the appropriate exit code for this case.

99303 enumerates a symptom (missing job report) but doesn't narrow down a cause. The error message suggests the cause:

ERROR  : Failed to execvp() /srv/.osgvo-user-job-wrapper.sh: Permission denied

We should have the wrapper detect this situation and either fake a job report or utilize an identifiable exit code.

jenimal commented 6 years ago

Would that then put something into Dashboard that the site support team can cue off of for finding the site issues? Had a brief discussion with SeangChan this morning and he said it probably wouldn't, but the info would be buried in WmArchive someplace, and we don't have a way to get it out right now. We need to be able to get the following information to be useful for opening a ticket to the sites: Site error happened at - neede esp if it's happening on overflow Node the error occured on - usually it's just a couple nodes that are having the failure

thinking through the issue, if we could have a view in WmArchive that would allow us to see what site/node was failing out the jobs rapidly, this is exactly what the site support team would need to identify black holes @mapsacosta anything else you would at to that Maria?

bbockelm commented 6 years ago

I think getting this from WMArchive or similar is not the right direction: this is a problem in the submission infrastructure layer, so they'll need to update the HTCondor wrapper to exit with an identifiable code (and to do the normal rate-limiting).

This is about an hour or so of work on the Submission Infrastructure side - or days of hacky workarounds on the WMAgent layer.

FWIW - to detect these, here's a table of the failurs by site: https://es-cms.cern.ch/kibana/goto/216c7352773e4d239c57b5ed023f60ef

jenimal commented 6 years ago

Now seeing a different behavior... Dashboard is showing 8001, across sites, but WMStats and WMArchive are both showing 99303's at IN2P3 http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=Jobs&p=1&records=25&activemenu=1&usr=&site=&submissiontool=wmagent&application=&activity=reprocessing&status=&check=submitted&tier=&date1=2018-03-15+09%3A13&date2=2018-03-15+13%3A13&sortby=appexitcode&scale=linear&bars=20&ce=&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=

http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=Jobs&p=1&records=25&activemenu=1&usr=&site=&submissiontool=wmagent&application=&activity=reprocessing&status=&check=submitted&tier=&date1=2018-03-15+09%3A13&date2=2018-03-15+13%3A13&sortby=site&scale=linear&bars=20&ce=&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=8001&accesstype=&inputse=&cores=

looking for a task name: http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=Jobs&p=1&records=25&activemenu=1&usr=&site=T2_CH_CERN_HLT&submissiontool=wmagent&application=&activity=reprocessing&status=&check=submitted&tier=&date1=2018-03-15+09%3A13&date2=2018-03-15+13%3A13&sortby=task&scale=linear&bars=20&ce=&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=8001&accesstype=&inputse=&cores=

looking at WMStats: there are 2 8001's and thousands of 99303's.. so I think we know what the real exit code is here

https://cmsweb.cern.ch/wmarchive/web/performance?metrics[]=jobstate&axes[]=host&axes[]=jobstate&axes[]=site&start_date=20180309&end_date=20180315&aggDB=aggregated&aggCol=hour&workflow=pdmvserv_task_HIG-RunIIFall17wmLHEGS-01122__v1_T_180227_074533_1127

you have to dig a bit to get to IN2P3 failures as the site was put into drain yesterday

amaltaro commented 6 years ago

I don't know where the 8001 exit code comes from, but 99303 is a generic exit code (created by WMAgent) which says that the job had no job report. If we don't have any job report, we cannot yield any other exit code than this generic one.

I think Brian described (above) what the problem is (a failure during the payload bootstrap).

jenimal commented 6 years ago

Here is an elog.. describing attempting to debug Dashboard is not telling the truth today: https://cms-logbook.cern.ch/elog/Workflow+processing/27509