Open jenimal opened 6 years ago
FWIW - this is the appropriate exit code for this case.
99303 enumerates a symptom (missing job report) but doesn't narrow down a cause. The error message suggests the cause:
ERROR : Failed to execvp() /srv/.osgvo-user-job-wrapper.sh: Permission denied
We should have the wrapper detect this situation and either fake a job report or utilize an identifiable exit code.
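A minimal sketch of the "identifiable exit code" option, assuming a Python launcher stands in for the real wrapper (which is a shell script). The function name `run_payload` and the constant `BOOTSTRAP_EXEC_DENIED` are hypothetical, and the value 126 is only a placeholder borrowed from the shell convention for "found but not executable"; the real code would have to be agreed on with the Submission Infrastructure side.

```python
import subprocess
import sys

# Hypothetical placeholder value, NOT an agreed WMAgent/CMS exit code.
# 126 is the conventional shell status for "command found but not executable".
BOOTSTRAP_EXEC_DENIED = 126


def run_payload(path, args=()):
    """Run the payload and return its exit code.

    If the payload cannot be executed because of a permission problem
    (the execvp() "Permission denied" case from the error message above),
    return a dedicated exit code instead of dying without a job report.
    """
    try:
        return subprocess.run([path, *args]).returncode
    except PermissionError as exc:
        # Make the cause visible in the job's stderr as well.
        print(f"bootstrap failure: cannot execute {path}: {exc}", file=sys.stderr)
        return BOOTSTRAP_EXEC_DENIED


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: launcher.py PAYLOAD [ARGS...]")
    sys.exit(run_payload(sys.argv[1], sys.argv[2:]))
```

With something along these lines, a broken node would produce one distinct code rather than the generic missing-job-report symptom, so the failures could be filtered on directly.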
Would that then put something into Dashboard that the site support team can cue off of for finding the site issues? Had a brief discussion with SeangChan this morning and he said it probably wouldn't, but the info would be buried in WMArchive someplace, and we don't have a way to get it out right now. We need to be able to get the following information to be useful for opening a ticket to the sites:
- Site the error happened at - needed especially if it's happening on overflow
- Node the error occurred on - usually it's just a couple of nodes that are having the failure

Thinking through the issue: if we could have a view in WMArchive that would allow us to see which site/node was failing jobs rapidly, this is exactly what the site support team would need to identify black holes. @mapsacosta, anything else you would add to that, Maria?
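To illustrate the kind of view meant here, a rough sketch that counts failures per site and node, assuming job records are available as plain dictionaries. The keys `site`, `host` and `exitCode` are borrowed from the axes in the WMArchive performance URL, but the record layout and the function name `count_failures_by_node` are assumptions, not the actual WMArchive schema or API.

```python
from collections import Counter

# 99303 is the generic "no job report" exit code discussed in this thread.
NO_JOB_REPORT = 99303


def count_failures_by_node(records, exit_code=NO_JOB_REPORT):
    """Return ((site, host), count) pairs for the given exit code, worst first.

    `records` is an iterable of dicts with the hypothetical keys
    "site", "host" and "exitCode".
    """
    counts = Counter()
    for rec in records:
        if rec.get("exitCode") == exit_code:
            counts[(rec.get("site"), rec.get("host"))] += 1
    return counts.most_common()


# Made-up example records, for illustration only.
example = [
    {"site": "T2_XX_Example", "host": "node01", "exitCode": 99303},
    {"site": "T2_XX_Example", "host": "node01", "exitCode": 99303},
    {"site": "T2_XX_Example", "host": "node02", "exitCode": 0},
    {"site": "T2_YY_Example", "host": "node07", "exitCode": 99303},
]
for (site, host), n in count_failures_by_node(example):
    print(site, host, n)
```

A node that tops this list by a wide margin within a short time window would be a candidate black hole, and the site/host pair is what a ticket to the site would need.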
I think getting this from WMArchive or similar is not the right direction: this is a problem in the submission infrastructure layer, so they'll need to update the HTCondor wrapper to exit with an identifiable code (and to do the normal rate-limiting).
This is about an hour or so of work on the Submission Infrastructure side - or days of hacky workarounds on the WMAgent layer.
FWIW - to detect these, here's a table of the failures by site: https://es-cms.cern.ch/kibana/goto/216c7352773e4d239c57b5ed023f60ef
Now seeing a different behavior... Dashboard is showing 8001, across sites, but WMStats and WMArchive are both showing 99303's at IN2P3 http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=Jobs&p=1&records=25&activemenu=1&usr=&site=&submissiontool=wmagent&application=&activity=reprocessing&status=&check=submitted&tier=&date1=2018-03-15+09%3A13&date2=2018-03-15+13%3A13&sortby=appexitcode&scale=linear&bars=20&ce=&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=
looking at WMStats: there are 2 8001's and thousands of 99303's.. so I think we know what the real exit code is here
you have to dig a bit to get to IN2P3 failures as the site was put into drain yesterday
I don't know where the 8001 exit code comes from, but 99303 is a generic exit code (created by WMAgent) which says that the job had no job report. If we don't have any job report, we cannot yield any other exit code than this generic one.
I think Brian described (above) what the problem is (a failure during the payload bootstrap).
Here is an elog describing an attempt to debug this; Dashboard is not telling the truth today: https://cms-logbook.cern.ch/elog/Workflow+processing/27509
Also related to #8473 and #8264. Dashboard is showing this:
http://dashb-cms-job.cern.ch/dashboard/templates/web-job2/#user=&refresh=0&table=Jobs&p=1&records=25&activemenu=1&usr=&site=T2_US_Florida&submissiontool=wmagent&application=&activity=reprocessing&status=&check=submitted&tier=&date1=2018-02-25+16%3A56&date2=2018-02-26+16%3A56&sortby=appexitcode&scale=linear&bars=20&ce=&rb=&grid=&jobtype=&submissionui=&dataset=&submissiontype=&task=&subtoolver=&genactivity=&outputse=&appexitcode=&accesstype=&inputse=&cores=
WMArchive is showing: https://cmsweb.cern.ch/wmarchive/web/performance?metrics[]=jobstate&axes[]=host&axes[]=jobstate&axes[]=site&axes[]=exitCode&start_date=20180226&end_date=20180226&aggDB=aggregated&aggCol=hour&site=T2_US_Florida
When I look for the actual error message in WMStats I see: