CMSCompOps / WorkflowWebTools

https://workflowwebtools.readthedocs.io

Reporting submit failures #26

Open paorozo opened 7 years ago

paorozo commented 7 years ago

Every exit code 71104 must be related to one site, as you can see in the details of the exit codes. https://cmsweb.cern.ch/wmstatsserver/data/jobdetail/pdmvserv_task_HIG-RunIISummer15wmLHEGS-01101__v1_T_170223_204023_1926

Could we somehow include the site info in our "summary table" here? https://vocms0113.cern.ch:80/seeworkflow?workflow=pdmvserv_task_HIG-RunIISummer15wmLHEGS-01101__v1_T_170223_204023_1926

dabercro commented 7 years ago

What sort of additional site info would you like? Right now, the color of the cell for the site indicates the site readiness status.

paorozo commented 7 years ago

For example, the workflow pdmvserv_task_HIG-RunIISummer15wmLHEGS-01101__v1_T_170223_204023_1926 had submit failures to T2_DE_RWTH. In the summary table, I would like to see the sites where the job tried to run (RWTH in this case). At some point, when we apply an action to multiple workflows, it would be very useful to group not only by exit code but also by the site involved.

dabercro commented 7 years ago

@prozober Can you check out #28, in particular the procedures.py file: https://github.com/CMSCompOps/WorkflowWebTools/pull/28/files#diff-8700d1405003610d0b06a3bb1492f007 (The other changes are just bug fixes that I had caught and thrown into my master to patch the server.)

The 're' key makes it parse the 71104 error logs for the line that matches 'The job can run only at .*(T[12].*)', with the group at the end extracting the site name and feeding it to the operator. Let me know what other operator instructions you would want. I don't want to merge this without your feedback.
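To illustrate how that pattern behaves, here is a minimal sketch using Python's standard `re` module. It is not the code from procedures.py; the function name `extract_site` and the sample log text are made up for illustration, and real 71104 logs may look different.

```python
import re

# Pattern quoted in the comment above; the trailing group captures the site name.
SITE_PATTERN = re.compile(r"The job can run only at .*(T[12].*)")


def extract_site(log_text):
    """Return the site captured from the first matching line, or None.

    Hypothetical helper: the real procedures.py may structure this differently.
    """
    match = SITE_PATTERN.search(log_text)
    if match is None:
        return None
    # '.' does not match newlines, so the group ends at the end of the line.
    return match.group(1).strip()


# Made-up log excerpt, only to exercise the pattern.
sample_log = "Job submission failed\nThe job can run only at T2_DE_RWTH\nRetrying"
print(extract_site(sample_log))
```

Because the `.*` before the group is greedy, the group starts at the last `T1` or `T2` on the line, so a line that lists several sites would yield only the final one.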

Like I said before, I'll need to start writing tests for these things, but I would expect this to work the next time a 71104 error comes up...

paorozo commented 7 years ago

I am not sure if I got this. For example, we have this workflow https://vocms0113.cern.ch:80/seeworkflow?workflow=pdmvserv_task_HIG-RunIISummer15GS-01999__v1_T_170316_205819_3815

dabercro commented 7 years ago

I only see "71102" (and "-1") for the first two tasks. The "71104" error is thrown for the third task, which does have T1_US_FNAL selected by default. I should clean up these tables to make them easier to read...

dabercro commented 7 years ago

Also, the 71102 logs only mention T0_CH_CERN: https://vocms0113.cern.ch:80/explainerror?errorcode=71102&workflowstep=/pdmvserv_task_HIG-RunIISummer15GS-01999__v1_T_170316_205819_3815/HIG-RunIISummer15GS-01999_0/HIG-RunIISummer15GS-01999_0MergeRAWSIMoutput/HIG-RunIISummer16DR80Premix-02408_0/HIG-RunIISummer16DR80Premix-02408_1

paorozo commented 7 years ago

Oh, stupid me, I checked the wrong column. Also, when I click on the Exit Code, it displays the 71104 logs for the whole workflow. Could we get the logs only for the task involved?

As you said, the third task had the "71104" error and the FNAL checkbox is selected, but the 2 failed jobs are located in the "Unknown" column. Could they be placed in the FNAL column instead?

Thanks Dan.

dabercro commented 7 years ago

Both of those things are certainly possible. Changing the columns is sort of non-trivial since it involves the structure of the all_errors.json file, but I can leave this issue open to work on it.

The easiest thing for me to do is probably to check the logs for sites whenever "unknown" is listed once I start rebuilding the all_errors.json.
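A rough sketch of that idea follows. The actual layout of all_errors.json is not shown in this thread, so the nested `{step: {exit_code: {site: count}}}` structure, the `logs` mapping, and the function name `reassign_unknown` are all assumptions made for illustration.

```python
import re

# Same pattern as used for the 71104 logs; the group captures the site name.
SITE_PATTERN = re.compile(r"The job can run only at .*(T[12].*)")


def reassign_unknown(errors, logs):
    """Move "unknown" counts to a site found in the logs, when unambiguous.

    errors: assumed shape {step: {exit_code: {site: count}}}
    logs:   assumed shape {(step, exit_code): [log text, ...]}
    """
    for step, codes in errors.items():
        for code, sites in codes.items():
            if not sites.get("unknown"):
                continue
            found = set()
            for text in logs.get((step, code), []):
                match = SITE_PATTERN.search(text)
                if match:
                    found.add(match.group(1).strip())
            # Only reassign when the logs point to exactly one site.
            if len(found) == 1:
                site = found.pop()
                sites[site] = sites.get(site, 0) + sites.pop("unknown")
    return errors


# Made-up data, only to show the intended effect.
errors = {"/wf/task3": {"71104": {"unknown": 2}}}
logs = {("/wf/task3", "71104"): ["The job can run only at T1_US_FNAL"]}
print(reassign_unknown(errors, logs))
```

Leaving the count under "unknown" when the logs name zero or several sites avoids guessing where the jobs were meant to run.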

paorozo commented 7 years ago

Dan, it would be great if the tables did not include the sites where the workflow/task did not fail; they would be easier to read. What do you think? Should I create a new GH issue?

dabercro commented 7 years ago

#32 changes the workflow error table.

dabercro commented 6 years ago

This is to remind myself to do this:

check the logs for sites whenever "unknown" is listed once I start rebuilding the all_errors.json