Open paorozo opened 7 years ago
What sort of additional site info would you like? Right now, the color of the cell for the site indicates the site readiness status.
For example, the workflow pdmvserv_task_HIG-RunIISummer15wmLHEGS-01101__v1_T_170223_204023_1926 had submit failures to T2_DE_RWTH. In the summary table, I would like to see the sites where the job tried to run (RWTH in this case). At some point, when we apply an action to multiple workflows, it would be very useful to group not only by exit code but also by the site involved.
@prozober Can you check out #28, in particular the procedures.py file: https://github.com/CMSCompOps/WorkflowWebTools/pull/28/files#diff-8700d1405003610d0b06a3bb1492f007 (The other changes are just bug fixes that I had caught and thrown into my master to patch the server.)
The 're' key makes it parse the 71104 error logs for the line that matches 'The job can run only at .*(T[12].*)', with the group at the end extracting the site name and feeding it to the operator. Let me know what other operator instructions you would want. I don't want to merge this without your feedback.
Like I said before, I'll need to start writing tests for these things, but I would expect this to work the next time a 71104 error comes up...
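For reference, the matching step described above can be sketched roughly as follows. This is only an illustration of how the quoted pattern behaves, not the code in procedures.py: the function name `extract_site` and the sample log text are made up, and only the regular expression itself comes from this thread.

```python
import re

# Pattern quoted in the comment above; the trailing group captures the site name.
SITE_PATTERN = re.compile(r'The job can run only at .*(T[12].*)')


def extract_site(log_text):
    """Return the site captured from the first matching log line, or None.

    Hypothetical helper for illustration; not part of WorkflowWebTools.
    """
    for line in log_text.splitlines():
        match = SITE_PATTERN.search(line)
        if match:
            return match.group(1).strip()
    return None


# Made-up log excerpt; real 71104 logs may be formatted differently.
sample_log = "Job submission failed\nThe job can run only at T2_DE_RWTH\n"
print(extract_site(sample_log))
```

Note that the `.*` before the group is greedy, so if a log line listed several sites, this pattern would capture only the last one that starts with T1 or T2.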
I am not sure if I got this. For example, we have this workflow https://vocms0113.cern.ch:80/seeworkflow?workflow=pdmvserv_task_HIG-RunIISummer15GS-01999__v1_T_170316_205819_3815
I only see "71102" (and "-1") for the first two tasks. The "71104" error is thrown for the third task, which does have T1_US_FNAL selected by default. I should clean up these tables to make them easier to read...
Also, the 71102 logs only mention T0_CH_CERN: https://vocms0113.cern.ch:80/explainerror?errorcode=71102&workflowstep=/pdmvserv_task_HIG-RunIISummer15GS-01999__v1_T_170316_205819_3815/HIG-RunIISummer15GS-01999_0/HIG-RunIISummer15GS-01999_0MergeRAWSIMoutput/HIG-RunIISummer16DR80Premix-02408_0/HIG-RunIISummer16DR80Premix-02408_1
Oh, stupid me, I checked the wrong column. When I clicked on the Exit Code, it displayed the 71104 logs for the whole workflow. Could we get the logs only for the task involved?
As you said, the third task had the "71104" error and the FNAL checkbox is selected, but the 2 failed jobs are listed in the "Unknown" column. Could they be listed in the FNAL column instead?
Thanks Dan.
Both of those things are certainly possible. Changing the columns is sort of non-trivial since it involves the structure of the all_errors.json file, but I can leave this issue open to work on it.
The easiest thing for me to do is probably to check the logs for sites whenever "unknown" is listed once I start rebuilding the all_errors.json.
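A rough sketch of that idea, under loud assumptions: the nested layout `{step: {exit_code: {site: count}}}` is a guess for illustration and may not match the real all_errors.json structure, and both `reassign_unknown` and the `find_site` callback are hypothetical names. The callback stands in for whatever log parsing ends up being used.

```python
def reassign_unknown(errors, find_site):
    """Move counts from the "Unknown" column to a site found in the logs.

    errors: {step: {exit_code: {site: count}}} (assumed layout, see above).
    find_site: callable (step, exit_code) -> site name, or None if the
    logs do not name a site. Hypothetical interface for illustration.
    """
    for step, codes in errors.items():
        for code, sites in codes.items():
            count = sites.get("Unknown", 0)
            if not count:
                continue  # nothing listed as unknown for this exit code
            site = find_site(step, code)
            if site is None:
                continue  # logs gave no site, so leave the count where it is
            # Drop the "Unknown" entry and add its count to the resolved site.
            sites.pop("Unknown")
            sites[site] = sites.get(site, 0) + count
    return errors


# Made-up example: two failed jobs listed under "Unknown" for a 71104 error.
errors = {"/workflow/task3": {"71104": {"Unknown": 2, "T1_US_FNAL": 0}}}
reassign_unknown(errors, lambda step, code: "T1_US_FNAL")
print(errors)
```

With something like this, the 2 failed jobs from the example above would show up in the FNAL column, provided the logs actually name the site.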
Dan, it would be great if the tables did not include the sites where the workflow/task did not fail; they would be easier to read. What do you think? Should I create a new GH issue?
This is to remind myself to do this:
check the logs for sites whenever "unknown" is listed once I start rebuilding the all_errors.json
Every exit code 71104 must be related to one site, as you can see in the details of the exit codes. https://cmsweb.cern.ch/wmstatsserver/data/jobdetail/pdmvserv_task_HIG-RunIISummer15wmLHEGS-01101__v1_T_170223_204023_1926
Could we somehow include the site info in our "summary table" here: https://vocms0113.cern.ch:80/seeworkflow?workflow=pdmvserv_task_HIG-RunIISummer15wmLHEGS-01101__v1_T_170223_204023_1926