dmwm / CRABServer


improve reporting of jobs which run out of time 50664 #7001

Open belforte opened 2 years ago

belforte commented 2 years ago

currently job stdout looks like

== DIR: sandbox.tar.gz
== DIR:
==== Local directory contents dump FINISHING ====
======== STARTING at Sat Jan 22 11:52:19 GMT 2022 ========
Now running the job in /srv...
++ pwd
+ python -r /srv -a sandbox.tar.gz --sourceURL= --jobNumber=280 --cmsswVersion=CMSSW_12_0_1 --scramArch=slc7_amd64_gcc900 --inputFile=job_input_file_list_280.txt --runAndLumis=job_lumis_280.json --lheInputFiles=False --firstEvent=None --firstLumi=None --lastEvent=None --firstRun=None --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=None --maxRuntime=-60 '--scriptArgs=[]' -o '{}' --oneEventMode=0
======== Figuring out long exit code of the job for condor_chirp ========
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/lib64/python2.7/json/", line 290, in load
  File "/usr/lib64/python2.7/json/", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
==== Failed to load the long exit code from jobReport.json.280. Falling back to short exit code ====
======== Short exit code also missing. Settint exit code to 80001 ========
======== Finished condor_chirp -ing the exit code of the job. Exit code of condor_chirp: 0 ========
Job Running time in seconds:  3601

should check if there is a way to also print a fragment of the cmsRun stdout, if there is any; it is hard to believe that cmsRun never started
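A minimal sketch of how such a fragment could be printed, assuming the payload stdout ends up in a file inside the job directory. The function name `tail_fragment` and the file name in the usage line are hypothetical, not taken from the CRAB wrapper.

```python
from collections import deque
from pathlib import Path


def tail_fragment(path, max_lines=20):
    """Return the last max_lines lines of path, or a notice if there is nothing to show."""
    log = Path(path)
    if not log.is_file():
        # A missing file is itself useful information: the payload may not have started.
        return f"== {path} not found: payload may never have started =="
    with log.open(errors="replace") as handle:
        # deque with maxlen keeps only the final lines without loading the whole file.
        lines = deque(handle, maxlen=max_lines)
    if not lines:
        return f"== {path} is empty =="
    return "".join(lines).rstrip("\n")


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(tail_fragment("cmsRun-stdout.log"))
```

Printing this right before the exit code is worked out would show whether cmsRun produced any output before the job ran out of time.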

belforte commented 1 year ago

at the very least add a time stamp to/before the line

======== Figuring out long exit code of the job for condor_chirp ========

so that it is immediately clear what happened. Could also compare the job running time to the max wall time (in classAds) and directly set 50664 here instead of 80001
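A rough sketch of both ideas, under stated assumptions: the running time and the max wall time are both available in seconds when the fallback exit code is chosen, and how the max wall time is read from the classAds is left out. The function and constant names are hypothetical; only the codes 50664 and 80001 and the banner style come from the log above.

```python
import time

EXIT_WALLTIME_EXCEEDED = 50664  # code named in this issue for jobs that ran out of time
EXIT_FALLBACK = 80001           # code the wrapper currently sets when no exit code is found


def timestamped_banner(message, now=None):
    """Format a banner like the existing 'STARTING at ...' line; now=None means current time."""
    stamp = time.strftime("%a %b %d %H:%M:%S GMT %Y", time.gmtime(now))
    return f"======== {message} at {stamp} ========"


def fallback_exit_code(running_seconds, max_wall_seconds):
    """Pick the exit code to report when jobReport.json gives neither long nor short code."""
    # Assumption: max_wall_seconds is None when the limit could not be determined.
    if max_wall_seconds is not None and running_seconds >= max_wall_seconds:
        return EXIT_WALLTIME_EXCEEDED
    return EXIT_FALLBACK


if __name__ == "__main__":
    print(timestamped_banner("Figuring out long exit code of the job for condor_chirp"))
    # Illustrative numbers only: 3601 s of running time against a 3600 s limit.
    print(fallback_exit_code(3601, 3600))
```

Keeping 80001 as the default means jobs that fail for other reasons are still reported as before; only jobs at or beyond the wall time limit get 50664.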