dmwm / WMCore

Core workflow management components for CMS.

Corrupted job file? #7755

Open johnhcasallasl opened 7 years ago

johnhcasallasl commented 7 years ago

Hello:

We found a Repack job failing when trying to unpack the job file [1]. It only happened for this job, but it raises the question of how we should recover from this problem if we see it happening again. This job is paused on vocms0314 in case you need to check it [2].

[1]
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : starting...
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : arguments validated...
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : WMAgent thinks it found the correct CMSSW setup script
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : found python2 at.. /cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/bin/python2
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : is unpacking the job...
Unable to create job area for bootstrap
compressed file ended before the logical end-of-stream was detected
Traceback (most recent call last):
  File "Unpacker.py", line 124, in runUnpacker
    jobArea = createWorkArea(sandbox)
  File "Unpacker.py", line 80, in createWorkArea
    tfile.extractall(jobDir)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 2047, in extractall
    self.extract(tarinfo, path)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 2084, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 2160, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 2201, in makefile
    copyfileobj(source, target)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 266, in copyfileobj
    shutil.copyfileobj(src, dst)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/shutil.py", line 49, in copyfileobj
    buf = fsrc.read(length)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 819, in read
    buf += self.fileobj.read(size - len(buf))
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 737, in read
    return self.readnormal(size)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 746, in readnormal
    return self.fileobj.read(size)
EOFError: compressed file ended before the logical end-of-stream was detected

[2] /data/tier0/srv/wmagent/2.0.5/install/tier0/JobCreator/JobCache/Repack_Run288593_StreamPhysics/Repack/JobCollection_971608_0/job_3557741

amaltaro commented 7 years ago

I could reproduce the same untar issue. In fact, one can open the spec tarball /data/tier0/admin/Specs/Repack_Run288593_StreamPhysics/Repack_Run288593_StreamPhysics-Sandbox.tar.bz2 with vim and it reports that the file is corrupted.
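For reference, the corruption can also be confirmed without vim by forcing a full read of the archive with Python's tarfile module. This is just an illustration of such a check, not anything that exists in WMCore; the path is the one quoted above.

```python
import tarfile

# Path taken from the report above; the check itself is only illustrative.
sandbox = "/data/tier0/admin/Specs/Repack_Run288593_StreamPhysics/Repack_Run288593_StreamPhysics-Sandbox.tar.bz2"

try:
    with tarfile.open(sandbox, "r:bz2") as tar:
        for member in tar:
            if member.isfile():
                tar.extractfile(member).read()  # force full decompression of each member
    print("tarball looks intact")
except (tarfile.ReadError, EOFError, IOError) as ex:
    print("tarball appears corrupted: %s" % ex)
```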

What puzzles me is how the other jobs for the same request managed to untar and run. @johnhcasallasl, can you confirm whether this workflow had other jobs injected and running without any issues, please?

hufnagel commented 7 years ago

We seem to see this happening every once in a while: the workflow sandbox creation produces a corrupted tarball without the system noticing, except when a job tries to use it.

I simplified the tarball creation logic in JobCreator a while ago to possibly help with this. Since it seems to continue to happen, how about adding an explicit check that the produced tarball is ok (by reading and expanding it after creation) ?
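As a rough sketch of what that post-creation check could look like (the helper name and call site below are hypothetical, not existing JobCreator code; it assumes we keep producing bz2 tarballs):

```python
import shutil
import tarfile
import tempfile


def verifySandbox(tarballPath):
    """
    Hypothetical helper: fully expand the freshly created sandbox into a
    scratch directory so a truncated archive is caught at creation time,
    not when a job tries to unpack it on the worker node.
    """
    scratch = tempfile.mkdtemp(prefix="sandboxCheck")
    try:
        with tarfile.open(tarballPath, "r:bz2") as tar:
            tar.extractall(scratch)
        return True
    except (tarfile.ReadError, EOFError, IOError):
        return False
    finally:
        shutil.rmtree(scratch, ignore_errors=True)


# JobCreator could then recreate the tarball (or fail loudly) when this returns False.
```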

amaltaro commented 7 years ago

That's the approach I was thinking of as well. However, John said other jobs within this same workflow succeeded(?). How could they, if they use exactly the same tarball? @johnhcasallasl, please clarify.

johnhcasallasl commented 7 years ago

Looking at the DB, that is the only job for that workflow.

amaltaro commented 7 years ago

@johnhcasallasl I remade that tarball based on another Repack sandbox. Can you please replace:

/data/tier0/admin/Specs/Repack_Run288593_StreamPhysics/Repack_Run288593_StreamPhysics-Sandbox.tar.bz2

with:

/data/srv/alan/remake/Repack_Run288593_StreamPhysics-Sandbox.tar.bz2

Make sure to keep the correct permissions/owner.
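In case it helps, one way to do that swap while keeping the original owner and permissions, sketched in Python (a plain cp followed by chown/chmod achieves the same; the paths are the ones above, and sufficient privileges are assumed):

```python
import os
import shutil
import stat

# Paths as given in this thread; this only illustrates the manual swap.
oldPath = "/data/tier0/admin/Specs/Repack_Run288593_StreamPhysics/Repack_Run288593_StreamPhysics-Sandbox.tar.bz2"
newPath = "/data/srv/alan/remake/Repack_Run288593_StreamPhysics-Sandbox.tar.bz2"

original = os.stat(oldPath)                  # remember owner, group and mode
shutil.copyfile(newPath, oldPath)            # overwrite contents in place
os.chown(oldPath, original.st_uid, original.st_gid)
os.chmod(oldPath, stat.S_IMODE(original.st_mode))
```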

amaltaro commented 7 years ago

and of course, un-pause that job :)

ebohorqu commented 7 years ago

@amaltaro, we copied the file generated by you to the required location and resumed the paused job. Thanks!

amaltaro commented 7 years ago

How did it go, did the job succeed?

ebohorqu commented 7 years ago

We didn't observe issues with the tarball when we resumed the job, but in the end it could not succeed because its input streamer file had already been deleted from EOS :/ We failed the job.
