Open johnhcasallasl opened 7 years ago
I could reproduce the same untar issue. In fact, opening the spec tarball /data/tier0/admin/Specs/Repack_Run288593_StreamPhysics/Repack_Run288593_StreamPhysics-Sandbox.tar.bz2 with vim reports that the file is corrupted.
What puzzles me is how the other jobs for the same request managed to untar and run. @johnhcasallasl, can you please confirm whether this workflow had other jobs injected and running without any issues?
We seem to see this every once in a while: the workflow sandbox creation produces a corrupted tarball, and the system does not notice until a job tries to use it.
I simplified the tarball creation logic in JobCreator a while ago to possibly help with this. Since it seems to continue to happen, how about adding an explicit check that the produced tarball is ok (by reading and expanding it after creation) ?
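A minimal sketch of what such a post-creation check could look like, assuming only Python's standard tarfile module. The function name sandbox_is_readable is hypothetical and is not existing JobCreator code; the sketch reads every member to the end instead of expanding to disk, which should surface the same truncation error without needing a scratch directory.

```python
import tarfile


def sandbox_is_readable(tarball_path):
    """Return True if every member of the tarball can be read to the end.

    Hypothetical helper: reads the archive instead of extracting it to disk.
    """
    try:
        with tarfile.open(tarball_path, "r:*") as tar:
            for member in tar:
                if member.isfile():
                    fileobj = tar.extractfile(member)
                    # Drain the member in chunks so that a truncated
                    # compressed stream is detected.
                    while fileobj.read(1024 * 1024):
                        pass
    except (tarfile.TarError, EOFError, OSError):
        # Truncated or otherwise unreadable archive.
        return False
    return True
```

The component creating the sandbox could call this right after writing the tarball and recreate the file (or raise) when it returns False, so the problem is caught before any job picks it up.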
That's the approach I was thinking of as well. However, John said other jobs within this same workflow succeeded(?). How could that be if they use exactly the same tarball? @johnhcasallasl, please clarify.
Looking at the DB, that is the only job for that workflow.
@johnhcasallasl I remade that tarball based on another Repack sandbox. Can you please replace:
/data/tier0/admin/Specs/Repack_Run288593_StreamPhysics/Repack_Run288593_StreamPhysics-Sandbox.tar.bz2
by
/data/srv/alan/remake/Repack_Run288593_StreamPhysics-Sandbox.tar.bz2
?
Make sure to keep the correct permissions/owner.
and of course, un-pause that job :)
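For reference, a rough sketch of how such a replacement could be scripted so that the permissions and owner of the original file are carried over. It assumes Python 3 (os.replace) and enough privileges to change ownership; the function name replace_keeping_owner is made up for illustration and the paths are passed in by the caller.

```python
import os
import shutil
import tempfile


def replace_keeping_owner(replacement, target):
    """Swap target for a copy of replacement, keeping target's mode and owner."""
    old = os.stat(target)
    target_dir = os.path.dirname(os.path.abspath(target))
    # Stage the copy next to the target so the final rename stays on one filesystem.
    fd, staged = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    os.close(fd)
    try:
        shutil.copyfile(replacement, staged)
        os.chmod(staged, old.st_mode & 0o7777)
        # Changing the owner to a different user normally requires root.
        os.chown(staged, old.st_uid, old.st_gid)
        os.replace(staged, target)
    except Exception:
        # Clean up the staged copy if any step failed.
        os.unlink(staged)
        raise
```

The rename at the end is atomic on POSIX filesystems, so a job starting at that moment would see either the old file or the complete new one.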
@amaltaro, we copied the file generated by you to the required location and resumed the paused job. Thanks!
How did it go? Did the job succeed?
We didn't observe issues with the tarball when we resumed the job, but in the end it could not succeed because its input streamer file had already been deleted from EOS :/ We failed the job.
Hello:
We found a Repack job failing when trying to unpack the job sandbox [1]. It only happened for this job, but it raises the question of how we should recover from this problem if we see it happen again. The job is paused on vocms0314 in case you need to check it [2].
[1]
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : starting...
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : arguments validated...
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : WMAgent thinks it found the correct CMSSW setup script
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : found python2 at.. /cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/bin/python2
WMAgent bootstrap : Mon Mar 27 11:09:53 UTC 2017 : is unpacking the job...
Unable to create job area for bootstrap
compressed file ended before the logical end-of-stream was detected
Traceback (most recent call last):
  File "Unpacker.py", line 124, in runUnpacker
    jobArea = createWorkArea(sandbox)
  File "Unpacker.py", line 80, in createWorkArea
    tfile.extractall(jobDir)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 2047, in extractall
    self.extract(tarinfo, path)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 2084, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 2160, in _extract_member
    self.makefile(tarinfo, targetpath)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 2201, in makefile
    copyfileobj(source, target)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 266, in copyfileobj
    shutil.copyfileobj(src, dst)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/shutil.py", line 49, in copyfileobj
    buf = fsrc.read(length)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 819, in read
    buf += self.fileobj.read(size - len(buf))
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 737, in read
    return self.readnormal(size)
  File "/cvmfs/cms.cern.ch/COMP/slc6_amd64_gcc493/external/python/2.7.6/lib/python2.7/tarfile.py", line 746, in readnormal
    return self.fileobj.read(size)
EOFError: compressed file ended before the logical end-of-stream was detected
[2] /data/tier0/srv/wmagent/2.0.5/install/tier0/JobCreator/JobCache/Repack_Run288593_StreamPhysics/Repack/JobCollection_971608_0/job_3557741