dmwm / CRABClient

15 stars 36 forks source link

crab getoutput mixes up wrong options for copy command #4658

Open belforte opened 7 years ago

belforte commented 7 years ago

Something maybe due to "thread confusion" ? See following[1] fragment of crab.log, some copy command have the correct -t <timeout> while all of a sudden one queue command has --srm-timeout <timeout> which is the one for lcg-copy. This makes no sense from the code, there is no indication that the option could be different from one file to the next one. https://github.com/dmwm/CRABClient/blob/master/src/python/CRABClient/Commands/remote_copy.py#L147-L150

The hypothesis of thread confusion is also because earlier in the log I find [2]

[1]


DEBUG 2016-12-20 15:39:00,305:      INFO 2016-12-20 15:12:57,048:    Success: Success in retrieving JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_289.root 
DEBUG 2016-12-20 15:39:00,305:      INFO 2016-12-20 15:12:57,049:    Retrieving JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_295.root 
DEBUG 2016-12-20 15:39:00,305:      DEBUG 2016-12-20 15:12:57,050:   Executing env -i X509_USER_PROXY=/tmp/x509up_u50115 gfal-copy -v -K ADLER32  -T 1800  -t 1898  srm://srm.ihep.ac.cn:8443/srm/managerv2?SFN=/pnfs/ihep.ac.cn/data/cms/store/user/bianjg/Charmonium/JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d/161214_130855/0000/JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_295.root file:///publicfs/cms/data/Psi2S/work/CRAB_2_6_2/JOBS/QuadriMuon_v28/JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385/crab_projects/crab_JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d/results/JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_295.root
DEBUG 2016-12-20 15:39:00,305:      DEBUG 2016-12-20 15:12:57,334:   Finish executing for file JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_299.root
DEBUG 2016-12-20 15:39:00,305:      INFO 2016-12-20 15:12:57,335:    Success: Success in retrieving JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_299.root 
DEBUG 2016-12-20 15:39:00,305:      INFO 2016-12-20 15:12:57,335:    Retrieving JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_372.root 
DEBUG 2016-12-20 15:39:00,305:      DEBUG 2016-12-20 15:12:57,335:   Executing env -i X509_USER_PROXY=/tmp/x509up_u50115 gfal-copy -v -K ADLER32  -T 1800  --srm-timeout 721  srm://srm.ihep.ac.cn:8443/srm/managerv2?SFN=/pnfs/ihep.ac.cn/data/cms/store/user/bianjg/Charmonium/JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d/161214_130855/0000/JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_372.root file:///publicfs/cms/data/Psi2S/work/CRAB_2_6_2/JOBS/QuadriMuon_v28/JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385/crab_projects/crab_JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d/results/JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_372.root
DEBUG 2016-12-20 15:39:00,305:      DEBUG 2016-12-20 15:12:57,391:   Finish executing for file JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_372.root
DEBUG 2016-12-20 15:39:00,305:      INFO 2016-12-20 15:12:57,392:    Warning: Failed retrieving JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_372.root
DEBUG 2016-12-20 15:39:00,305:      INFO 2016-12-20 15:12:57,392:        usage: gfal-copy [-h] [-v] [-v] [-d definition] [-t timeout] [-e cert]
DEBUG 2016-12-20 15:39:00,305:      INFO 2016-12-20 15:12:57,392:         --srm-timeout
DEBUG 2016-12-20 15:39:00,305:      INFO 2016-12-20 15:12:57,392:                         [-s src_spacetoken] [-s dst_spacetoken] [-t transfer_timeout]
DEBUG 2016-12-20 15:39:00,305:      DEBUG 2016-12-20 15:12:57,393:   Full stderr follows:
DEBUG 2016-12-20 15:39:00,305:      usage: gfal-copy [-h] [-V] [-v] [-D DEFINITION] [-t TIMEOUT] [-E CERT]
DEBUG 2016-12-20 15:39:00,305:                       [--key KEY] [-4] [-6] [-C CLIENT_INFO] [-f] [-p]
DEBUG 2016-12-20 15:39:00,305:                       [-n NBSTREAMS] [--tcp-buffersize TCP_BUFFERSIZE]
DEBUG 2016-12-20 15:39:00,305:                       [-s SRC_SPACETOKEN] [-S DST_SPACETOKEN] [-T TRANSFER_TIMEOUT]
DEBUG 2016-12-20 15:39:00,305:                       [-K CHECKSUM] [--from-file FROM_FILE] [--just-copy] [-r]
DEBUG 2016-12-20 15:39:00,305:                       [--abort-on-failure] [--dry-run]
DEBUG 2016-12-20 15:39:00,305:                       [src] dst [dst ...]
DEBUG 2016-12-20 15:39:00,305:      gfal-copy: error: unrecognized arguments: --srm-timeout
DEBUG 2016-12-20 15:39:00,305:      

[2]

DEBUG 2016-12-20 15:39:00,285:      INFO 2016-12-20 15:11:45,871:    Warning: Failed retrieving JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_315.root
DEBUG 2016-12-20 15:39:00,285:      INFO 2016-12-20 15:11:45,871:         can't start new thread
DEBUG 2016-12-20 15:39:00,285:      DEBUG 2016-12-20 15:11:45,872:   Full stderr follows:
DEBUG 2016-12-20 15:39:00,285:      Traceback (most recent call last):
DEBUG 2016-12-20 15:39:00,285:        File "/usr/bin/gfal-copy", line 12, in <module>
DEBUG 2016-12-20 15:39:00,285:          sys.exit(Gfal2Shell().main(sys.argv))
DEBUG 2016-12-20 15:39:00,285:        File "/usr/lib/python2.6/site-packages/gfal2_util/shell.py", line 71, in main
DEBUG 2016-12-20 15:39:00,285:          return inst.execute(command_func)
DEBUG 2016-12-20 15:39:00,285:        File "/usr/lib/python2.6/site-packages/gfal2_util/base.py", line 151, in execute
DEBUG 2016-12-20 15:39:00,285:          t_main.start()
DEBUG 2016-12-20 15:39:00,285:        File "/usr/lib64/python2.6/threading.py", line 474, in start
DEBUG 2016-12-20 15:39:00,285:          _start_new_thread(self.__bootstrap, ())
DEBUG 2016-12-20 15:39:00,285:      thread.error: can't start new thread
DEBUG 2016-12-20 15:39:00,285:      
DEBUG 2016-12-20 15:39:00,285:      DEBUG 2016-12-20 15:11:46,105:   Finish executing for file JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_221.root
DEBUG 2016-12-20 15:39:00,285:      INFO 2016-12-20 15:11:46,106:    Warning: Failed retrieving JpsiDMuon_v28d-Run2016G-PromptReco-v1-278819-280385d_221.root
DEBUG 2016-12-20 15:39:00,285:      INFO 2016-12-20 15:11:46,106:         can't start new thread
DEBUG 2016-12-20 15:39:00,285:      DEBUG 2016-12-20 15:11:46,107:   Full stderr follows:
DEBUG 2016-12-20 15:39:00,285:      Traceback (most recent call last):
DEBUG 2016-12-20 15:39:00,285:        File "/usr/bin/gfal-copy", line 12, in <module>
DEBUG 2016-12-20 15:39:00,286:          sys.exit(Gfal2Shell().main(sys.argv))
DEBUG 2016-12-20 15:39:00,286:        File "/usr/lib/python2.6/site-packages/gfal2_util/shell.py", line 71, in main
DEBUG 2016-12-20 15:39:00,286:          return inst.execute(command_func)
DEBUG 2016-12-20 15:39:00,286:        File "/usr/lib/python2.6/site-packages/gfal2_util/base.py", line 151, in execute
DEBUG 2016-12-20 15:39:00,286:          t_main.start()
DEBUG 2016-12-20 15:39:00,286:        File "/usr/lib64/python2.6/threading.py", line 474, in start
DEBUG 2016-12-20 15:39:00,286:          _start_new_thread(self.__bootstrap, ())
DEBUG 2016-12-20 15:39:00,286:      thread.error: can't start new thread
DEBUG 2016-12-20 15:39:00,286:      
belforte commented 7 years ago

reference: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/2450/1.html

Also I do not like that gfal-copy is called with decreasing timeouts, but that's a different story. The management of the timeout here is not friendly, but maybe we can simply put better messages and a bit of documentation to push users into tuning the amount of files to copy in each command.