Closed AndresTanasijczuk closed 8 years ago
Maybe it is important to add that vocms096 is having full disk problems by these days, which explains why node state file may not be available or even corrupted. The problem to understand is why information from other users task would leaked into.
Yes. it is not a bug, so what happened is that users pset config is printing really really really long lines, I put below only part of this line of his log files. 1 log file takes 193MB, so ~2000 jobs will take a lot of space on schedd. Limit per user on scheduler has to be implemented, but also need to know what will happen if quota exceeded and jobs should not be started, allowed to be resubmitted.
Just to improve, we have to change here: https://github.com/dmwm/CRABServer/blob/d0ca16f2b76b5f54f5e8a8d9f30d22e38233c018/scripts/CMSRunAnalysis.py#L314-L330 to something like:
linesTruncated = False
if len(line) > 2000:
linesTruncated = True
logOutLine = line[:2000]
After all:
if linesTruncated:
print 'INFO: Some lines have been truncated, because it lenght was too long (Max line lenght 2000chars). If you want to have full log files, please use save logs flag in your crab config.'
In that case cmsRun output will be limited up to ~8MB.
== CMSSW: Begin processing the 1st record. Run 1, Event 11367166, LumiSection 114184 at 03-Nov-2015 10:38:54.211 CET
== CMSSW: syst is noSyst resetting variable Event_weight before 1syst is noSyst resetting variable Event_nTightMuons before 0syst is noSyst resetting variable Event_nSoftMuons before 0syst is noSyst resetting variable Event_nLooseMuons before 0syst is noSyst resetting variable Event_nTightElectrons before 0syst is noSyst resetting variable Event_nMediumElectrons before 0syst is noSyst resetting variable Event_nLooseElectrons before 0syst is noSyst resetting variable Event_nVetoElectrons before 0syst is noSyst resetting variable Event_nElectronsSF before 0syst is noSyst resetting variable Event_mt before 0syst is noSyst resetting variable Event_Mt2w before 0syst is noSyst resetting variable Event_category before 5100syst is noSyst resetting variable Event_nMuonsSF before 0syst is noSyst resetting variable Event_nCSVTJets before 0syst is noSyst resetting variable Event_nCSVMJets before 0syst is noSyst resetting variable Event_nCSVLJets before 0syst is noSyst resetting variable Event_nTightJets before 0syst is noSyst resetting variable Event_nLooseJets before 0syst is noSyst resetting variable Event_nType1TopJets before 0syst is noSyst resetting variable Event_nType2TopJets before 0syst is noSyst resetting variable Event_Ht before 0syst is noSyst resetting variable Event_nGoodPV before 10syst is noSyst resetting variable Event_nPV before 10syst is noSyst resetting variable Event_nTruePV before 16syst is noSyst resetting variable Event_bWeight0CSVT before 0syst is noSyst resetting variable Event_bWeight1CSVT before 0.966252syst is noSyst resetting variable Event_bWeight2CSVT before 0syst is noSyst resetting variable Event_bWeight1_2CSVT before 0.939019syst is noSyst resetting variable Event_bWeight0CSVM before 0syst is noSyst resetting variable Event_bWeight1CSVM before 1.01244syst is noSyst resetting variable Event_bWeight2CSVM before 0syst is noSyst resetting variable Event_bWeight1_2CSVM before 0.979475syst is noSyst resetting variable Event_bWeight0CSVL before 0syst is noSyst resetting variable Event_bWeight1CSVL before 0syst is noSyst resetting variable Event_bWeight2CSVL before 0syst is noSyst resetting variable Event_bWeight1_2CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagDown0CSVT before 0syst is noSyst resetting variable Event_bWeightMisTagDown1CSVT before 1.00018syst is noSyst resetting variable Event_bWeightMisTagDown2CSVT before 0syst is noSyst resetting variable Event_bWeightMisTagDown1_2CSVT before 0.999627syst is noSyst resetting variable Event_bWeightMisTagDown0CSVM before 0syst is noSyst resetting variable Event_bWeightMisTagDown1CSVM before 1.0054syst is noSyst resetting variable Event_bWeightMisTagDown2CSVM before 0syst is noSyst resetting variable Event_bWeightMisTagDown1_2CSVM before 1.00168syst is noSyst resetting variable Event_bWeightMisTagDown0CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagDown1CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagDown2CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagDown1_2CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagUp0CSVT before 0syst is noSyst resetting variable Event_bWeightMisTagUp1CSVT before 0.999822syst is noSyst resetting variable Event_bWeightMisTagUp2CSVT before 0syst is noSyst resetting variable Event_bWeightMisTagUp1_2CSVT before 1.00037syst is noSyst resetting variable Event_bWeightMisTagUp0CSVM before 0syst is noSyst resetting variable Event_bWeightMisTagUp1CSVM before 0.994608syst is noSyst resetting variable Event_bWeightMisTagUp2CSVM before 0syst is noSyst resetting variable Event_bWeightMisTagUp1_2CSVM before 0.998306syst is noSyst resetting variable Event_bWeightMisTagUp0CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagUp1CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagUp2CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagUp1_2CSVL before 0syst is noSyst resetting variable Event_bWeightBTagUp0CSVT before 0syst is noSyst resetting variable Event_bWeightBTagUp1CSVT before 0.996965syst is noSyst resetting variable Event_bWeightBTagUp2CSVT before 0syst is noSyst resetting variable Event_bWeightBTagUp1_2CSVT before 0.993606syst is noSyst resetting variable Event_bWeightBTagUp0CSVM before 0syst is noSyst resetting variable Event_bWeightBTagUp1CSVM before 0.993149syst is noSyst resetting variable Event_bWeightBTagUp2CSVM before 0syst is noSyst resetting variable Event_bWeightBTagUp1_2CSVM before 1.00909syst is noSyst resetting variable Event_bWeightBTagUp0CSVL before 0syst is noSyst resetting variable Event_bWeightBTagUp1CSVL before 0syst is noSyst resetting variable Event_bWeightBTagUp2CSVL before 0syst is noSyst resetting variable Event_bWeightBTagUp1_2CSVL before 0syst is noSyst resetting variable Event_bWeightBTagDown0CSVT before 0syst is noSyst resetting variable Event_bWeightBTagDown1CSVT before 0.930163syst is noSyst resetting variable Event_bWeightBTagDown2CSVT before 0syst is noSyst resetting variable Event_bWeightBTagDown1_2CSVT before 0.882369syst is noSyst resetting variable Event_bWeightBTagDown0CSVM before 0syst is noSyst resetting variable Event_bWeightBTagDown1CSVM before 1.02449syst is noSyst resetting variable Event_bWeightBTagDown2CSVM before 0syst is noSyst resetting variable Event_bWeightBTagDown1_2CSVM before 0.947675syst is noSyst resetting variable Event_bWeightBTagDown0CSVL before 0syst is noSyst resetting variable Event_bWeightBTagDown1CSVL before 0syst is noSyst resetting variable Event_bWeightBTagDown2CSVL before 0syst is noSyst resetting variable Event_bWeightBTagDown1_2CSVL before 0syst is noSyst resetting variable Event_Lepton1_Pt before 0syst is noSyst resetting variable Event_Lepton1_Eta before 0syst is noSyst resetting variable Event_Lepton1_Phi before 0syst is noSyst resetting variable Event_Lepton1_E before 0syst is noSyst resetting variable Event_Lepton1_Charge before 0syst is noSyst resetting variable Event_Lepton1_Flavour before 0syst is noSyst resetting variable Event_Lepton2_Pt before 0syst is noSyst resetting variable Event_Lepton2_Eta before 0syst is noSyst resetting variable
@juztas I don't understand what it has to do the issue you describe about the line length in CMSSW logs with the issue of "foreign tasks information leaking into crab status". Can you explain?
you have to find root of the problem and not the leafs. So if schedd is 100% full of disk, anything can happen and the root problem of all was this users task wit unlimited len of line. By anything I mean: node state corrupted, task submitted and fail without good reasons, postjobs fail, fds leaking, memory increasing, swapping, anything... So if this happens on other schedds I would worry, but if only this, which is due to known root problem.
As a non-expert, it is just hard for me to believe that "full disk" would cause the effect described in this issue, in the sense that I would believe the OS (or whatever) should have been designed to "correctly fail" in this situation. But if you say that the issue described here is possible, then ok. Should we close this issue then and re-open if the problem appears again?
I will take a deeper look in particular task, I can be wrong also. Just leave it open for the next month
I am sure we will remember this if it comes again. So I propose to close, and forget until then. can close myself once I have web, not now
Feel free to close.
no. It happened on other schedd.
2015-11-06 12:06 GMT+01:00 Andres Tanasijczuk notifications@github.com:
Feel free to close.
— Reply to this email directly or view it on GitHub https://github.com/dmwm/CRABServer/issues/5066#issuecomment-154382301.
Justas Balcas
mh.. ok, first need to read mail. Feel free to close. Another problem seems that schedd is was not updated in DB for that task.
sorry, ignore. mixed two diff tasks...
I think I got a corrupted task status output myself:
[14:19:40][atanasi@lxplus0072][/afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/]$ crab remake --task 151111_120152:piet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Warning: Incompatible CRABClient version development
Server is saying that compatible versions are: ['3.3.1506', '3.3.1507', '3.3.1509', '3.3.1510', '3.3.1511', '3.3.1511.rc1']
Remaking crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM folder.
Remaking .requestcache file.
Success: Finished remaking project directory crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Log file is /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab.log
[14:19:58][atanasi@lxplus0072][/afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/]$ crab status --dir crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Warning: Incompatible CRABClient version development
Server is saying that compatible versions are: ['3.3.1506', '3.3.1507', '3.3.1509', '3.3.1510', '3.3.1511', '3.3.1511.rc1']
CRAB project directory: /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Task name: 151111_120152:piet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Grid scheduler: crab3-5@vocms059.cern.ch
Task status:
Dashboard monitoring URL: http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=crab&refresh=0&table=Jobs&p=1&records=25&activemenu=2&status=&site=&tid=151111_120152%3Apiet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Warning: Extracting files from sandbox failed, ops monitor will not work.
No jobs created yet!
Log file is /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM/crab.log
[14:20:22][atanasi@lxplus0072][/afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/]$ crab status --dir crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Warning: Incompatible CRABClient version development
Server is saying that compatible versions are: ['3.3.1506', '3.3.1507', '3.3.1509', '3.3.1510', '3.3.1511', '3.3.1511.rc1']
CRAB project directory: /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Task name: 151111_120152:piet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Grid scheduler: crab3-7@vocms0114.cern.ch
Task status: QUEUED
Dashboard monitoring URL: http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=crab&refresh=0&table=Jobs&p=1&records=25&activemenu=2&status=&site=&tid=151111_120152%3Apiet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Jobs status: failed 39.4% ( 394/1000)
finished 60.6% ( 606/1000)
Publication status: idle 1.6% ( 16/1000)
finished 60.6% ( 606/1000)
unsubmitted 37.8% ( 378/1000)
Output dataset: /DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM/piet-crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM-e5c9e27b083729a42bdb88165df5771f/USER
Output dataset DAS URL: https://cmsweb.cern.ch/das/request?input=%2FDYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM%2Fpiet-crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM-e5c9e27b083729a42bdb88165df5771f%2FUSER&instance=prod%2Fphys03
Error Summary: (use --verboseErrors for details about the errors)
357 jobs failed with exit code -1
21 jobs failed with exit code 8020
16 jobs failed in postprocessing step
Have a look at https://twiki.cern.ch/twiki/bin/viewauth/CMSPublic/JobExitCodes for a description of the exit codes.
Log file is /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM/crab.log
OK, we have a real (though hopefully rare) problem then. @AndresTanasijczuk did it get back to normal by itself in subsequent crab status ? or is it stuck with the wrong task info ?
It came back to normal in next crab status. You can see it in the output I pasted, the first crab status is the corrupted one (shows no task status, wrong schedd and a warning message Extracting files from sandbox failed, ops monitor will not work.
) while the second crab status is ok.
but in what you pasted task name always refer to user "piet" 151111_120152:piet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
maybe the problem is that schedd name is the wrong one, not that it picks a different task ?
yes, but this info doesn't come from the server, nor the dashboard URL
?? do you want to say that the wrong schedd was picked in the client note, not server-side ? that's difficult to tell w/o more info, no ?
The schedd info comes from the server. The warning message also. But the taskname and the dashboard URL are printed by the client with no info needed from the server.
OK. But then there is no indication of "foreign task information". Only that it picked wrong schedd. I guess if schedd name is wrong it explains that sandbox could not be opened.
Simply that printing info from a different task is somehow worse than simply printing an odd error. And given that it self-recover.. I suspect we'll have to learn to live with this until it is fixed when code is modified for some other reason.
Another user who reported this: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1144.html
Another user who reported the same issue: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1147.html
I had a look at this, I think the bug was introduced in this commit where I changed result
to self.result
: https://github.com/dmwm/CRABServer/commit/a8015cc9202b5298636c64cadd5c5ac190d3d762
Since there are multiple threads sharing the same instance of the HTCondorDataWorkflow class, then multiple status requests ends up rewriting the same variable. I think I can provide a fix for the december release.
great! thanks Marco !
Let's make a mental note that we need to train current and (expecially) future persons in CRAB development and support on how to write, check, detect thread (un)safe code.
One of the difficult thing here is that it is not clear that there are multiple threads looking at the CRAB3 code only. One has to dig into WMCore and look at the code of Lassi's REST interface to get it. Or in general you have to know that each cmsweb backend is multithreaded. Then you need a little bit of fantasy to get concurrency issues. I doubt something like that can be teached.
We can put some rules, like never use member (self
) variables in REST methods, and document how the system work. But nothing more than that imho.
Anyway, I habve a patch for this, I'll create the December tags ASAP.
Already two users reported that 'crab status' (I think both happened with 3.3.1511) showed a status of some other task, from some other user. For example: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1082/1/1/1/1/1/1/1/1/1/1/1/1.html whose relevant part I am pasting below:
The other user reporting weird crab status output is https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1114.html