dmwm / CRABServer

15 stars 38 forks source link

foreign tasks information leaking into crab status ??? #5066

Closed AndresTanasijczuk closed 8 years ago

AndresTanasijczuk commented 8 years ago

Already two users reported that 'crab status' (I think both happened with 3.3.1511) showed a status of some other task, from some other user. For example: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1082/1/1/1/1/1/1/1/1/1/1/1/1.html whose relevant part I am pasting below:

[ptan@cmslpc26 test]$ crab status
CRAB project directory:         /uscms_data/d2/ptan/work/sl6/development/CMSSW_5_3_30/src/Analysis_GEN/WZEdmAnalyzer/test/crab_projects/crab_20151105_224416
Task name:                      151106_044432:ptan_crab_20151105_224416
Grid scheduler:                 crab3test-3@vocms096.cern.ch
Task status:                    SUBMITTED
Dashboard monitoring URL:       http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=crab&refresh=0&table=Jobs&p=1&records=25&activemenu=2&status=&site=&tid=151106_044432%3Aptan_crab_20151105_224416
Error retrieving task status:   Node status file not currently available. Retry in a minute if you just submitted the task.

Jobs status:                    failed          5.9% ( 301/5128)
                                finished       94.1% (4824/5128)
                                running         0.1% (   3/5128)

Publication status:             idle            0.0% (   1/5128)
                                finished      100.0% (5127/5128)

Output dataset:                 /SingleMuon/ahart-Run2015D-05Oct2015-v1-v1-317ea5f9757787101047d4cf35c3fe60/USER
Output dataset DAS URL:         https://cmsweb.cern.ch/das/request?input=%2FSingleMuon%2Fahart-Run2015D-05Oct2015-v1-v1-317ea5f9757787101047d4cf35c3fe60%2FUSER&instance=prod%2Fphys03

Error Summary: (use --verboseErrors for details about the errors)

 301 jobs failed in postprocessing step

Have a look at https://twiki.cern.ch/twiki/bin/viewauth/CMSPublic/JobExitCodes for a description of the exit codes.
Log file is /uscms_data/d2/ptan/work/sl6/development/CMSSW_5_3_30/src/Analysis_GEN/WZEdmAnalyzer/test/crab_projects/crab_20151105_224416/crab.log

The other user reporting weird crab status output is https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1114.html

AndresTanasijczuk commented 8 years ago

Maybe it is important to add that vocms096 is having full disk problems by these days, which explains why node state file may not be available or even corrupted. The problem to understand is why information from other users task would leaked into.

juztas commented 8 years ago

Yes. it is not a bug, so what happened is that users pset config is printing really really really long lines, I put below only part of this line of his log files. 1 log file takes 193MB, so ~2000 jobs will take a lot of space on schedd. Limit per user on scheduler has to be implemented, but also need to know what will happen if quota exceeded and jobs should not be started, allowed to be resubmitted.

Just to improve, we have to change here: https://github.com/dmwm/CRABServer/blob/d0ca16f2b76b5f54f5e8a8d9f30d22e38233c018/scripts/CMSRunAnalysis.py#L314-L330 to something like:

linesTruncated = False
if len(line) > 2000:
    linesTruncated = True
    logOutLine = line[:2000]

After all:
if linesTruncated:
    print 'INFO: Some lines have been truncated, because it lenght was too long (Max line lenght 2000chars). If you want to have full log files, please use save logs flag in your crab config.'

In that case cmsRun output will be limited up to ~8MB.

== CMSSW:  Begin processing the 1st record. Run 1, Event 11367166, LumiSection 114184 at 03-Nov-2015 10:38:54.211 CET
== CMSSW:  syst is noSyst resetting variable Event_weight before 1syst is noSyst resetting variable Event_nTightMuons before 0syst is noSyst resetting variable Event_nSoftMuons before 0syst is noSyst resetting variable Event_nLooseMuons before 0syst is noSyst resetting variable Event_nTightElectrons before 0syst is noSyst resetting variable Event_nMediumElectrons before 0syst is noSyst resetting variable Event_nLooseElectrons before 0syst is noSyst resetting variable Event_nVetoElectrons before 0syst is noSyst resetting variable Event_nElectronsSF before 0syst is noSyst resetting variable Event_mt before 0syst is noSyst resetting variable Event_Mt2w before 0syst is noSyst resetting variable Event_category before 5100syst is noSyst resetting variable Event_nMuonsSF before 0syst is noSyst resetting variable Event_nCSVTJets before 0syst is noSyst resetting variable Event_nCSVMJets before 0syst is noSyst resetting variable Event_nCSVLJets before 0syst is noSyst resetting variable Event_nTightJets before 0syst is noSyst resetting variable Event_nLooseJets before 0syst is noSyst resetting variable Event_nType1TopJets before 0syst is noSyst resetting variable Event_nType2TopJets before 0syst is noSyst resetting variable Event_Ht before 0syst is noSyst resetting variable Event_nGoodPV before 10syst is noSyst resetting variable Event_nPV before 10syst is noSyst resetting variable Event_nTruePV before 16syst is noSyst resetting variable Event_bWeight0CSVT before 0syst is noSyst resetting variable Event_bWeight1CSVT before 0.966252syst is noSyst resetting variable Event_bWeight2CSVT before 0syst is noSyst resetting variable Event_bWeight1_2CSVT before 0.939019syst is noSyst resetting variable Event_bWeight0CSVM before 0syst is noSyst resetting variable Event_bWeight1CSVM before 1.01244syst is noSyst resetting variable Event_bWeight2CSVM before 0syst is noSyst resetting variable Event_bWeight1_2CSVM before 0.979475syst is noSyst resetting variable Event_bWeight0CSVL before 0syst is noSyst resetting variable Event_bWeight1CSVL before 0syst is noSyst resetting variable Event_bWeight2CSVL before 0syst is noSyst resetting variable Event_bWeight1_2CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagDown0CSVT before 0syst is noSyst resetting variable Event_bWeightMisTagDown1CSVT before 1.00018syst is noSyst resetting variable Event_bWeightMisTagDown2CSVT before 0syst is noSyst resetting variable Event_bWeightMisTagDown1_2CSVT before 0.999627syst is noSyst resetting variable Event_bWeightMisTagDown0CSVM before 0syst is noSyst resetting variable Event_bWeightMisTagDown1CSVM before 1.0054syst is noSyst resetting variable Event_bWeightMisTagDown2CSVM before 0syst is noSyst resetting variable Event_bWeightMisTagDown1_2CSVM before 1.00168syst is noSyst resetting variable Event_bWeightMisTagDown0CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagDown1CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagDown2CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagDown1_2CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagUp0CSVT before 0syst is noSyst resetting variable Event_bWeightMisTagUp1CSVT before 0.999822syst is noSyst resetting variable Event_bWeightMisTagUp2CSVT before 0syst is noSyst resetting variable Event_bWeightMisTagUp1_2CSVT before 1.00037syst is noSyst resetting variable Event_bWeightMisTagUp0CSVM before 0syst is noSyst resetting variable Event_bWeightMisTagUp1CSVM before 0.994608syst is noSyst resetting variable Event_bWeightMisTagUp2CSVM before 0syst is noSyst resetting variable Event_bWeightMisTagUp1_2CSVM before 0.998306syst is noSyst resetting variable Event_bWeightMisTagUp0CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagUp1CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagUp2CSVL before 0syst is noSyst resetting variable Event_bWeightMisTagUp1_2CSVL before 0syst is noSyst resetting variable Event_bWeightBTagUp0CSVT before 0syst is noSyst resetting variable Event_bWeightBTagUp1CSVT before 0.996965syst is noSyst resetting variable Event_bWeightBTagUp2CSVT before 0syst is noSyst resetting variable Event_bWeightBTagUp1_2CSVT before 0.993606syst is noSyst resetting variable Event_bWeightBTagUp0CSVM before 0syst is noSyst resetting variable Event_bWeightBTagUp1CSVM before 0.993149syst is noSyst resetting variable Event_bWeightBTagUp2CSVM before 0syst is noSyst resetting variable Event_bWeightBTagUp1_2CSVM before 1.00909syst is noSyst resetting variable Event_bWeightBTagUp0CSVL before 0syst is noSyst resetting variable Event_bWeightBTagUp1CSVL before 0syst is noSyst resetting variable Event_bWeightBTagUp2CSVL before 0syst is noSyst resetting variable Event_bWeightBTagUp1_2CSVL before 0syst is noSyst resetting variable Event_bWeightBTagDown0CSVT before 0syst is noSyst resetting variable Event_bWeightBTagDown1CSVT before 0.930163syst is noSyst resetting variable Event_bWeightBTagDown2CSVT before 0syst is noSyst resetting variable Event_bWeightBTagDown1_2CSVT before 0.882369syst is noSyst resetting variable Event_bWeightBTagDown0CSVM before 0syst is noSyst resetting variable Event_bWeightBTagDown1CSVM before 1.02449syst is noSyst resetting variable Event_bWeightBTagDown2CSVM before 0syst is noSyst resetting variable Event_bWeightBTagDown1_2CSVM before 0.947675syst is noSyst resetting variable Event_bWeightBTagDown0CSVL before 0syst is noSyst resetting variable Event_bWeightBTagDown1CSVL before 0syst is noSyst resetting variable Event_bWeightBTagDown2CSVL before 0syst is noSyst resetting variable Event_bWeightBTagDown1_2CSVL before 0syst is noSyst resetting variable Event_Lepton1_Pt before 0syst is noSyst resetting variable Event_Lepton1_Eta before 0syst is noSyst resetting variable Event_Lepton1_Phi before 0syst is noSyst resetting variable Event_Lepton1_E before 0syst is noSyst resetting variable Event_Lepton1_Charge before 0syst is noSyst resetting variable Event_Lepton1_Flavour before 0syst is noSyst resetting variable Event_Lepton2_Pt before 0syst is noSyst resetting variable Event_Lepton2_Eta before 0syst is noSyst resetting variable
AndresTanasijczuk commented 8 years ago

@juztas I don't understand what it has to do the issue you describe about the line length in CMSSW logs with the issue of "foreign tasks information leaking into crab status". Can you explain?

juztas commented 8 years ago

you have to find root of the problem and not the leafs. So if schedd is 100% full of disk, anything can happen and the root problem of all was this users task wit unlimited len of line. By anything I mean: node state corrupted, task submitted and fail without good reasons, postjobs fail, fds leaking, memory increasing, swapping, anything... So if this happens on other schedds I would worry, but if only this, which is due to known root problem.

AndresTanasijczuk commented 8 years ago

As a non-expert, it is just hard for me to believe that "full disk" would cause the effect described in this issue, in the sense that I would believe the OS (or whatever) should have been designed to "correctly fail" in this situation. But if you say that the issue described here is possible, then ok. Should we close this issue then and re-open if the problem appears again?

juztas commented 8 years ago

I will take a deeper look in particular task, I can be wrong also. Just leave it open for the next month

belforte commented 8 years ago

I am sure we will remember this if it comes again. So I propose to close, and forget until then. can close myself once I have web, not now

AndresTanasijczuk commented 8 years ago

Feel free to close.

juztas commented 8 years ago

no. It happened on other schedd.

2015-11-06 12:06 GMT+01:00 Andres Tanasijczuk notifications@github.com:

Feel free to close.

— Reply to this email directly or view it on GitHub https://github.com/dmwm/CRABServer/issues/5066#issuecomment-154382301.

Justas Balcas

juztas commented 8 years ago

mh.. ok, first need to read mail. Feel free to close. Another problem seems that schedd is was not updated in DB for that task.

juztas commented 8 years ago

sorry, ignore. mixed two diff tasks...

AndresTanasijczuk commented 8 years ago

I think I got a corrupted task status output myself:

[14:19:40][atanasi@lxplus0072][/afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/]$ crab remake --task 151111_120152:piet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Warning: Incompatible CRABClient version development
Server is saying that compatible versions are: ['3.3.1506', '3.3.1507', '3.3.1509', '3.3.1510', '3.3.1511', '3.3.1511.rc1']
Remaking crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM folder.
Remaking .requestcache file.
Success: Finished remaking project directory crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Log file is /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab.log

[14:19:58][atanasi@lxplus0072][/afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/]$ crab status --dir crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Warning: Incompatible CRABClient version development
Server is saying that compatible versions are: ['3.3.1506', '3.3.1507', '3.3.1509', '3.3.1510', '3.3.1511', '3.3.1511.rc1']
CRAB project directory:     /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Task name:          151111_120152:piet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Grid scheduler:         crab3-5@vocms059.cern.ch
Task status:            
Dashboard monitoring URL:   http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=crab&refresh=0&table=Jobs&p=1&records=25&activemenu=2&status=&site=&tid=151111_120152%3Apiet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Warning:            Extracting files from sandbox failed, ops monitor will not work.

No jobs created yet!
Log file is /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM/crab.log

[14:20:22][atanasi@lxplus0072][/afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/]$ crab status --dir crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Warning: Incompatible CRABClient version development
Server is saying that compatible versions are: ['3.3.1506', '3.3.1507', '3.3.1509', '3.3.1510', '3.3.1511', '3.3.1511.rc1']
CRAB project directory:     /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Task name:          151111_120152:piet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM
Grid scheduler:         crab3-7@vocms0114.cern.ch
Task status:            QUEUED
Dashboard monitoring URL:   http://dashb-cms-job.cern.ch/dashboard/templates/task-analysis/#user=crab&refresh=0&table=Jobs&p=1&records=25&activemenu=2&status=&site=&tid=151111_120152%3Apiet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM

Jobs status:            failed         39.4% ( 394/1000)
                finished       60.6% ( 606/1000)

Publication status:     idle            1.6% (  16/1000)
                finished       60.6% ( 606/1000)
                unsubmitted    37.8% ( 378/1000)

Output dataset:         /DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM/piet-crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM-e5c9e27b083729a42bdb88165df5771f/USER
Output dataset DAS URL:     https://cmsweb.cern.ch/das/request?input=%2FDYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM%2Fpiet-crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM-e5c9e27b083729a42bdb88165df5771f%2FUSER&instance=prod%2Fphys03

Error Summary: (use --verboseErrors for details about the errors)

 357 jobs failed with exit code -1

  21 jobs failed with exit code 8020

  16 jobs failed in postprocessing step

Have a look at https://twiki.cern.ch/twiki/bin/viewauth/CMSPublic/JobExitCodes for a description of the exit codes.
Log file is /afs/cern.ch/work/a/atanasi/crab3/tasks_from_users/crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM/crab.log
belforte commented 8 years ago

OK, we have a real (though hopefully rare) problem then. @AndresTanasijczuk did it get back to normal by itself in subsequent crab status ? or is it stuck with the wrong task info ?

AndresTanasijczuk commented 8 years ago

It came back to normal in next crab status. You can see it in the output I pasted, the first crab status is the corrupted one (shows no task status, wrong schedd and a warning message Extracting files from sandbox failed, ops monitor will not work.) while the second crab status is ok.

belforte commented 8 years ago

but in what you pasted task name always refer to user "piet" 151111_120152:piet_crab_DYToMuMu_M-20_TuneZ2star_14TeV-pythia6-tauola_HGCALGS_GEN-SIM

belforte commented 8 years ago

maybe the problem is that schedd name is the wrong one, not that it picks a different task ?

AndresTanasijczuk commented 8 years ago

yes, but this info doesn't come from the server, nor the dashboard URL

belforte commented 8 years ago

?? do you want to say that the wrong schedd was picked in the client note, not server-side ? that's difficult to tell w/o more info, no ?

AndresTanasijczuk commented 8 years ago

The schedd info comes from the server. The warning message also. But the taskname and the dashboard URL are printed by the client with no info needed from the server.

belforte commented 8 years ago

OK. But then there is no indication of "foreign task information". Only that it picked wrong schedd. I guess if schedd name is wrong it explains that sandbox could not be opened.

Simply that printing info from a different task is somehow worse than simply printing an odd error. And given that it self-recover.. I suspect we'll have to learn to live with this until it is fixed when code is modified for some other reason.

AndresTanasijczuk commented 8 years ago

Another user who reported this: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1144.html

AndresTanasijczuk commented 8 years ago

Another user who reported the same issue: https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/1147.html

mmascher commented 8 years ago

I had a look at this, I think the bug was introduced in this commit where I changed result to self.result : https://github.com/dmwm/CRABServer/commit/a8015cc9202b5298636c64cadd5c5ac190d3d762

Since there are multiple threads sharing the same instance of the HTCondorDataWorkflow class, then multiple status requests ends up rewriting the same variable. I think I can provide a fix for the december release.

belforte commented 8 years ago

great! thanks Marco !

belforte commented 8 years ago

Let's make a mental note that we need to train current and (expecially) future persons in CRAB development and support on how to write, check, detect thread (un)safe code.

mmascher commented 8 years ago

One of the difficult thing here is that it is not clear that there are multiple threads looking at the CRAB3 code only. One has to dig into WMCore and look at the code of Lassi's REST interface to get it. Or in general you have to know that each cmsweb backend is multithreaded. Then you need a little bit of fantasy to get concurrency issues. I doubt something like that can be teached.

We can put some rules, like never use member (self) variables in REST methods, and document how the system work. But nothing more than that imho.

Anyway, I habve a patch for this, I'll create the December tags ASAP.