HEP-KBFI / tth-htt

code and python config files for ttH, H -> tautau analysis with matrix element techniques

hadd doesn't seem to work if analysis .root files are stored on /hdfs #17

Closed: veelken closed this issue 7 years ago

veelken commented 7 years ago

Hi Margus,

the hadd step does not seem to work if the .root files produced by running the analysis code are stored on /hdfs. The error message is attached below [*] (from /hdfs/local/ttH_2tau/veelken/ttHAnalysis/2016/2017Mar13/stderr_jetToTauFakeRate.log).

A separate problem (not a bug, but an important missing feature) is that errors which occur when running the analysis code [**] are neither logged in the main stderr file /hdfs/local/ttH_2tau/veelken/ttHAnalysis/2016/2017Mar13/stderr_jetToTauFakeRate.log nor cause an immediate abort of the workflow. I discovered the problem [**] only because some histograms were missing in the final (large) hadd file; I then had to look at the intermediate (smaller) hadd files, then at the .root files produced by the analysis code, and finally at the analysis log-file [**] to find the cause. The error [**] is fatal in the sense that there is no chance for the workflow to finish successfully once a single analysis job aborts with an exception. It would be very useful if you could implement a logic such that, once an analysis job finishes on the batch system, the log-file produced by the job is checked and the workflow is terminated immediately in case an exception occurred (or the .root file produced by the analysis job is corrupt).

Cheers,

Christian


[*]

Traceback (most recent call last):
  File "/home/veelken/ttHAnalysis/2016/2017Mar13/scripts/jetToTauFakeRate/sbatch_hadd_jetToTauFakeRate_stage1_DoubleMuon_Run2016E_v1_OS.py", line 8, in
    waitForJobs=True,
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 73, in hadd_in_cluster
    cluster_histogram_aggregator.create_output_histogram()
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 25, in create_output_histogram
    maximum_histograms_in_batch=self.maximum_histograms_in_batch
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 61, in aggregate
    output_histogram=output_histogram
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 207, in hadd_on_cluster_node
    output_dir=output_dir
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 194, in submit_job_version2
    job_id = sbatch_command_result.split()[-1]
IndexError: list index out of range
make: *** [/hdfs/local/ttH_2tau/veelken/ttHAnalysis/2016/2017Mar13/histograms/jetToTauFakeRate/histograms_harvested_stage1_jetToTauFakeRate_DoubleMuon_Run2016E_v1_OS.root] Error 1
make: Waiting for unfinished jobs....
Traceback (most recent call last):
  File "/home/veelken/ttHAnalysis/2016/2017Mar13/scripts/jetToTauFakeRate/sbatch_hadd_jetToTauFakeRate_stage1_MuonEG_Run2016C_v1_OS.py", line 8, in
    waitForJobs=True,
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 73, in hadd_in_cluster
    cluster_histogram_aggregator.create_output_histogram()
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 25, in create_output_histogram
    maximum_histograms_in_batch=self.maximum_histograms_in_batch
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 87, in aggregate
    output_histogram=final_output_histogram
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 207, in hadd_on_cluster_node
    output_dir=output_dir
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 194, in submit_job_version2
    job_id = sbatch_command_result.split()[-1]
IndexError: list index out of range
make: *** [/hdfs/local/ttH_2tau/veelken/ttHAnalysis/2016/2017Mar13/histograms/jetToTauFakeRate/histograms_harvested_stage1_jetToTauFakeRate_MuonEG_Run2016C_v1_OS.root] Error 1

[**]

/home/veelken/ttHAnalysis/2016/2017Mar13/logs/jetToTauFakeRate/OS/TTZToLL_M10_ext1/analyze_jetToTauFakeRate_TTZToLL_M10_ext1_OS_central_2_executable.log

mxrguspxrt commented 7 years ago
  1. Are all Slurm tasks in use?
  2. Does Slurm complain that the file does not exist?
  3. What causes this? Log it better.
  4. Run the code again; we should then see the exact error.

We expect the return value to be of the form "somevalue 0", but it has a different format.

ktht commented 7 years ago

My 0.02:

The conclusions you draw from the error logs in [*] are not exactly right, IMO. Even though the file [1] gives the impression that the hadd command is carried out directly on /hdfs, that's not the case. If you follow the code, you can see that it calls sbatchManager::hadd_in_cluster -> ClusterHistogramAggregator::create_output_histogram -> aggregate -> hadd_on_cluster_node -> sbatchManager::submit_job_version2, so it always performs the hadd on /scratch and then copies the output file over to /hdfs.

My guess is that there was an error on the SLURM side, which didn't return anything to stdout. I followed the folder structure of your analysis on /hdfs and saw that the shell scripts which do the heavy lifting of hadding on /scratch are created on the fly and immediately passed to SLURM. In practice the file appears to be empty shortly after writing it to the storage, even though the OS has told us that this is not the case (fflush and fsync are now forced every time we write a shell script). Thus SLURM complains that it sees an empty file and doesn't submit the job, hence nothing in stdout. The reason why the text files look empty immediately after writing them could be anything, really, but the fix is to place these files on /home instead of /hdfs. This has worked for me so far in addMEM production. It should be fixed by now (see commit https://github.com/HEP-KBFI/tth-htt/commit/5e8e8d2d13825f4b40008f78eca53051b34c9887), but I haven't tested it. Now that we have a bit more logging about this in the main stdout file, you can see what sbatch actually returns once the problem pops up again.
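
For illustration, a minimal sketch of what a defensive submission helper could look like (the name submit_script and its arguments are made up here; this is not the actual sbatchManager::submit_job_version2 code):

import os
import subprocess

def submit_script(script_path, script_contents):
    # write the sbatch wrapper script and force it onto disk before submitting,
    # so that SLURM never sees a half-written (seemingly empty) file
    with open(script_path, 'w') as script_file:
        script_file.write(script_contents)
        script_file.flush()
        os.fsync(script_file.fileno())

    process = subprocess.Popen(
        ['sbatch', script_path],
        stdout = subprocess.PIPE,
        stderr = subprocess.PIPE,
    )
    stdout, stderr = process.communicate()

    # sbatch normally prints something like "Submitted batch job 12345";
    # if stdout is empty, split()[-1] would raise a bare IndexError, so fail
    # loudly with the stderr message (e.g. "sbatch: error: Batch script is empty!")
    if process.returncode != 0 or not stdout.strip():
        raise RuntimeError("sbatch failed: '%s'" % stderr.strip())
    return stdout.split()[-1]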

Regarding the second issue, I think parsing the log files makes it really complex (what do we parse? what if we miss something and carry on as if everything were normal? what about the performance penalty of parsing?). In my opinion, a better way is to handle the return code inside the shell script that is passed to SLURM. I already see something similar done in ClusterHistogramAggregator; we just have to sprinkle more $? checks inside the shell scripts (hadd might also return a non-zero exit code!). This needs some testing, since I'm not sure what the return codes will be if the program throws. Also, ROOT masks some internal errors and just prints them on screen, but we want to catch those as well; I think we have to create some kind of callback which checks the global error code of ROOT (if there is such a thing) and returns a non-zero status code accordingly. Finding out which failure scenarios we have to (or can) cover takes a bit of time, but the solution itself is really simple.

[1] /home/veelken/ttHAnalysis/2016/2017Mar13/scripts/jetToTauFakeRate/sbatch_hadd_jetToTauFakeRate_stage1_DoubleMuon_Run2016E_v1_OS.py

ktht commented 7 years ago

Alright, so I did some digging and learned that in C++ any uncaught exception calls std::terminate(), which in turn calls std::abort() (ref1, ref2). The latter raises the SIGABRT signal (signal number 6); the exit code is therefore 128 + 6 = 134. You can convince yourself by running the following snippet in your terminal:

echo "int main() { throw \"uncaught\"; return 0; }" > uncaught.cpp; \
g++ -O0 uncaught.cpp -o uncaught; \
$(./uncaught &>/dev/null); \
echo $?; \
rm uncaught*
# displays 134

As for the ROOT error messages, which we don't want to ignore (I think), there's a simple fix:

#include <TError.h> // gErrorAbortLevel, kError
#include <cstdlib>  // EXIT_SUCCESS

// ...

int
main()
{
  // abort with a non-zero exit code on any ROOT error of severity kError or worse
  gErrorAbortLevel = kError;

  // ...

  return EXIT_SUCCESS;
}

This forces ROOT to abort the program whenever an error occurs. The default level at which the binary bails out is kSysError, which is two levels higher than kError (cf. the source). If we do this in every executable that uses ROOT, the program terminates with a non-zero exit code e.g. when an input file is missing (which is the desired behavior, IMO).

The third piece of the puzzle -- hadd -- sets the exit code to 0 if it succeeded and to 1 if it didn't.

To recap:

  1. set the abort levels to error levels in every file under bin/ which deals w/ ROOT files;
  2. modify ClusterHistogramAggregator.py such that it checks the return code of hadd;
  3. modify sbatch-node.submit_job_version2_template.sh such that it checks the return code of the main binary.

By "check" I mean that if an error has occurred, the /scratch subdirectory is properly cleaned up and a non-zero return code is issued at the end of the job. The latter can then be seen via sacct, which provides another way of checking whether a job completed successfully.
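
For illustration, a minimal sketch of what checking the hadd return code on the Python side could look like (the helper name and arguments are made up; this is not the actual ClusterHistogramAggregator code):

import shutil
import subprocess
import sys

def hadd_and_check(output_histogram, input_histograms, scratch_dir):
    # run hadd on /scratch and propagate its exit code instead of ignoring it
    retcode = subprocess.call(['hadd', output_histogram] + input_histograms)
    if retcode != 0:
        # clean up the /scratch subdirectory and fail the whole job with a
        # non-zero exit code, so that sacct (and the Makefile) can see it
        shutil.rmtree(scratch_dir, ignore_errors = True)
        sys.exit(retcode)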

I can take up this task myself.

ktht commented 7 years ago

And one more thing I completely forgot: the issue of log files under /home. As you probably know, the log files become huge and eat up /home really fast (for instance, the recent addMEM batch generated 26GB worth of log and cfg files). The solution is to archive the project and delete the archived files, either manually or by running the script scripts/archive.py. The usage should be intuitive; for example:

export PATH_TO_ARCHIVE=~karl/addMEM/2016/2017Mar13
./scripts/archive.py -p $PATH_TO_ARCHIVE -v

This will tar and gzip everything under the $PATH_TO_ARCHIVE directory and place the archive there. It works only if $PATH_TO_ARCHIVE resides on /home. You can also run the script by copying it to the directory which you want to archive ($PATH_TO_ARCHIVE in our case) and simply running ./archive.py -v. Extracting is easy: just append the -u flag to the above examples.

I might look into grepping the archived files, because unarchiving is inconvenient and eats up a quarter of /home whenever you want to revisit the log files for whatever reason. Default utilities like zgrep aren't powerful enough, so it would have to be scripted.
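
For what it's worth, a minimal sketch of how such a grep could be scripted with the standard tarfile module (assuming Python 2.7 as used here; grep_archive is a made-up name and this is not part of archive.py):

import re
import sys
import tarfile

def grep_archive(archive_path, pattern):
    # scan every member of the gzipped tarball without extracting it to disk,
    # printing the matching lines together with the member name
    regex = re.compile(pattern)
    with tarfile.open(archive_path, 'r:gz') as archive:
        for member in archive:
            if not member.isfile():
                continue
            for line in archive.extractfile(member):
                if regex.search(line):
                    print('%s: %s' % (member.name, line.rstrip()))

if __name__ == '__main__':
    grep_archive(sys.argv[1], sys.argv[2])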

veelken commented 7 years ago

Hi Karl,

thank you very much for looking into this.

I am not sure I understand the comments that you make about return codes. In the case at hand, the analysis jobs finished with an exception. If I understand correctly, we need to pass the return code corresponding to the exception to the sbatchManager and then to the Makefile, right? I agree that this would be a much better approach than parsing the log-files.

I ran 'git pull' yesterday evening and reran the hadd step on /hdfs, and I now see a more detailed error message. So the problem indeed seems to be that the script for the SLURM job is not written to /hdfs "fast enough". I fully agree with your proposal to write the hadd scripts to /home. Should this already have been the case with yesterday's version of the software, or is this still something that needs to be implemented and/or validated? The detailed error message is attached below [*].

Cheers,

Christian


[*]

    output_histogram=output_histogram
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 209, in hadd_on_cluster_node
    output_dir=output_dir
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 200, in submit_job_version2
    raise IndexError("Caught an error: '%s'" % sbatch_command_result[1])
IndexError: Caught an error: 'sbatch: error: Batch script is empty! '
Traceback (most recent call last):
  File "/home/veelken/ttHAnalysis/2016/2017Mar13/scripts/jetToTauFakeRate/sbatch_hadd_jetToTauFakeRate_stage1_SingleElectron_Run2016C_v1_OS.py", line 9, in
    auxDirName='',
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 75, in hadd_in_cluster
    cluster_histogram_aggregator.create_output_histogram()
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 27, in create_output_histogram
    maximum_histograms_in_batch=self.maximum_histograms_in_batch
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 63, in aggregate
    output_histogram=output_histogram
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 209, in hadd_on_cluster_node
    output_dir=output_dir
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 200, in submit_job_version2
    raise IndexError("Caught an error: '%s'" % sbatch_command_result[1])
IndexError: Caught an error: 'sbatch: error: Batch script is empty! '
make: [/hdfs/local/ttH_2tau/veelken/ttHAnalysis/2016/2017Mar13/histograms/jetToTauFakeRate/histograms_harvested_stage1_jetToTauFakeRate_DoubleMuon_Run2016E_v1_OS.root] Error 1
make: Waiting for unfinished jobs....
make: [/hdfs/local/ttH_2tau/veelken/ttHAnalysis/2016/2017Mar13/histograms/jetToTauFakeRate/histograms_harvested_stage1_jetToTauFakeRate_SingleElectron_Run2016C_v1_OS.root] Error 1
Traceback (most recent call last):
  File "/home/veelken/ttHAnalysis/2016/2017Mar13/scripts/jetToTauFakeRate/sbatch_hadd_jetToTauFakeRate_stage1_SingleElectron_Run2016F_v1_OS.py", line 9, in
    auxDirName='',
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 75, in hadd_in_cluster
    cluster_histogram_aggregator.create_output_histogram()
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 27, in create_output_histogram
    maximum_histograms_in_batch=self.maximum_histograms_in_batch
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 63, in aggregate
    output_histogram=output_histogram
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 209, in hadd_on_cluster_node
    output_dir=output_dir
  File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 200, in submit_job_version2
    raise IndexError("Caught an error: '%s'" % sbatch_command_result[1])
IndexError: Caught an error: 'sbatch: error: Batch script is empty! '
make: [/hdfs/local/ttH_2tau/veelken/ttHAnalysis/2016/2017Mar13/histograms/jetToTauFakeRate/histograms_harvested_stage1_jetToTauFakeRate_SingleElectron_Run2016F_v1_OS.root] Error 1 '


veelken commented 7 years ago

Hi,

another analysis job, for which I store the .root files with the histograms on /home, failed yesterday.

This is what I called a "race condition" when we talked yesterday.

It seems to me that the error occurs when multiple hadd jobs start at very much the same time and try to create a directory with the same name, because the directory did not exist before the jobs were started. I think what we need to do is to make the directory creation tolerant of the directory already having been created by another job.

Does that make sense to you, Margus?

Cheers,

Christian


File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 27, in create_output_histogram output_histogram=output_histogram File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 209, in hadd_on_cluster_node maximum_histograms_in_batch=self.maximum_histograms_in_batch File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 63, in aggregate output_dir=output_dir File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 163, in submit_job_version2 output_histogram=output_histogram File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/ClusterHistogramAggregator.py", line 209, in hadd_on_cluster_node scratch_dir = self.get_scratch_dir() File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 212, in get_scratch_dir output_dir=output_dir create_if_not_exists(scratch_dir) File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/jobTools.py", line 41, in create_if_not_exists File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 163, in submit_job_version2 scratch_dir = self.get_scratch_dir() File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/sbatchManager.py", line 212, in get_scratch_dir create_if_not_exists(scratch_dir) File "/home/veelken/VHbbNtuples_8_0_x/CMSSW_8_0_19/python/tthAnalysis/HiggsToTauTau/jobTools.py", line 41, in create_if_not_exists if not os.path.exists(dir_fullpath): os.makedirs(dir_fullpath) if not os.path.exists(dir_fullpath): os.makedirs(dir_fullpath) File "/cvmfs/cms.cern.ch/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_19/external/slc6_amd64_gcc493/bin/../../../../../../external/python/2.7.6-kpegke/lib/python2.7/os.py", line 157, in makedirs File "/cvmfs/cms.cern.ch/slc6_amd64_gcc493/cms/cmssw/CMSSW_8_0_19/external/slc6_amd64_gcc493/bin/../../../../../../external/python/2.7.6-kpegke/lib/python2.7/os.py", line 157, in makedirs mkdir(name, mode) OSError: [Errno 17] File exists: '/scratch/veelken/tthAnalysis_2017-03-17' mkdir(name, mode) OSError: [Errno 17] File exists: '/scratch/veelken/tthAnalysis_2017-03-17' make: *** [/home/veelken/ttHAnalysis/2016/2017Mar16/histograms/2lss_1tau/histograms_harvested_stage1_5_2lss_1tau_FakeablewFakeRateWeights

ktht commented 7 years ago

Should this already have been the case with yesterday’s version of the software or is this still something that needs to be implemented and/or validated ?

From the error message it seems that you need to update your local repository once again and re-run the cfg creation. The current version doesn't include newline characters at the end of the stdout/stderr messages returned by run_cmd.
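
For reference, a run_cmd-style helper along these lines might look as follows (a sketch only, assuming Python 2.7; this is not the actual implementation in the repository):

import subprocess

def run_cmd(command):
    # run a shell command and return its stdout and stderr with the trailing
    # newline stripped, so that e.g. parsing the sbatch output does not break
    process = subprocess.Popen(
        command,
        shell  = True,
        stdout = subprocess.PIPE,
        stderr = subprocess.PIPE,
    )
    stdout, stderr = process.communicate()
    return stdout.rstrip('\n'), stderr.rstrip('\n')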

About the return codes: yes, I think the way to go is:

  1. the program throws upon some kind of error (e.g. an input ROOT file doesn't exist);
  2. the return code is then a non-zero number, which can be caught in the shell script and returned again;
  3. this can be further propagated to sbatchManager via sacct, which lets you check the status code of the job (see the sketch after this list);
  4. sbatchManager also exits with a non-zero return code;
  5. the Makefile should fail, because it expects a zero return code from each command it executes and fails otherwise.

TBF, I'm not entirely sure about the last point, though. Needs some testing.
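
To make point 3 concrete, here is a sketch of how the exit code of a finished job could be read back via sacct (job_exit_code is a made-up helper, not part of the current sbatchManager; sacct reports ExitCode in the form "code:signal", e.g. "0:0" or "134:0"):

import subprocess

def job_exit_code(job_id):
    # ask SLURM's accounting database for the exit code of a finished job;
    # -X restricts the output to the job allocation, -n drops the header,
    # -o ExitCode prints only the exit code column
    output = subprocess.check_output(
        ['sacct', '-X', '-n', '-o', 'ExitCode', '-j', str(job_id)]
    )
    if not output.strip():
        raise RuntimeError("No accounting information for job %s" % job_id)
    return int(output.strip().split(':')[0])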

As for the race condition, I'd vote for this solution: http://stackoverflow.com/a/600612, which is basically the same as what you proposed.

veelken commented 7 years ago

Hi Karl,

I confirm that the Makefile will stop once sbatchManager returns a non-zero error code.

Margus, could you have a look at implementing this ?

Cheers,

Christian


ktht commented 7 years ago

The race condition during the directory creation is now fixed: https://github.com/HEP-KBFI/tth-htt/commit/b98a58996269d9541eeeb56d18f302fcc42cd9bb (and tested locally, see [*]).

Coming back to the propagation of the return code: if sbatchManager catches a non-zero return code and therefore exits abruptly, should we scancel all running jobs automatically, or let them run (and kill them manually if needed)?

(P.S. It would be nice if we created a separate issue for every problem we encounter. Currently we have three different problems in this thread, and it's a bit cumbersome to discuss all of them at once.)

[*]

# stress test for the EEXIST-tolerant mkdir_p below: many processes race to
# create the same set of directories at the same time
import uuid, os, errno, multiprocessing

def mkdir_p(path):
  try:
    os.makedirs(path)
  except OSError as exc:  # Python >2.5
    if exc.errno == errno.EEXIST and os.path.isdir(path):
      pass
    else:
      raise

dirnames = [str(uuid.uuid4()) for i in range(1000)]

def work(dirnames):
  for d in dirnames:
    mkdir_p(d)

if __name__ == '__main__':

  nof_jobs = 50

  p = multiprocessing.Pool(nof_jobs)
  p.map(work, [dirnames for i in range(nof_jobs)])
  # delete the directories with
  # find . -type d | xargs -r rmdir

veelken commented 7 years ago

Hi Karl,

thank you. From my point of view there is no need to scancel all running jobs. The probability is high that many jobs will fail with the same error condition as the first job, but if they fail, they typically fail before or on the first event, so we do not waste much processing time by not scancel'ing them. Besides the implementation effort, scancel'ing all jobs may introduce a new type of problem: a temporary failure of a single job could scancel many hundreds of jobs that are not affected by the temporary failure and would have run fine (remember the problem we had in the past with some machines temporarily not being able to load CMSSW).

ktht commented 7 years ago

I tinkered with the problem, but it still needs some testing "in the wild". The solution is somewhat similar to what I've described in the above posts, but I had to reconsider some points:

  1. Instead of grepping squeue by the job IDs, I found a slightly more efficient solution: grepping by the comment field, which is assigned to every job. The output of squeue is trimmed down as much as possible so that it can be post-processed more easily (a sketch of this polling is given after this list).
  2. The comment is the same for all jobs within a job pool, but unique to each job pool. Currently the comments are randomly generated UUID strings, but in principle they could be anything unique enough (e.g. channel type + version), and they must be specified in the tthAnalyzeRun*/tthAddMEM* files under the test directory. All jobs which are generated on the fly share the same unique comment as the "parent" sbatchManager instance.
  3. The feature -- detecting jobs which finished with a non-zero return code -- had multiple problems:
    1. Even though sacct works pretty well, it might be the case that under heavy load it (or rather its backend DB that stores information about every job) simply becomes unavailable.
    2. The fallback solution to this is scontrol, but there's no easy way to reformat its huge output within the same shell command, so this must be done by sbatchManager itself (i.e. it's a bit slow).
    3. It can also happen that scontrol is unable to retrieve any information about recently finished jobs, because the information is no longer stored in its cache (I have no idea for how long or on what grounds the information is kept in the cache). There's no fallback solution to this problem, but I assume that by default the jobs finish successfully, even though we cannot verify it by the means SLURM provides (remember that there already are some safeguards in the shell wrapper and the Makefile itself).
    4. Grepping by comment becomes unviable here (especially in the sacct case), because the number of finished jobs may grow so large that reading from subprocess.PIPE becomes slow. So I had to implement the same squeue polling trick as in the original sbatchManager (grep the job IDs, but split them into reasonably sized chunks to avoid the ARG_MAX kernel limit; the previous limit of 500 job IDs was too small, though).
  4. Added some comments to the new class and made it a bit more readable than the current working version.
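
To illustrate the comment-based bookkeeping from points 1 and 2, a minimal sketch (the helper names are made up and this is not the actual new sbatchManager code; the sbatch/squeue options are taken from their man pages):

import subprocess
import uuid

def submit_with_comment(script_path, pool_comment):
    # tag the job with the pool-wide comment so that all jobs belonging to one
    # sbatchManager instance can be found with a single squeue call
    output = subprocess.check_output(
        ['sbatch', '--comment=%s' % pool_comment, script_path]
    )
    return output.split()[-1]

def running_job_ids(pool_comment):
    # '%i' prints the job id, '%k' the comment; '-h' drops the header line
    output = subprocess.check_output(['squeue', '-h', '-o', '%i %k'])
    job_ids = []
    for line in output.splitlines():
        columns = line.split(None, 1)
        if len(columns) == 2 and columns[1].strip() == pool_comment:
            job_ids.append(columns[0])
    return job_ids

# one randomly generated comment per job pool
pool_comment = str(uuid.uuid4())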

I have written a couple of test cases for it (in ./test/sbatch):

  1. Two minimal jobs that end with a zero and a non-zero exit code.
  2. 100 minimal jobs that finish with a zero exit code; 100 minimal jobs of which only the last one exits with a non-zero return code (and hence should raise an error from sbatchManager).

The tests pass successfully. The only aspect I haven't tested with this approach yet is hadding in sbatchManager. But I guess if it's going to work in Ntuple/addMEM production, we can skip that.

The next step is to actually run the solution in addMEM production. If the jobs finish without any problems regarding the new sbatchManager, then I think we can replace the original sbatchManager class with the new one.

edit: a quick test shows that scontrol holds information about a job for up to 15 minutes after it finished. We poll for recently finished jobs every 30 seconds, so we should never need a third fallback solution (unless there's a limit on the number of jobs in the cache). Quick googling about the relevant config settings in SLURM yielded nothing.

edit2: OK, so I tested again and it looks like the jobs are indeed kept in the cache for 15 minutes (plus some change). The SLURM variable that controls this is probably MinJobAge (currently set to 900, see /etc/slurm/slurm.conf). The user guide (pdf) seems to confirm my suspicions about sacct and scontrol:

SLURM provides a built-in caching capability as a redundancy mechanism in the event that communication with the database is lost. When communication with either the slurmdbd or the mysqld is lost, the SLURM controller caches accounting information. One of the key components in this approach is the usage of the directory designated by the SLURM.conf control argument 'StateSaveLocation'. If the SlurmCTLD is unable to communicate with the SlurmDBD, it utilizes an internal cache until SlurmDBD is returned to service or until the cache is filled. Observations show that there is space in this cache for approximately 5000 simple jobs.

TBF, the figure of 5000 is kind of arbitrary in my opinion. But hopefully it's a non-issue, because the number of jobs that could successfully finish in less than a minute is probably well below O(5k) anyway. And if it does become a problem, we would have to partition our jobs so that we don't submit that many in one go.
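
In case someone wants to double-check the retention window on the cluster, the setting can be read back from the running controller with scontrol show config; a small sketch (min_job_age is a made-up helper, and the exact formatting of the output line is an assumption):

import subprocess

def min_job_age():
    # read MinJobAge (seconds a completed job is kept by the controller)
    # from the running SLURM configuration
    output = subprocess.check_output(['scontrol', 'show', 'config'])
    for line in output.splitlines():
        if line.strip().startswith('MinJobAge'):
            # the line looks roughly like "MinJobAge               = 900 sec"
            return int(line.split('=')[1].split()[0])
    return None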

ktht commented 7 years ago

Switched over to the new version of sbatchManager. The tests passed, and I tried it out on addMEM & analyze_2lss_1tau jobs.