HEP-KBFI / tautau-Entanglement

Code for studies of quantum entanglement in Z->tautau and H->tautau events

Run Ntuple production and analysis on cluster #3

Closed. ktht closed this issue 1 year ago.

ktht commented 1 year ago

Quick back-of-the-envelope calculation: if we want to process 100M events and it takes 3h to produce an Ntuple of 100k events (assuming that we don't filter by tau decay modes), then we need 3000 CPU hours of computing time, which, with the current implementation (12 parallel jobs on the host machine), would take approximately 10 days to run.
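Spelling out the arithmetic behind that estimate:

```math
\frac{100\times 10^{6}\ \text{events}}{100\times 10^{3}\ \text{events/job}} = 1000\ \text{jobs}, \qquad
1000 \times 3\,\text{h} = 3000\ \text{CPU hours}, \qquad
\frac{3000\,\text{h}}{12\ \text{parallel jobs}} \approx 250\,\text{h} \approx 10\ \text{days}.
```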

veelken commented 1 year ago

Hi Karl, good point!

What I believe needs to be done is to extend the script [1] such that it generates the shell scripts for the batch submission. The config files for cmsRun don't need to be changed when we run the Ntuple production jobs on the batch system, I believe. The shell scripts for the batch submission need to contain the lines that currently get written to the Makefile [2]. I suggest you add a boolean flag to the script that allows switching between Makefile and batch submission mode.

The script for the analysis jobs [3] probably needs to be extended in the same way. One concern that I have is that the estimation of statistical uncertainties with bootstrapping takes a significant amount of computing time, and I don't see an easy way to parallelize the bootstrapping. We can gain a factor of 10 in speed by reducing the number of toys from 1000 to 100, by changing the line [4] in the config file. I hope this factor of 10 will be sufficient to run the analysis jobs in a reasonable time without changes to the C++ code.

[1] https://github.com/HEP-KBFI/tautau-Entanglement/blob/main/test/produceNtuples.py
[2] https://github.com/HEP-KBFI/tautau-Entanglement/blob/main/test/produceNtuples.py#L102-L107
[3] https://github.com/HEP-KBFI/tautau-Entanglement/blob/main/test/analyzeNtuples.py
[4] https://github.com/HEP-KBFI/tautau-Entanglement/blob/main/test/analyzeEntanglementNtuple_cfg.py#L45
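To make the factor-10 scaling explicit: the bootstrap runtime is essentially proportional to the number of toys, because each toy resamples the events and runs its own fit. A schematic C++ sketch (the names and the placeholder "fit" are illustrative, not the actual analysis code):

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Schematic stand-in for the per-event payload (the real Data class holds 7 floats).
struct Data { float values[7]; };

// Placeholder "fit": in the real analysis each toy runs an actual fit,
// which is what makes the bootstrap expensive.
double runFit(const std::vector<std::size_t>& toyIndices, const std::vector<Data>& events)
{
  double sum = 0.;
  for ( std::size_t idx : toyIndices )
    sum += events[idx].values[0];
  return toyIndices.empty() ? 0. : sum / toyIndices.size();
}

// Total cost ~ numToys x (N resampling draws + one fit per toy), so reducing
// numToys from 1000 to 100 cuts the runtime by roughly a factor of 10.
std::vector<double> bootstrapEstimates(const std::vector<Data>& events, std::size_t numToys)
{
  std::mt19937 rng(12345);
  std::uniform_int_distribution<std::size_t> pick(0, events.size() - 1);
  std::vector<double> estimates;
  estimates.reserve(numToys);
  for ( std::size_t toy = 0; toy < numToys; ++toy )
  {
    std::vector<std::size_t> toyIndices(events.size());
    for ( std::size_t& idx : toyIndices )
      idx = pick(rng);                                  // resample events with replacement
    estimates.push_back(runFit(toyIndices, events));    // one fit per toy dominates the runtime
  }
  return estimates;
}
```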

ktht commented 1 year ago

Some changes:

ktht commented 1 year ago

As I anticipated, accessing the data in bootstrap samples via pointers or indices instead of const references (as is done right now) did not reduce the runtime whatsoever. In fact, I got a more noticeable (5%) reduction in runtime by compiling the code with more aggressive optimization flags, which are now the default. As far as speedup is concerned, I don't see what else could be done here (aside from distributing the bootstrap sampling over multiple jobs, but that's a major task), so I'm closing the issue.

ktht commented 1 year ago

It turns out that the analysis jobs consume quite a bit of memory, to the point that they die when run on the cluster. There seems to be a strong correlation with tau decay modes: pi_pi jobs finished, whereas rho_rho jobs consume the most memory (3 GB). I'm not entirely sure we can do anything about it, though, because the memory consumption scales with the amount of data we're reading in. Some jobs even fail while reading the data. I'll run some jobs locally to determine the peak memory consumption. At the moment I'm contemplating changing the memory limit on all jobs other than pi_pi when run on the cluster.

veelken commented 1 year ago

Hi Karl,

I believe the issue is that the analysis code [1] creates one object of type Data [2] for each event that is read from the Ntuples. The Data object holds 7 floating-point numbers (28 bytes). If our MC sample contains 200 million events, there will be about 12.5 million events in the rho_rho decay channel, resulting in a memory consumption of 350 MB. This is a sizeable amount of memory, but still below the 2 GB limit for regular batch jobs.
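In numbers:

```math
12.5\times 10^{6}\ \text{events} \times 28\ \text{B/event} = 3.5\times 10^{8}\ \text{B} \approx 350\ \text{MB}.
```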

Please confirm that you have commented out lines 292, 296, 297 of the analysis code. My understanding is that std::vector::push_back() creates additional copies of the Data objects, which will increase the memory consumption of the analysis jobs by a factor of 4 if these lines are not commented out.

[1] https://github.com/HEP-KBFI/tautau-Entanglement/blob/main/bin/analyzeEntanglementNtuple.cc#L288
[2] https://github.com/HEP-KBFI/tautau-Entanglement/blob/main/interface/Data.h
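For illustration, a schematic example of the concern (not the actual code at lines 292, 296, 297):

```cpp
#include <vector>

struct Data { float values[7]; };             // 28 bytes per event, as above

void fillVectors(const std::vector<Data>& entries)
{
  std::vector<Data> copy1, copy2, copy3;      // e.g. three additional containers
  for ( const Data& entry : entries )
  {
    copy1.push_back(entry);                   // push_back stores a copy of the 28-byte object,
    copy2.push_back(entry);                   // so together with the original entries the
    copy3.push_back(entry);                   // memory footprint grows by roughly a factor of 4
  }
}
```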

ktht commented 1 year ago

Indeed, if you push into std::vector<> then it creates a copy, but I didn't realize we were making unnecessary copies. The lines you highlighted turned into comments after I merged your changes from the 3prong branch into the main branch, so no superfluous copies of Data are created now. I tested the code by running an ML fit job on rho-rho events that have been reconstructed using the kinematic fit. Here's a plot that shows the memory consumption as a function of runtime: [plot: PrMon_wtime_vs_vmem_pss_rss_swap_after_all_final]

As seen from the plot, the decay mode with the largest branching ratio requires less than 1.5 GB of RSS memory. However, the plot also shows that it took roughly 18h to finish. Although that's well within the 48h limit of our cluster, the cluster nodes are significantly slower than the main host. I tried to speed things up by reading only the relevant branches from the input Ntuples (see 9a7f801e5774e097824860f7fbe9c56df501911a), but as seen from the above plot, it takes an hour or less to transfer the data from disk into memory (or just a few minutes (!) if the data is cached by the file system), which constitutes a small fraction of the total runtime. The long runtime seems to be a feature of the ML fit and the 1D cross section method (which also runs some fits), since the other methods are >10x faster.
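For reference, the usual ROOT pattern for this kind of optimization looks roughly as follows (the tree and branch names here are placeholders, not the actual Ntuple layout):

```cpp
#include <TFile.h>
#include <TTree.h>

void readSelectedBranches(const char* fileName)
{
  TFile* inputFile = TFile::Open(fileName);
  TTree* tree = dynamic_cast<TTree*>(inputFile->Get("rho_rho"));  // placeholder tree name

  tree->SetBranchStatus("*", 0);                // disable all branches first
  tree->SetBranchStatus("someBranch", 1);       // then enable only the branches the fit needs
  Float_t someBranch = 0.f;
  tree->SetBranchAddress("someBranch", &someBranch);

  for ( Long64_t i = 0; i < tree->GetEntries(); ++i )
  {
    tree->GetEntry(i);                          // only enabled branches are read from disk
    // ... fill the in-memory objects here ...
  }
  inputFile->Close();
}
```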

My other concern is that when jobs are run not on a specific tau decay mode but on all fully hadronic decay modes (i.e., on those events that we read from the had_had tree), the jobs might still exceed the 2 GB limit. I haven't been able to test this yet because I discovered that the had_had tree contained every event of the original sample and not just the events where both taus decay hadronically. I've now fixed the problem in 17b6dcff031736c6f51b6ac58ea82a59b79e6519, but I'd still have to produce new Ntuples. Preliminary testing and back-of-the-envelope calculations tell me that the jobs might need at least 3 GB of RSS memory, however. The only place where we could be more memory-efficient (at least in theory) is where we build the bootstrap samples: we could use pointers, indices, or, as a more elegant solution, std::reference_wrapper (see the sketch below). If that doesn't pan out, then we could just run those particular jobs locally, since we only need to run two jobs to obtain the results we need for the paper.
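For the record, the std::reference_wrapper variant I have in mind would look roughly like this (assuming the bootstrap samples currently hold full copies of the Data objects; the structure below is illustrative, not the actual code):

```cpp
#include <cstddef>
#include <functional>   // std::reference_wrapper, std::cref
#include <random>
#include <vector>

struct Data { float values[7]; };   // 28 bytes per event

// Each entry of the sample is an 8-byte handle instead of a 28-byte copy,
// so a sample of N events costs ~8*N bytes on top of the original data.
std::vector<std::reference_wrapper<const Data>>
makeBootstrapSample(const std::vector<Data>& events, std::mt19937& rng)
{
  std::uniform_int_distribution<std::size_t> pick(0, events.size() - 1);
  std::vector<std::reference_wrapper<const Data>> sample;
  sample.reserve(events.size());
  for ( std::size_t i = 0; i < events.size(); ++i )
    sample.push_back(std::cref(events[pick(rng)]));   // resample with replacement, no copies
  return sample;                                      // sample[i].get() behaves like const Data&
}
```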

ktht commented 1 year ago

Final testing revealed that the only jobs running out of memory are those that run on all hadronic decay modes in one go (i.e., had_had). Those jobs have to be run locally, while all other jobs can be run on the cluster. I've changed test/analyzeNtuples.py such that by default it runs on each decay mode separately, except for the fully hadronic one. The user is expected to specify -d had_had -j local when they want to run those jobs locally.