This commit adds the option to use TChains for combining the TTrees of the grid-control (GC) output files instead of merging them with `hadd`.
Usage: add `--pseudo-hadd` to the `excalibur.py` call. This only works when the SE path uses `xrootd` instead of `srm`.
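For example (the config name here is a placeholder, not a real file from this repo):

```sh
# hypothetical invocation; pass your usual config plus the new flag
excalibur.py my_analysis_config.py --pseudo-hadd
```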
The output is now only a few MB, as the source files from the job outputs (hopefully stored at NRG) are merely linked, not copied.
This gets rid of the `hadd` step and hence saves a lot of time after GC finishes: for me, `hadd` with a target on the Ceph mount can take multiple hours for a single run period (e.g. 2018 Run D, > 200 GB output), while the pseudo-hadd takes only a couple of seconds.
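As a rough sketch of the idea (not the exact code in this commit): the tree name `"Events"`, the xrootd host, and the file paths below are all placeholder assumptions.

```python
import ROOT

# Build a TChain over the remote job outputs instead of merging them.
chain = ROOT.TChain("Events")  # tree name is an assumption
for i in range(309):  # e.g. the 309 DY MC job outputs mentioned below
    chain.Add("root://xrootd.example.org//store/user/jobs/output_%d.root" % i)

# Writing the chain stores only the list of linked files plus some
# metadata, so the "merged" output stays at a few MB regardless of
# how large the job outputs are.
out = ROOT.TFile("pseudo_hadd.root", "RECREATE")
chain.Write()
out.Close()
```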
Compared to a local copy on the Ceph mount, this also yields speed-ups when reading the files, e.g. with Lumberjack (25 cores):
- DY MC sample: 364.453 s (single file on Ceph) vs. 159.888 s (309 files on NRG)
- 2018 Run D: 262.997 s (single file on Ceph) vs. 189.587 s (380 files on NRG), plus lower average CPU usage, since there is no overhead from the Ceph mount
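Reading the pseudo-hadded file back looks like opening any other ROOT file; a minimal sketch (file and tree names are the same assumptions as above, and this is not necessarily how Lumberjack reads it):

```python
import ROOT

f = ROOT.TFile.Open("pseudo_hadd.root")
chain = f.Get("Events")  # the stored TChain; data is pulled from the
                         # linked xrootd files only when entries are read
print("total entries across linked files:", chain.GetEntries())
f.Close()
```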