Optimize Trigger Ntuples Trigger Storage

kkrizka commented 6 years ago

I am looking at reducing the size of my ntuples. I made some quick plots looking at the space different branches take (via TBranch::GetTotalSize()). I split the branches into categories based on the word before the first _. If the word is not jet, fatjet, muon, el or ph, then it is put into the event category.
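The categorisation described above can be sketched roughly as follows. This is only an illustration, not the script that produced the plots: the function names `branchCategory` and `sizeByCategory` are made up, and the per-branch sizes are passed in directly rather than read with `TBranch::GetTotalSize()`, so that the snippet does not depend on ROOT.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Category of a branch: the word before the first '_' if it is one of the
// known object prefixes, otherwise "event". (Hypothetical helper.)
std::string branchCategory(const std::string& branchName) {
    assert(!branchName.empty());
    static const std::set<std::string> objects = {"jet", "fatjet", "muon", "el", "ph"};
    // If there is no '_', find() returns npos and substr() yields the whole name.
    const std::string prefix = branchName.substr(0, branchName.find('_'));
    return objects.count(prefix) ? prefix : std::string("event");
}

// Sum branch sizes (in bytes) per category. In a real ntuple the sizes would
// come from TBranch::GetTotalSize(); here they are plain inputs.
std::map<std::string, std::uint64_t> sizeByCategory(
    const std::vector<std::pair<std::string, std::uint64_t>>& branches) {
    std::map<std::string, std::uint64_t> totals;
    for (const auto& b : branches) {
        totals[branchCategory(b.first)] += b.second;
    }
    return totals;
}
```

Dividing each total by the sum over all categories gives the fractions quoted below.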

I put the composition of my data ntuples at the bottom. The event category takes up about 20% of the ntuples. Of that, over half is taken up by triggerNames (I run with #1184 applied, the branch is isPassedBitsNames in master). Probably not too surprising, since each trigger is stored as a lengthy set of characters (up to 20 for the large-R jet triggers). If you have several triggers, things add up...

Might be worth rethinking how the trigger information is stored. My first thought is to have a boolean branch per trigger named triggername (or a float triggername_prescale). Similar to what the old NTUP_COMMON used. Might be faster, since one does not have to do a linear search through a list to determine a trigger decision. Not sure how nice this would be if the complete trigger list is not known at run time (i.e. triggers added/removed for the different data periods).
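To make the comparison concrete, here is a rough sketch of the two layouts. Everything in it is an assumption for illustration: the struct names, the helper functions, and the trigger names `HLT_example_a` / `HLT_example_b` are placeholders, not real menu items or xAODAnaHelpers code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Current layout: each event carries the names of the triggers that fired.
struct EventWithNames {
    std::vector<std::string> passedTriggers;
};

// Proposed layout: one boolean (plus optionally one prescale) per trigger.
// In the ntuple each member would be its own branch.
struct EventWithFlags {
    bool  HLT_example_a = false;
    float HLT_example_a_prescale = 1.0f;
    bool  HLT_example_b = false;
    float HLT_example_b_prescale = 1.0f;
};

// Trigger decision with the name list: a linear search over strings.
bool passed(const EventWithNames& ev, const std::string& trigger) {
    assert(!trigger.empty());
    return std::find(ev.passedTriggers.begin(), ev.passedTriggers.end(), trigger)
           != ev.passedTriggers.end();
}

// Convert one event to the flag layout (prescales left at their defaults).
EventWithFlags toFlags(const EventWithNames& ev) {
    EventWithFlags out;
    out.HLT_example_a = passed(ev, "HLT_example_a");
    out.HLT_example_b = passed(ev, "HLT_example_b");
    return out;
}

// Uncompressed character payload of the names stored for one event.
std::size_t namePayloadBytes(const EventWithNames& ev) {
    std::size_t n = 0;
    for (const auto& name : ev.passedTriggers) n += name.size();
    return n;
}
```

With flags, the decision is a direct member (branch) read, and the per-event cost no longer depends on the length of the trigger names. The drawback is visible too: the set of members is fixed at compile or configuration time.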

@kratsg @ntadej Thoughts? Maybe I am the only one who stores a lot of trigger decisions (~50)....

[Two images hosted on Imgur: ntuple size composition by branch category]

kratsg commented 6 years ago

Yeah, I'm not sure how easily feasible this is. Not many people are going to be good enough to be able to do trigger bit decisions at the ntuple level, especially for those who are joining ATLAS now. In most analyses, I only see ~5-10 trigger decisions being stored. Storing 50 does seem like a lot... It's an interesting thought. If you already specify the list of triggers you want to store, is it possible to store a function that calculates the trigger bit given a series of trigger names, and then you can search for that?

fscutti commented 6 years ago

Hi @kratsg, are you suggesting to add the output of this function in addition to what @kkrizka suggests? I feel like just adding this output would reduce the freedom of the user downstream to experiment with different trigger lists. This is especially true if common ntuples are produced in an analysis. If we decide on this combined approach, may I suggest storing a vector for each trigger, where the first element is the trigger bit and the second the prescale?

kratsg commented 6 years ago

@fscutti so no. What this effectively amounts to is requiring a consistent way of mapping input triggers to a fixed vector of trigger strings so that you just store a vector of prescales per event knowing that the order of the vector is well-defined... similarly with trigger bits for passing. The question really is, how do we sort/predetermine that order in an entirely generic / configurable way that doesn't place undue burden on the end user?
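One possible way to get a well-defined order, sketched under the assumption that sorting the configured names alphabetically is acceptable. The function names are hypothetical and this is not existing xAODAnaHelpers code; prescales would be stored in a second vector using the same indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Derive the order from the configured trigger list: sort and drop duplicates,
// so the result does not depend on how the user wrote the list.
std::vector<std::string> canonicalOrder(std::vector<std::string> configured) {
    std::sort(configured.begin(), configured.end());
    configured.erase(std::unique(configured.begin(), configured.end()), configured.end());
    return configured;
}

// Position of a trigger in the canonical order, or -1 if it is not configured.
int triggerIndex(const std::vector<std::string>& order, const std::string& name) {
    assert(std::is_sorted(order.begin(), order.end()));
    const auto it = std::lower_bound(order.begin(), order.end(), name);
    return (it != order.end() && *it == name) ? static_cast<int>(it - order.begin()) : -1;
}

// Per-event decision vector: element i is 1 if trigger order[i] fired.
// Names that are not in the configured list are ignored.
std::vector<std::uint8_t> encode(const std::vector<std::string>& order,
                                 const std::vector<std::string>& fired) {
    std::vector<std::uint8_t> bits(order.size(), 0);
    for (const auto& name : fired) {
        const int i = triggerIndex(order, name);
        if (i >= 0) bits[static_cast<std::size_t>(i)] = 1;
    }
    return bits;
}
```

The ordered name list itself would still have to be written once per file so that readers can decode the vectors, and the order changes whenever the configured list changes.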

An example is to provide a python script that parses the config.py/config.json someone uses, extracts the trigger, and provides the necessary order... but then keeping that up to date with the C++ code becomes somewhat hard to do.

The other option might be to use a friend tree -- where the friend tree has a single row listing the trigger strings, and if you want to get the trigger names into your trees, just add a friend tree to link things up (join).

beojan commented 6 years ago

You could use a std::unordered_map instead of a vector. Then you would only need a single, general map of trigger names to numeric IDs.
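One reading of this suggestion, as a minimal sketch: a single registry assigns each trigger name a small numeric ID, and events then store IDs (or an `std::unordered_map` from ID to prescale) instead of strings. The class name `TriggerRegistry` and the choice of `std::uint16_t` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical registry mapping trigger names to numeric IDs and back.
class TriggerRegistry {
public:
    // Return the ID for a name, assigning the next free ID on first use.
    std::uint16_t idFor(const std::string& name) {
        const auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        // Guard against running out of 16-bit IDs.
        assert(names_.size() < std::numeric_limits<std::uint16_t>::max());
        const auto id = static_cast<std::uint16_t>(names_.size());
        ids_.emplace(name, id);
        names_.push_back(name);
        return id;
    }

    // Reverse lookup; throws std::out_of_range for an unknown ID.
    const std::string& nameFor(std::uint16_t id) const { return names_.at(id); }

    std::size_t size() const { return names_.size(); }

private:
    std::unordered_map<std::string, std::uint16_t> ids_;
    std::vector<std::string> names_;
};

// Per-event storage could then be keyed by ID, e.g. ID -> prescale.
using EventPrescales = std::unordered_map<std::uint16_t, float>;
```

Because IDs are assigned on first use, triggers that only appear in some data periods do not need to be known up front, but the registry has to be saved alongside the events.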

You could even map from an enum class, though this would require providing a (trivial) specialization of std::hash to be C++11 compatible.
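For reference, the specialization would look roughly like this. The `Trigger` enum and its two enumerators are placeholders; in C++14 and later the standard library already hashes enum types, so the specialization is only required for C++11.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <type_traits>
#include <unordered_map>

// Hypothetical enum; a real one would need an entry per trigger.
enum class Trigger : unsigned { ExampleA, ExampleB };

namespace std {
// Hash the enum through its underlying integer type.
template <>
struct hash<Trigger> {
    std::size_t operator()(Trigger t) const noexcept {
        using U = std::underlying_type<Trigger>::type;
        return std::hash<U>{}(static_cast<U>(t));
    }
};
}  // namespace std

using TriggerPrescales = std::unordered_map<Trigger, float>;

// Look up a prescale; the trigger must have been stored for this event.
float prescaleOf(const TriggerPrescales& prescales, Trigger t) {
    const auto it = prescales.find(t);
    assert(it != prescales.end());
    return it->second;
}
```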

beojan commented 6 years ago

Edit isn't working.

enum class is probably a bad idea, given the sheer number of triggers there are.

kkrizka commented 6 years ago

Hi all,

I was not proposing to have a single bit string for triggers. I was thinking of a different branch per trigger decision, similar to what was used in the Run 1 ntuples.

-- Karol Krizka

UCATLAS / xAODAnaHelpers

Optimize Trigger Ntuples Trigger Storage #1189