Closed IzaakWN closed 3 years ago
Thanks @IzaakWN . We are checking if we can safely remove this line.
Hi @gouskos, do we already know if this line can be safely removed (in general), or if self.firstEntry
can be replaced with 0
? If so, I will go ahead with it in my private branch.
https://github.com/cms-nanoAOD/nanoAOD-tools/blob/25a793ec55b30fe7107af263c4523f20ff1a5fbd/python/postprocessing/framework/output.py#L175-L176
Hi @IzaakWN from the tests I did it looks safe to me to remove the line. @aalbert, since you added the line, do you have any comments removing it?
@gouskos, @AndreasAlbert When removing this line, the branches that are dropped in the "keep and drop" file via outputbranchsel
are not removed, so CopyTree
is actually needed to get rid of the deactivated branches. Maybe a better way to fix the bug reported above, is by changing it to
self._tree = self.tree().CopyTree('1', "",
self.maxEntries if self.maxEntries else ROOT.TVirtualTreePlayer.kMaxEntries,0)
but I am not 100% this covers all cases?
Hi apologies for being super late to the party here. I personally do not have a great understanding of this piece of code. When I initially made the changes that introduced this bug, I was mainly copy/pasting the logic from other pieces in the code. It seems to me that just removing the line altogether would revert the goal that I initially had, as @IzaakWN points out: Making sure that the keep/drop statements are also used for branches that are created by nanoaod-tools (i.e. not existant in the original inputs yet).
It seems to me that the first solution proposed by @IzaakWN is promising: Just changing the firstentry to 0. I never tested or even considered the maxEntries settings when making the original PR, so I did not realize that the tree is already prefiltered. It further seems to me that this change cannot break anything if we verify that the number of entries being processed is consistent after the change (which is what @IzaakWN seems to have done).
Thanks for your input, @AndreasAlbert!
@gouskos, as far as I can tell, the output tree is already filtered by this line in all my personal use cases. Do you know what combinations of settings we would need to test to convince ourselves this change is safe for all possible cases? firstEvent=0
by default, but do we know if someone is using this option at the moment?
To make it a bit more complicated, there is another slightly related bug that happens in the special case where the user passes no modules (fullClone==True
), but passes a JSON file via jsonInput
, and/or a cut via cut
.
For sake of example, say the user passes firstEntry=100, maxEntries=100
and a JSON file, then in the preskim step,
https://github.com/cms-nanoAOD/nanoAOD-tools/blob/f00eb4ea4cdec1c957c53ed1ad56eabf160c3cd0/python/postprocessing/framework/preskimming.py#L81-L84 you will have an event list elist
that looks like [100,...,199]
minus events not in JSON. Without a JSON or cut, this elist
is None
. If elist
exists, this list is used to filter the input file:
https://github.com/cms-nanoAOD/nanoAOD-tools/blob/f00eb4ea4cdec1c957c53ed1ad56eabf160c3cd0/python/postprocessing/framework/postprocessor.py#L190-L196 The problem is that this seems to effectively shift the event indices back to 0. Event 100 is now at index 0, event 199 is now at index 99, and so on. In this line then,
https://github.com/cms-nanoAOD/nanoAOD-tools/blob/f00eb4ea4cdec1c957c53ed1ad56eabf160c3cd0/python/postprocessing/framework/output.py#L126-L130 if fullClone==True
and the input tree was pre-filtered, you will end up with zero events in this example, because firstEntry=100>=100=maxEntries
.
One way to solve this special case, is to reset firstEntry
to 0 if fullClone and elist
before passing it to FullOutput
, e.g.
firstEntry = 0 if fullClone and elist else self.firstEntry
Hi @IzaakWN @AndreasAlbert - thanks for all the investigations. I also tried all possible variations that I could think of and indeed changing firstEntry to 0 in:
self._tree = self.tree().CopyTree('1', "", self.maxEntries if self.maxEntries else ROOT.TVirtualTreePlayer.kMaxEntries,firstEntry)
@IzaakWN I suggest you to go ahead and push this change.
Concerning this:
To make it a bit more complicated, there is another slightly related bug that happens in the special case where the user passes no modules (fullClone==True), but passes a JSON file via jsonInput, and/or a cut via cut.
I suggest we address it later. I did not have the chance to test the proposed change. I will create a separate issue pointing to this discussion
Thanks you for the help and merge, @gouskos. I'll close this issue and open a separate one for the fullClone and elist
case.
Thanks to you @IzaakWN . Yes - please do so and we will follow-up.
When using the postprocessor with
firstEntry
andmaxEntries
that are both larger than zero, I noticed that the output tree has no events iffirstEntry
exceedsmaxEntries
. For example when you have a nanoAOD file with originally 10000 events, and you split it up into 10 jobs withand so on, only the first setting will have 1000 events. If you do
the output tree will have 990 events and so on.
After doing some printout, it seems that this reduction happens in this line: https://github.com/cms-nanoAOD/nanoAOD-tools/blob/25a793ec55b30fe7107af263c4523f20ff1a5fbd/python/postprocessing/framework/output.py#L175-L176 (see commit https://github.com/cms-nanoAOD/nanoAOD-tools/commit/96319365d9c1eacf9e88bef9a8a035b40561d376 by @AndreasAlbert) Namely, Before the copy entry,
self._tree
already hasself.maxEntries
events (see [1-2]). Ifself.firstEntry
is larger than 0, you will only copy eventsself.firstEntry
up toself.maxEntries
of this tree, so you lose the firstself.firstEntry
events.Can this line be removed without breaking other cases? If it is needed for something else (
provenance=True
?), maybeself.firstEntry
should be replace with0
in this line.[1]
fullClone
https://github.com/cms-nanoAOD/nanoAOD-tools/blob/25a793ec55b30fe7107af263c4523f20ff1a5fbd/python/postprocessing/framework/output.py#L126-L128[2]
not fullClone
https://github.com/cms-nanoAOD/nanoAOD-tools/blob/25a793ec55b30fe7107af263c4523f20ff1a5fbd/python/postprocessing/framework/eventloop.py#L74-L76