cms-nanoAOD / nanoAOD-tools

Tools for working with NanoAOD (requiring only python + root, not CMSSW)
42 stars 326 forks source link

Bug in use of first entry and maximum number of events: too few events #269

Closed IzaakWN closed 3 years ago

IzaakWN commented 3 years ago

When using the postprocessor with firstEntry and maxEntries that are both larger than zero, I noticed that the output tree has no events if firstEntry exceeds maxEntries. For example when you have a nanoAOD file with originally 10000 events, and you split it up into 10 jobs with

firstEntry=0, maxEntries=1000
firstEntry=1000, maxEntries=1000
firstEntry=2000, maxEntries=1000
...

and so on, only the first setting will have 1000 events. If you do

firstEntry=10, maxEntries=1000

the output tree will have 990 events and so on.

After doing some printout, it seems that this reduction happens in this line: https://github.com/cms-nanoAOD/nanoAOD-tools/blob/25a793ec55b30fe7107af263c4523f20ff1a5fbd/python/postprocessing/framework/output.py#L175-L176 (see commit https://github.com/cms-nanoAOD/nanoAOD-tools/commit/96319365d9c1eacf9e88bef9a8a035b40561d376 by @AndreasAlbert) Namely, Before the copy entry, self._tree already has self.maxEntries events (see [1-2]). If self.firstEntry is larger than 0, you will only copy events self.firstEntry up to self.maxEntries of this tree, so you lose the first self.firstEntry events.

Can this line be removed without breaking other cases? If it is needed for something else (provenance=True?), maybe self.firstEntry should be replace with 0 in this line.

[1] fullClone https://github.com/cms-nanoAOD/nanoAOD-tools/blob/25a793ec55b30fe7107af263c4523f20ff1a5fbd/python/postprocessing/framework/output.py#L126-L128

[2] not fullClone https://github.com/cms-nanoAOD/nanoAOD-tools/blob/25a793ec55b30fe7107af263c4523f20ff1a5fbd/python/postprocessing/framework/eventloop.py#L74-L76

gouskos commented 3 years ago

Thanks @IzaakWN . We are checking if we can safely remove this line.

IzaakWN commented 3 years ago

Hi @gouskos, do we already know if this line can be safely removed (in general), or if self.firstEntry can be replaced with 0? If so, I will go ahead with it in my private branch. https://github.com/cms-nanoAOD/nanoAOD-tools/blob/25a793ec55b30fe7107af263c4523f20ff1a5fbd/python/postprocessing/framework/output.py#L175-L176

gouskos commented 3 years ago

Hi @IzaakWN from the tests I did it looks safe to me to remove the line. @aalbert, since you added the line, do you have any comments removing it?

IzaakWN commented 3 years ago

@gouskos, @AndreasAlbert When removing this line, the branches that are dropped in the "keep and drop" file via outputbranchsel are not removed, so CopyTree is actually needed to get rid of the deactivated branches. Maybe a better way to fix the bug reported above, is by changing it to

self._tree = self.tree().CopyTree('1', "", 
                                   self.maxEntries if self.maxEntries else ROOT.TVirtualTreePlayer.kMaxEntries,0)

but I am not 100% this covers all cases?

AndreasAlbert commented 3 years ago

Hi apologies for being super late to the party here. I personally do not have a great understanding of this piece of code. When I initially made the changes that introduced this bug, I was mainly copy/pasting the logic from other pieces in the code. It seems to me that just removing the line altogether would revert the goal that I initially had, as @IzaakWN points out: Making sure that the keep/drop statements are also used for branches that are created by nanoaod-tools (i.e. not existant in the original inputs yet).

It seems to me that the first solution proposed by @IzaakWN is promising: Just changing the firstentry to 0. I never tested or even considered the maxEntries settings when making the original PR, so I did not realize that the tree is already prefiltered. It further seems to me that this change cannot break anything if we verify that the number of entries being processed is consistent after the change (which is what @IzaakWN seems to have done).

IzaakWN commented 3 years ago

Thanks for your input, @AndreasAlbert!

@gouskos, as far as I can tell, the output tree is already filtered by this line in all my personal use cases. Do you know what combinations of settings we would need to test to convince ourselves this change is safe for all possible cases? firstEvent=0 by default, but do we know if someone is using this option at the moment?

To make it a bit more complicated, there is another slightly related bug that happens in the special case where the user passes no modules (fullClone==True), but passes a JSON file via jsonInput, and/or a cut via cut. For sake of example, say the user passes firstEntry=100, maxEntries=100 and a JSON file, then in the preskim step, https://github.com/cms-nanoAOD/nanoAOD-tools/blob/f00eb4ea4cdec1c957c53ed1ad56eabf160c3cd0/python/postprocessing/framework/preskimming.py#L81-L84 you will have an event list elist that looks like [100,...,199] minus events not in JSON. Without a JSON or cut, this elist is None. If elist exists, this list is used to filter the input file: https://github.com/cms-nanoAOD/nanoAOD-tools/blob/f00eb4ea4cdec1c957c53ed1ad56eabf160c3cd0/python/postprocessing/framework/postprocessor.py#L190-L196 The problem is that this seems to effectively shift the event indices back to 0. Event 100 is now at index 0, event 199 is now at index 99, and so on. In this line then, https://github.com/cms-nanoAOD/nanoAOD-tools/blob/f00eb4ea4cdec1c957c53ed1ad56eabf160c3cd0/python/postprocessing/framework/output.py#L126-L130 if fullClone==True and the input tree was pre-filtered, you will end up with zero events in this example, because firstEntry=100>=100=maxEntries.

One way to solve this special case, is to reset firstEntry to 0 if fullClone and elist before passing it to FullOutput , e.g.

firstEntry = 0 if fullClone and elist else self.firstEntry
gouskos commented 3 years ago

Hi @IzaakWN @AndreasAlbert - thanks for all the investigations. I also tried all possible variations that I could think of and indeed changing firstEntry to 0 in:

self._tree = self.tree().CopyTree('1', "", 
                                   self.maxEntries if self.maxEntries else ROOT.TVirtualTreePlayer.kMaxEntries,firstEntry)

@IzaakWN I suggest you to go ahead and push this change.

Concerning this:

To make it a bit more complicated, there is another slightly related bug that happens in the special case where the user passes no modules (fullClone==True), but passes a JSON file via jsonInput, and/or a cut via cut.

I suggest we address it later. I did not have the chance to test the proposed change. I will create a separate issue pointing to this discussion

IzaakWN commented 3 years ago

Thanks you for the help and merge, @gouskos. I'll close this issue and open a separate one for the fullClone and elist case.

gouskos commented 3 years ago

Thanks to you @IzaakWN . Yes - please do so and we will follow-up.