RNO-G / mattak

RNO-G dataformats
1 stars 1 forks source link

Add interface to iterate only over events of a certain trigger type #6

Closed fschlueter closed 1 year ago

fschlueter commented 1 year ago

Hi @cozzyd,

do you like this kind of feature? I think it would be useful. I am not sure if my implementation is the most elegant and efficient.

However I ran a little speed test with the following code:

    start = time.time() 
    for _ in range(100):
        d.setEntries((0, d.N()))
        for ev in d.iterate(trigger="FORCE"): 
            ei, wfs = ev
            # print(ei.triggerType)
    end = time.time() 
    print ("time:", end-start)

On the main branch pyroot took 55s and uproot 29s. On this branch it took with pyroot 30s and with uproot 29s. (Of course with the main branch you actually access more events). (The reason why uproot does get faster is also pretty obvious, maybe we can improve it)...

cozzyd commented 1 year ago

Yeah, this is a good feature to have!

Strange that your benchmark has the opposite behavior from what I have seen (though maybe I was comparing PyROOT vs the current NuRadioReco implementation rather than the two mattak implementations).

It's a bit hard to make uproot faster here because it's inefficient to get arbitrary slices of events. In the case where the trigger is rare you'll win by getting events one at a time on demand, but if it's not rare, then batching the reading probably helps. I guess we can come up with a heuristic for that.

fschlueter commented 1 year ago

Okay, I realise one can also only modify a few line in Dataset.py and does not have to touch the backends:

        if trigger is not None:
            for info, traces in self._iterate(start,stop,calibrated,max_entries_in_mem):
                if info.triggerType == trigger:
                    yield info, traces
        else:
            yield from self._iterate(start,stop,calibrated,max_entries_in_mem) 

But than actually the pyroot version does not get faster..

cozzyd commented 1 year ago

yeah, because the waveform loading is done even if you don't need it. I think the more invasive change is better. However, I can imagine another similar cut would be to cut on the trigger time, which would enable selecting cal pulsers that are synchronized to GPS. A more generic interface might be a callback on the eventInfo to decide whether or not to read the event?

For example, the interface can be

def iterate(self, start : int = 0, stop : Union[int,None] = None, calibrated: bool = False, max_entries_in_mem : int = 256, selector = Optional[Callable[EventInfo,bool]]) -> Generator[Optional[Tuple[EventInfo, numpy.ndarray]],None,None]: and you could then do something like d.iterate ( selector = lambda info : info.triggerType == 'Force' )

fschlueter commented 1 year ago

uh I agree, much nicer:)

fschlueter commented 1 year ago

here we go. Only minimal changes. Unfortunately I had trimming trailing whitespace activated... So it might be a bit cumbersome to see the changes... sorry

cozzyd commented 1 year ago

Feel free to merge when ready! Don't worry about the trailing spaces... I don't have my vim configured to remove them but I probably should.