RNO-G / mattak

RNO-G dataformats

header/eventInfo information much slower to read than expected #19

Closed sjoerd-bouma closed 1 year ago

sjoerd-bouma commented 1 year ago

Just reading header information ('eventInfo') is much slower than expected right now. It is also significantly slower with the pyroot backend than with the uproot backend; maybe it's unintentionally reading the waveforms as well?

import numpy as np
import uproot
import mattak.Dataset
import time

mattakreader = mattak.Dataset.Dataset(
    11, 1101, 
    # data_dir ='/mnt/pnfs/inbox/',
    data_dir='/home/sb/Python/scratch/data/inbox',
    backend='pyroot'
)
t0 = time.time()
mattakreader.setEntries((0, None))
mattakreader.eventInfo()
print(f'mattak.Dataset (pyroot) took {time.time()-t0:.1f} s')

mattakreader = mattak.Dataset.Dataset(
    11, 1101, 
    # data_dir ='/mnt/pnfs/inbox/',
    data_dir='/home/sb/Python/scratch/data/inbox',
    backend='uproot'
)

t0 = time.time()
mattakreader.setEntries((0, None))
mattakreader.eventInfo()
print(f'mattak.Dataset (uproot backend) took {time.time()-t0:.1f} s')

t0 = time.time()
f = uproot.open('/home/sb/Python/scratch/data/inbox/station11/run1101/combined.root')
hdr = f['combined']['header']
_ = hdr['event_number'].array(library='np')
print(f'uproot took {time.time()-t0:.3f} s')

output

mattak.Dataset (pyroot) took 17.2 s
mattak.Dataset (uproot backend) took 1.2 s
uproot took 0.040 s
fschlueter commented 1 year ago

@cozzyd

cozzyd commented 1 year ago

first observations:

Profiling shows that the majority of the time is spent in LZMA decoding. I'm guessing the access pattern is for some reason causing trouble in the combined file, but I will have to look into it more. I bet the difference between "uproot backend" and "uproot" is just the marshalling into the EventInfo class (which is done in Python), but I would have thought that both backends use the implementation in liblzma? Maybe that's not true... (certainly there is a difference because the pyroot backend reads one entry at a time, rather than all at once as here).

But then why the difference between the combined and full files?

cozzyd commented 1 year ago

ok, I know what's going on. It has to do with the fact that the combined file is a single tree and we don't cleverly SetBranchStatus() when fetching just eventInfos; uproot must do the equivalent of that automatically somehow. I think I know how to fix it...
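
To illustrate the SetBranchStatus() idea, here is a minimal PyROOT sketch (not the actual mattak fix; the file path and the "header*" branch pattern are placeholders):

import ROOT

# open the combined file and grab the tree (path is a placeholder)
f = ROOT.TFile.Open("station11/run1101/combined.root")
tree = f.Get("combined")

# disable every branch (including the waveforms), then re-enable only the
# header branches; the wildcard pattern is an assumption about the naming
tree.SetBranchStatus("*", 0)
tree.SetBranchStatus("header*", 1)

for i in range(tree.GetEntries()):
    tree.GetEntry(i)  # only the enabled branches are read and decompressed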

cozzyd commented 1 year ago

ok, should be fixed now so it's not being dumb in mattak::Dataset.

I updated the benchmark slightly (and committed it), including having uproot load everything that goes into the EventInfo (and disabling reading status/runinfo). I also added mattak::Dataset (pyroot), which doesn't convert to EventInfo, to estimate that overhead, as well as mattak::Dataset (cling), which calls the cling C++ interpreter to measure the Python->C++ overhead (since we cross that boundary for every entry the way we use the pyroot backend, rather than once per call the way uproot does).

now getting this on the same run:

nentries: 5028
mattak.Dataset (pyroot backend) took 0.080 s
mattak::Dataset (pyroot) took 0.030 s
mattak::Dataset (cling) took 0.022 s
mattak.Dataset (uproot backend) took 2.115 s
uproot took 0.151 s

and for a full dataset:

$ export RNO_G_DATA=/data/handcarry22/rootified
$ python3 tests/benchmark_dataset.py
nentries: 102558
mattak.Dataset (pyroot backend) took 1.296 s
mattak::Dataset (pyroot) took 0.317 s
mattak::Dataset (cling) took 0.021 s
mattak.Dataset (uproot backend) took 39.720 s
uproot took 0.150 s

So there is certainly a bit of overhead from converting to EventInfo, a lot worse for uproot than for pyroot (it seems the element-wise access to what is returned from array() is expensive...), and the pyroot backend would also be somewhat faster if there were a bulk access interface in C++ (to avoid crossing the language boundary so many times).
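
As a rough illustration of the element-wise vs. bulk difference on the uproot side (MiniEventInfo and the path are made up for this example; this is not mattak code):

import uproot

f = uproot.open("station11/run1101/combined.root")  # placeholder path
hdr = f["combined"]["header"]

# bulk: one read into a contiguous NumPy array, essentially no Python overhead
event_numbers = hdr["event_number"].array(library="np")

# element-wise: building one Python object per entry is where the
# EventInfo conversion time goes
class MiniEventInfo:  # stand-in for mattak's EventInfo
    def __init__(self, event_number):
        self.eventNumber = event_number

event_infos = [MiniEventInfo(n) for n in event_numbers]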

attached is the py-spy profile for a full run (note this is py3.6, which is what runs on the rno-g server... I bet a newer python might be less slow...)

note that you probably have to download the svg and open it in your browser rather than the embedded one here for it to work properly, at least on firefox... (attachment: python3-2023-04-27T21:03:28-05:00)

fschlueter commented 1 year ago

Interesting... Do you think working with a list comprehension in uproot, i.e.

event_infos = [EventInfo(x, y, z) for x, y, z in zip(...)]
for event_info in event_infos:
    yield event_info

would make it faster?

fschlueter commented 1 year ago

Okay, not really... However, if we comment out the block

    if triggerInfo['trigger_info.radiant_trigger']:
        which = triggerInfo['trigger_info.which_radiant_trigger']
        if which == -1: 
            which = "X" 
        triggerType = "RADIANT" + str(which)
    elif triggerInfo['trigger_info.lt_trigger']:
        triggerType = "LT"
    elif triggerInfo['trigger_info.force_trigger']:
        triggerType = "FORCE"
    elif triggerInfo['trigger_info.pps_trigger']:
        triggerType = "PPS"

we save ~1/7 of the run time...
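
For what it's worth, a sketch of how that block could be vectorized if the trigger branches were read as whole arrays (this assumes an array-valued triggerInfo with the branch names used above; an illustration, not a patch):

import numpy as np

def trigger_types(triggerInfo):
    # triggerInfo is assumed to map each branch name to an array over all events
    which = triggerInfo['trigger_info.which_radiant_trigger'].astype(str)
    which[triggerInfo['trigger_info.which_radiant_trigger'] == -1] = "X"
    return np.select(
        [triggerInfo['trigger_info.radiant_trigger'].astype(bool),
         triggerInfo['trigger_info.lt_trigger'].astype(bool),
         triggerInfo['trigger_info.force_trigger'].astype(bool),
         triggerInfo['trigger_info.pps_trigger'].astype(bool)],
        [np.char.add("RADIANT", which), "LT", "FORCE", "PPS"],
        default="UNKNOWN")  # fallback label is made up for this sketch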

fschlueter commented 1 year ago

BTW, I think it would be a great feature if we could do dataset.eventInfo().get_event_numbers() or similar and get all event numbers. This interface would be super quick in uproot and could probably be made reasonably quick in pyroot?

sjoerd-bouma commented 1 year ago

Thanks Cosmin, a very quick fix! As far as I'm concerned, O(1) s for 100k events is plenty fast; I don't think we need to optimize this further. I'll close the issue, feel free to reopen it if you disagree.

@fschlueter - what's wrong with e.g.

mattakreader.setEntries((0, None))
eventinfos = mattakreader.eventInfo()
event_numbers = np.array([i.eventNumber for i in eventinfos])

The last line is practically instantaneous in comparison to the second one. I agree we probably need it internally in the NuRadioMC reader to facilitate event selection, but I don't think it's necessary to have it in mattak. But I'll leave that up to you.

fschlueter commented 1 year ago

Well, I thought that if we internally store data (such as event numbers) as arrays, we could radically improve the speed of the uproot backend. E.g., if we store:

self._event_number = self._hds['event_number'].array(**kw)

and implement a getter for self._event_number via the EventInfo class, we avoid the bottleneck in the uproot backend for this use case (when we are interested in one particular piece of information for all events).
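
Something along these lines, as a sketch only (the class and method names are invented here and not part of mattak):

import uproot

class CachedHeaders:  # invented name, not mattak's Dataset
    def __init__(self, filename, **kw):
        self._hds = uproot.open(filename)['combined']['header']
        # read the branch once and keep it as an array; kw are uproot read options
        self._event_number = self._hds['event_number'].array(**kw)

    def eventNumbers(self):
        # bulk getter: no per-entry EventInfo construction
        return self._event_number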

fschlueter commented 1 year ago

The pyroot backend will not improve from this, but it should also not hurt, right? Memory-wise it might be a disadvantage...

cozzyd commented 1 year ago

there are many ways to accomplish the same thing in the pyroot backend too :) for example ROOT.RDataFrame(d.headTree()).AsNumpy("event_number") or something like that probably works?
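
Spelled out a bit, as a sketch rather than tested code (here d stands for the mattak::Dataset instance from the suggestion above; the rest is standard PyROOT):

import ROOT

# d.headTree() is assumed to return the ROOT TTree holding the header branches
rdf = ROOT.RDataFrame(d.headTree())
cols = rdf.AsNumpy(["event_number"])  # dict mapping column name -> NumPy array
event_numbers = cols["event_number"]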

fschlueter commented 1 year ago

great. I think it is worth introducing something like that. This will always allow faster event filtering.