CoffeaTeam / coffea

Basic tools and wrappers for enabling not-too-alien syntax when running columnar Collider HEP analysis.
https://coffeateam.github.io/coffea/
BSD 3-Clause "New" or "Revised" License

Possible strange behaviour happening with chunking. #1142

Open jbrewster7 opened 2 months ago

jbrewster7 commented 2 months ago

Hello, I am using the coffea.nanoevents.NanoEventsFactory.from_root function from coffea 2024.5.0, and I am specifying chunking as defined in https://github.com/scikit-hep/uproot5/blob/v5.1.2/src/uproot/_dask.py#L109-L132 (as suggested in coffea). I am running this on lxplus, reading files from EOS over XRootD. I am running into something that I find odd, though it may just be behaving differently than I expect.
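
For concreteness, a minimal sketch of the kind of setup being described (the EOS path and the computed quantity are placeholders, and it assumes `step_size` can be forwarded to `uproot.dask` through `uproot_options`):

```python
import dask
from coffea.nanoevents import NanoEventsFactory, NanoAODSchema

# Placeholder path; in practice this is a long list of EOS files read over XRootD.
fname = "root://eosuser.cern.ch//eos/user/.../sample.root"

events = NanoEventsFactory.from_root(
    {fname: "Events"},
    schemaclass=NanoAODSchema,
    # Forwarded to uproot.dask: roughly 10000 events per chunk here.
    uproot_options={"step_size": 10_000},
).events()

# Placeholder computation; the crash described below happens at compute time.
(result,) = dask.compute(events.Muon.pt)
```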

Initially, I arbitrarily chose chunks of 10000 events (equivalent to about 16 MB in the ROOT file). This worked until I started processing a larger number of files: with more total files, my RAM would fill up and my script would crash during dask.compute(). With smaller chunks it crashed faster (the smaller I made the chunks, the sooner the crash). I ended up having to increase my chunk size by a factor of 10 for it not to crash.

Could this be happening because, with chunks this small, the amount of file I/O required overwhelms the RAM? Or is this possibly a bug in either coffea or uproot?
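
A sketch of one way to narrow this down, reusing the `events` object from the snippet above and assuming `dask.distributed` is installed: running the same compute under a local distributed scheduler with an explicit per-worker memory cap makes it easier to see whether worker memory really is being exhausted.

```python
import dask
from dask.distributed import Client

# Unlike the default threaded scheduler, a distributed Client enforces a
# per-worker memory limit and exposes a dashboard for monitoring usage.
client = Client(n_workers=4, threads_per_worker=1, memory_limit="2GiB")
print(client.dashboard_link)

# Same placeholder computation as above; if tiny chunks inflate task-graph
# or I/O overhead, that should appear as steadily climbing worker memory.
(result,) = dask.compute(events.Muon.pt)
```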

Thanks for your help!

lgray commented 2 months ago

Hello, can you post some of the code that causes this behavior? If you can isolate it all in a simple reproducer, that will help us identify the cause more quickly.

NJManganelli commented 2 months ago

I saw similar behavior while doing the coffea-casa scale tests a few weeks ago. Very small chunk sizes (initially a bug where I accidentally passed an O(100) number as the step size instead of steps_per_file), presumably small fractions of TBasket sizes, seemed to lead to a serious struggle. I haven't followed up on that yet (and probably can't for the next couple of weeks), but I intend to scan over it for v1.1 of my simple-benchmark code.
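
For clarity, a sketch of the two easily confused knobs in uproot.dask (hypothetical file name, illustrative values):

```python
import uproot

# steps_per_file: how many chunks each file is split into.
arr1 = uproot.dask({"sample.root": "Events"}, steps_per_file=100)

# step_size: the size of each chunk, as an entry count or a memory string.
# An O(100) number here means ~100 *events* per chunk, typically a small
# fraction of a TBasket, so the same baskets can end up being read and
# decompressed repeatedly across many tiny tasks.
arr2 = uproot.dask({"sample.root": "Events"}, step_size=100)
```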