cta-observatory / pycorsikaio

Python reader for CORSIKA binary file format
MIT License

Performance on huge files #26

Closed HealthyPear closed 11 months ago

HealthyPear commented 1 year ago

I am dealing with ~135 GB particle files and am wondering about the best way to work with them.

The code I used is the following:

from corsikaio import CorsikaParticleFile

input_file = "[....]/DAT100001"

with CorsikaParticleFile(input_file) as f:
    for event in f:
        if event.header['event_number'] == 2:
            break

I then used cProfile to produce the following profile file:

test_pycorsikaio_simplest.prof.zip

which can be opened with, e.g., SnakeViz.

My ideal solution would be to read the file in multi-threaded chunks, but given this is Corsika I am not sure if and how it can be done.

maxnoe commented 1 year ago

Just to comment here what I also wrote in slack:

The main issue is that we read the files 273 bytes at a time, resulting in a lot of system calls that switch between userland and kernel space.

This could be much improved by reading the data in larger chunks: e.g. some large multiple of 273 bytes for files without the "buffer size" field, and the actual buffer size for files that do have one.
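A minimal sketch of the chunked-reading idea described above, assuming fixed-size records; `RECORD_SIZE`, `CHUNK_RECORDS`, and `iter_records` are illustrative names and constants, not the library's actual implementation.

```python
RECORD_SIZE = 273      # record size discussed in this thread, in bytes
CHUNK_RECORDS = 4096   # records fetched per read() call

def iter_records(path):
    """Yield RECORD_SIZE-byte records, reading the file in large chunks
    to avoid one system call per record."""
    with open(path, "rb") as f:
        leftover = b""
        while True:
            data = f.read(RECORD_SIZE * CHUNK_RECORDS)
            if not data:
                break  # EOF; any trailing partial record is dropped
            chunk = leftover + data
            n_full = len(chunk) // RECORD_SIZE
            for i in range(n_full):
                yield chunk[i * RECORD_SIZE:(i + 1) * RECORD_SIZE]
            leftover = chunk[n_full * RECORD_SIZE:]
```

The per-record parsing stays the same; only the number of `read()` system calls drops, by roughly a factor of `CHUNK_RECORDS`.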

I can't promise when I will be able to try that out, feel free to try yourself and open a PR.

I doubt multi-threading will help much here; CORSIKA files are sequential, and you need to look for the markers in the first 4 bytes of every chunk (RUNH / EVTH / EVTE / LONGI / RUNE).
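An illustrative check for those markers; the marker list follows the comment above (taking the first four bytes of each name), while `block_type` and the 273-byte block layout are assumptions for the sketch, not the library's API.

```python
# First four bytes of each marker named in this thread.
MARKERS = {b"RUNH", b"EVTH", b"EVTE", b"LONG", b"RUNE"}

def block_type(block):
    """Return the decoded marker if the block starts with a known marker,
    otherwise None (i.e. the block carries particle data)."""
    marker = bytes(block[:4])
    return marker.decode("ascii") if marker in MARKERS else None
```

Since a block's meaning depends on the markers seen before it, a reader has to walk the blocks in order, which is why splitting the file into independent multi-threaded chunks is not straightforward.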

HealthyPear commented 1 year ago

In this regard, but also for simpler use cases, what about adding some computing benchmarks to the CI, using files stored via Git LFS?

maxnoe commented 11 months ago

The main issue here was addressed: we now read much larger blocks than 273 bytes from the filesystem, which has resulted in a speedup: #29

I am closing this. If performance is still an issue, please provide profiling information in a new issue.