cedadev / nappy

NASA Ames Processing in PYthon (NAPPy) - a Python library for reading, writing and converting NASA Ames files.
BSD 3-Clause "New" or "Revised" License

Performance: readData utterly slow for files with many lines of data #57

Open FObersteiner opened 2 years ago

FObersteiner commented 2 years ago

Description

Loading data from small files completes in a reasonable amount of time. For files with many lines of data (10k+), however, readData becomes a bottleneck.

What I Did

read 4.3k lines of data, ffi1001:

%timeit myfile.readData()
67.9 ms ± 7.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

read 86.6k lines of data, ffi1001:

%timeit myfile.readData()
51.5 s ± 2.54 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's nearly a minute per file! If I wanted to load many such files, I'd have to go have a lot of coffee in the meantime ☕👾
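As a rough back-of-the-envelope check on the two timings above (a sketch, not a profiling result): if the run time scales like lines^k, the exponent k can be estimated from the two measurements. The helper name `scaling_exponent` is made up for illustration, and two data points only give a crude estimate.

```python
import math

def scaling_exponent(n_small, t_small, n_large, t_large):
    """Estimate k in t ~ n**k from two (line count, run time) measurements."""
    return math.log(t_large / t_small) / math.log(n_large / n_small)

# Mean timings reported above: 4.3k lines in 67.9 ms, 86.6k lines in 51.5 s
n_small, t_small = 4_300, 67.9e-3
n_large, t_large = 86_600, 51.5

k = scaling_exponent(n_small, t_small, n_large, t_large)
print(f"lines grew by a factor of {n_large / n_small:.1f}")
print(f"run time grew by a factor of {t_large / t_small:.0f}")
print(f"estimated exponent k = {k:.2f}")
```

An exponent near 1 would mean linear scaling. Here it comes out a little above 2, so the run time appears to grow roughly quadratically with the number of lines, which would fit the small file looking fine and the large one not.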


tracing the execution of the call to readData, I find

agstephens commented 2 years ago

@FObersteiner, I agree that we should look at this. Do you have publicly downloadable large example files that we could use in unit/integration testing?

FObersteiner commented 2 years ago

@agstephens yep, I was about to create some public sample data from our ozone instruments anyway ;-) You can find them here: https://git.scc.kit.edu/FObersteiner/pyFairoproc/-/tree/master/samples.

The one that's problematic in this context (nappy reading data) is the cl_photometer file (~86k lines of data, just one variable).