Performance: readData utterly slow for files with many lines of data

FObersteiner commented 2 years ago

Description

Loading data from small files completes in a decent amount of time. With many lines of data (10k+), readData becomes a bottleneck.

What I Did

Reading a file with 4.3k lines of data (FFI 1001):

%timeit myfile.readData()
67.9 ms ± 7.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Reading a file with 86.6k lines of data (FFI 1001):

%timeit myfile.readData()
51.5 s ± 2.54 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

That's nearly a minute per file! If I wanted to load many such files, I'd have to go have a lot of coffee in the meantime ☕👾
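
As a rough check on how the runtime scales, the two timings above can be compared directly. The sketch below assumes the runtime follows a simple power law in the number of lines, which is only an approximation based on two measurements; the variable names are chosen for illustration.

```python
import math

# Mean timings reported above (from %timeit).
small_lines, small_seconds = 4_300, 67.9e-3
large_lines, large_seconds = 86_600, 51.5

line_ratio = large_lines / small_lines      # how many times more lines
time_ratio = large_seconds / small_seconds  # how many times slower

# Assumption: runtime ~ n**k, so k = log(time_ratio) / log(line_ratio).
exponent = math.log(time_ratio) / math.log(line_ratio)

print(f"lines x{line_ratio:.1f}, time x{time_ratio:.0f}, exponent ~{exponent:.1f}")
```

About 20 times more lines cost several hundred times more time, so the estimated exponent lands a bit above 2. That hints at something worse than linear per-line work, though two data points cannot confirm the exact behaviour.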

Tracing the execution of the call to readData, I find:

- this while loop calls the internal read method(s) for each line of data (ok)
- for NA1001, this method calls the parser readItemsFromUnknownLines (ok)
- that parser has multiple conditionals, while loops, etc. (not ok?)
- it checks for curly braces in each line of data with a regex (not ok?)
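
One possible way to reduce the per-line cost is a fast path: do a cheap substring test for `{` first, and only hand the line to the existing, slower parsing logic when a brace is actually present. The sketch below is not nappy's code. `parse_data_line` and `slow_parser` are hypothetical names, and it assumes that a line without braces contains only whitespace-separated numbers.

```python
def parse_data_line(line, slow_parser):
    """Parse one line of data, avoiding the expensive path when possible.

    `slow_parser` is a hypothetical stand-in for the existing general
    parser; it is only called for lines that contain a curly brace.
    """
    if "{" in line:
        # Rare case: delegate to the general (slower) logic.
        return slow_parser(line)
    # Common case (assumed): plain whitespace-separated numbers.
    return [float(item) for item in line.split()]


def parse_data_lines(lines, slow_parser):
    """Parse many lines of data, skipping blank ones."""
    return [parse_data_line(line, slow_parser) for line in lines if line.strip()]
```

For example, `parse_data_lines(["1.0 2.5", "3 4e2"], slow_parser=None)` never touches `slow_parser`, because neither line contains a brace. Whether this matches the semantics of readItemsFromUnknownLines would need to be checked against the actual parser.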

agstephens commented 2 years ago

@FObersteiner, I agree that we should look at this. Do you have publicly downloadable large example files that we could use in unit/integration testing?

FObersteiner commented 2 years ago

@agstephens jup, I was about to create some public sample data from our ozone instruments anyway ;-) you can find them here: https://git.scc.kit.edu/FObersteiner/pyFairoproc/-/tree/master/samples.

The one that's problematic in this context (nappy reading data) is the cl_photometer file (~86k lines of data, just one variable).

cedadev / nappy

Performance: readData utterly slow for files with many lines of data #57
