Quasars / orange-spectroscopy


Speed up readers #532

Open borondics opened 3 years ago

borondics commented 3 years ago

We use numpy.loadtxt in a lot of places and there are faster solutions. Pandas for example can be significantly faster. @markotoplak, @stuart-cls do you think we should switch over to Pandas when loading the data?

```
In [3]: %time np.loadtxt('Hermes_y_136.csv', delimiter=',')
CPU times: user 20.7 s, sys: 1.13 s, total: 21.8 s
Wall time: 22.3 s

In [4]: %time pd.read_csv('Hermes_y_136.csv', delimiter=',')
CPU times: user 4.49 s, sys: 343 ms, total: 4.83 s
Wall time: 4.91 s
```

The file in this case was ~190 MB, which is a normal FPA image.
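Switching readers does not have to change the rest of the code: `pd.read_csv` with `header=None` followed by `.to_numpy()` returns the same `float64` array that `np.loadtxt` would. A minimal sketch (the function name is illustrative, not existing Quasar API):

```python
import numpy as np
import pandas as pd

def load_csv_fast(path, delimiter=","):
    """Read a headerless numeric CSV into a NumPy array via pandas.

    pandas' C parser is typically much faster than np.loadtxt on large
    files; .to_numpy() yields the same float64 ndarray np.loadtxt would.
    """
    return pd.read_csv(path, delimiter=delimiter, header=None,
                       dtype=np.float64).to_numpy()
```

Note that for a single-row file `np.loadtxt` returns a 1-D array while this always returns 2-D, so callers relying on that squeeze would need a small adjustment.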

borondics commented 3 years ago

I discovered that np.loadtxt has an advantage at small file sizes and pd.read_csv is faster for large ones. In other words, the loading time for NumPy is linear with the file size while it is not with Pandas.

The crossover is around 1 MB, which raises an interesting question, since individual files are usually below that limit. If we want to speed up large files we should definitely switch, but this would set us back when loading series of small files with Multifile. I still need to test what this would mean for us.
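If the ~1 MB crossover holds up in testing, one way to get the best of both is to dispatch on file size. A sketch under that assumption (the threshold and function name are hypothetical, not existing code):

```python
import os
import numpy as np
import pandas as pd

# Hypothetical threshold based on the observed ~1 MB crossover:
# below it np.loadtxt wins, above it pandas' C parser wins.
SIZE_CROSSOVER_BYTES = 1_000_000

def load_numeric_csv(path, delimiter=","):
    """Pick the reader by file size; both branches return a float64 array."""
    if os.path.getsize(path) < SIZE_CROSSOVER_BYTES:
        return np.loadtxt(path, delimiter=delimiter)
    return pd.read_csv(path, delimiter=delimiter, header=None,
                       dtype=np.float64).to_numpy()
```

This keeps the small-file Multifile case fast while large FPA images get the pandas speedup, at the cost of depending on two parsers with slightly different edge-case behavior (e.g. 1-D output for single-row files from `np.loadtxt`).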

The figure below was produced in pure Python, not through the Quasar loaders. Those add some overhead, which would also be interesting to investigate and reduce.

[Figure: loading time of np.loadtxt vs pd.read_csv as a function of file size]