Closed danieldjewell closed 3 years ago
Hi Daniel, It is kind of you to have done all of this research for the project. I didn't look into it at the time, but Pandas has since made some changes to their CSV parser, so I took another look today.
It is important to realize that speed is a combination of lots of factors, and while I have no doubt the HDF5 format is very efficient, it does not appear that Pandas's interface to it is as performant as reading CSV files.
```python
import pandas as pd

def load_old():
    CRC_organic_data = pd.read_csv('/tmp/Physical Constants of Organic Compounds.csv', sep='\t', index_col=0)

%timeit load_old()
# 18.9 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CRC_organic_data = pd.read_csv('/tmp/Physical Constants of Organic Compounds.csv', sep='\t', index_col=0)

CRC_organic_data.to_hdf('/tmp/example.h5', 'my_hdf_table', mode='w', format='table', complib='blosc:zstd', complevel=9)
%timeit pd.read_hdf('/tmp/example.h5', key='my_hdf_table')
# 34.7 ms ± 874 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

CRC_organic_data.to_hdf('/tmp/example.h5', 'my_hdf_table', mode='w', format='table')
%timeit pd.read_hdf('/tmp/example.h5', key='my_hdf_table')
# 32 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CRC_organic_data.to_hdf('/tmp/example.h5', 'my_hdf_table', mode='w', format='fixed', complib='blosc:zstd', complevel=9)
%timeit pd.read_hdf('/tmp/example.h5', key='my_hdf_table')
# 26.6 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

CRC_organic_data.to_hdf('/tmp/example.h5', 'my_hdf_table', mode='w', format='fixed')
%timeit pd.read_hdf('/tmp/example.h5', key='my_hdf_table')
# 24.2 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
No matter the settings, it seems to be considerably slower. The CSVs also have the benefit of having trackable changes in git. I am closing the issue because I am cleaning up, and it seems certain to me that CSV-style files read through Pandas are good enough.
Sincerely, Caleb
After reading #53, I couldn't agree with you more @CalebBell - this project is huge (in a great "lots of functionality" way, but also definitely in a pain-to-maintain way).
One thought as I was working on #60 was that loading the data with Pandas from text files might be slower than another storage option. A drop-in replacement would be to use `pd.HDFStore`, which can be compressed with several different options. A single compressed file could store native Pandas DataFrames (which is what you end up with after running `pd.read_csv()` anyway). This would be relatively easy: move the `pd.read_csv()` statements out into a "Make HDF" script. So, for example:
https://github.com/CalebBell/thermo/blob/9f4175a9bcebc450baeba0b4f62398686a804c27/thermo/phase_change.py#L51-L52
would get moved to the "Make HDF" script (see below) and replaced with something like:
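As a hedged sketch, the import-time `pd.read_csv` call could shrink to a single `pd.read_hdf` read; the file path, key name, and sample data below are illustrative only, not taken from the project:

```python
import pandas as pd

# Illustrative stand-in for the table currently parsed from TSV at import time;
# in practice the HDF5 file would be built ahead of time by the "Make HDF" script.
df = pd.DataFrame({'Tb': [371.6, 398.8]}, index=['n-heptane', 'n-octane'])
df.to_hdf('/tmp/thermo_data.h5', key='phase_change', mode='w')

# The per-module pd.read_csv(...) call would then become a single read:
phase_change_data = pd.read_hdf('/tmp/thermo_data.h5', key='phase_change')
```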
This would be a particularly good improvement if any data transformation/calculation can be/needs to be done after the data is read in by `read_csv`: that extra calculation/processing can be moved to the "Make HDF" script, and then only the useful/ready DataFrame needs to be stored. Making the HDF5 file is as simple as:
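A minimal sketch of such a "Make HDF" script, assuming hypothetical file names, keys, and sample data (the real script would iterate over the project's actual CSV/TSV files):

```python
import pandas as pd

# Illustrative stand-in for one of the project's TSV data files.
csv_df = pd.DataFrame({'Hfus': [14.03, 20.65]}, index=['n-heptane', 'n-octane'])

# Any post-read processing happens once, here, instead of at every import.
csv_df['Hfus_J'] = csv_df['Hfus'] * 1000.0

# Store one or more finished DataFrames in a single compressed HDF5 file.
with pd.HDFStore('/tmp/thermo_data.h5', mode='w',
                 complib='blosc:zstd', complevel=9) as store:
    store.put('phase_change', csv_df)
```

All the frames then live in one compressed file, and reading one back is just `pd.read_hdf('/tmp/thermo_data.h5', 'phase_change')`.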
I'm not 100% sure that it would be faster - intuitively it seems like yes (the decompression speeds are really high) ... it would for sure be smaller and make deployment easier to manage (potentially one data file instead of many small ones). And if there's any post-read processing that can be done ... that step's load time would definitely improve.
There could also be other ways to store the data - however, since you're already using Pandas, this would be a 1:1 swapout since `pd.HDFStore` stores native Pandas DataFrames :grin: