CalebBell / thermo

Thermodynamics and Phase Equilibrium component of Chemical Engineering Design Library (ChEDL)
MIT License

Data Loading/Storage #61

Closed danieldjewell closed 3 years ago

danieldjewell commented 3 years ago

After reading #53, I couldn't agree with you more @CalebBell - this project is huge (in a great "lots of functionality" way, but also in a definitely-a-pain-to-maintain way).

One thought as I was working on #60: loading the data from text files with Pandas might be slower than other storage options. A drop-in replacement would be pd.HDFStore, which supports several compression options. A single compressed file could store native Pandas DataFrames (which is what you end up with after running pd.read_csv() anyway).

This would be relatively easy. For example:

https://github.com/CalebBell/thermo/blob/9f4175a9bcebc450baeba0b4f62398686a804c27/thermo/phase_change.py#L51-L52

This code would be moved into a "make HDF" script (see below) and replaced with something like:

# Preferably share this store across modules
import pandas as pd

hdf = pd.HDFStore('data.h5', mode='r')
# ...

TM_ON_Data = hdf['/Phase_Change/OpenNotebook_Melting_Points']
# TM_ON_Data is now a pandas DataFrame, just like what pd.read_csv() returns
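
The "share this" comment could look something like the following minimal sketch (the module name data_store.py and the lazy-open helper are just my illustration):

# data_store.py (illustrative module name)
import pandas as pd

_hdf = None

def get_store():
    # Open the shared read-only HDF5 store once; every caller reuses the handle.
    global _hdf
    if _hdf is None:
        _hdf = pd.HDFStore('data.h5', mode='r')
    return _hdf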

This would be a particularly good improvement wherever transformations or calculations happen after read_csv: that processing can be moved into the "make HDF" script, so only the final, ready-to-use dataframe needs to be stored.

Making the HDF5 file is as simple as:


import pandas as pd

# Multiple compressors are available for complib
hdf = pd.HDFStore('hdfname.h5', mode='w', complib='blosc:zstd', complevel=9)

df = pd.read_csv(...)
# Optional transformation (xyz is a placeholder)
df = df.reindex(xyz)
hdf.append('/Phase/CRCPhaseTable1', df)

hdf.flush()
hdf.close()
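
To sanity-check the result, the table can be read straight back:

store = pd.HDFStore('hdfname.h5', mode='r')
check = store['/Phase/CRCPhaseTable1']  # same key used in the append above
store.close()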

I'm not 100% sure that it would be faster - intuitively it seems like yes (the decompression speeds are really high). It would for sure be smaller, and it would make deployment easier to manage (potentially one data file instead of many small ones). And if any post-read processing can be moved out, that would definitely improve load time on that step.
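
The size part is easy to verify; something like this (the file paths are placeholders) would compare the on-disk footprint:

import os

# Paths are placeholders for real data files
csv_size = os.path.getsize('Physical Constants of Organic Compounds.csv')
h5_size = os.path.getsize('data.h5')
print(f'CSV: {csv_size / 1024:.1f} KiB, HDF5: {h5_size / 1024:.1f} KiB')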

There could also be other ways to store the data - however, since you're already using Pandas, this would be a 1:1 swapout since pd.HDFStore stores native Pandas DataFrames :grin:

CalebBell commented 3 years ago

Hi Daniel,

It is kind of you to have done all of this research for the project. I didn't look into this at the time, but Pandas has since made some changes to their CSV parser, so I took another look today.

It is important to realize that speed is a combination of lots of factors, and while I have no doubt the HDF5 format is very efficient, it does not appear that Pandas's interface to it is as performant as reading CSV files.

import pandas as pd
def load_old():
    CRC_organic_data = pd.read_csv('/tmp/Physical Constants of Organic Compounds.csv', sep='\t', index_col=0)
%timeit load_old()

18.9 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

CRC_organic_data = pd.read_csv('/tmp/Physical Constants of Organic Compounds.csv', sep='\t', index_col=0)

# table format, blosc:zstd level 9 compression
CRC_organic_data.to_hdf('/tmp/example.h5', 'my_hdf_table', mode='w', format='table', complib='blosc:zstd', complevel=9)
%timeit pd.read_hdf('/tmp/example.h5', key='my_hdf_table')

# table format, no compression
CRC_organic_data.to_hdf('/tmp/example.h5', 'my_hdf_table', mode='w', format='table')
%timeit pd.read_hdf('/tmp/example.h5', key='my_hdf_table')

# fixed format, blosc:zstd level 9 compression
CRC_organic_data.to_hdf('/tmp/example.h5', 'my_hdf_table', mode='w', format='fixed', complib='blosc:zstd', complevel=9)
%timeit pd.read_hdf('/tmp/example.h5', key='my_hdf_table')

# fixed format, no compression
CRC_organic_data.to_hdf('/tmp/example.h5', 'my_hdf_table', mode='w', format='fixed')
%timeit pd.read_hdf('/tmp/example.h5', key='my_hdf_table')

34.7 ms ± 874 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)   # table, compressed
32 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)    # table, uncompressed
26.6 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # fixed, compressed
24.2 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)  # fixed, uncompressed

No matter the settings, it seems to be considerably slower. The CSVs also have the benefit of having track-able changes in git. I am closing the issue as part of a cleanup, because it seems certain to me that CSV-style files read through Pandas are good enough.

Sincerely, Caleb