Open francois-a opened 6 years ago
Here are a few benchmarks (in Python, using %timeit) based on the GTEx V7 gene-level read counts matrix, available from the GTEx Portal (GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads.gct.gz).
Conversion to other formats:
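    import pandas as pd
    import feather

    # ref_df is the counts DataFrame loaded from the .gct.gz file (see the read_csv call in the read benchmarks)
    base_name = 'GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads'
    ref_df.to_parquet(base_name+'.parquet')
    # feather does not store the index, so move it into a regular column first
    feather.write_dataframe(ref_df.reset_index(), base_name+'.feather')
    ref_df.to_hdf(base_name+'.hdf_comp6', key='counts_df', complevel=6)
    ref_df.to_hdf(base_name+'.hdf', key='counts_df')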
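The file sizes quoted next to each format in the read benchmarks can be gathered with a small helper. This is only a sketch: `human_size` and `report_sizes` are hypothetical names, not part of pandas or feather, and the rounding may differ slightly from what `ls -lh` or `du -h` would print.

```python
import os

def human_size(num_bytes):
    """Format a byte count with binary units (1K = 1024 bytes)."""
    size = float(num_bytes)
    for unit in ("B", "K", "M", "G", "T"):
        if size < 1024 or unit == "T":
            return f"{size:.1f}{unit}"
        size /= 1024

def report_sizes(paths):
    """Map each existing path to its human-readable size on disk."""
    return {p: human_size(os.path.getsize(p)) for p in paths if os.path.exists(p)}

# Example: extensions match the conversion step above; missing files are skipped.
base_name = 'GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads'
extensions = ('.gct.gz', '.hdf', '.hdf_comp6', '.parquet', '.feather')
for path, size in report_sizes([base_name + ext for ext in extensions]).items():
    print(size, path)
```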
Read times for the full table (56202 rows x 11689 columns):
    # gct.gz 496M
    %timeit ref_df = pd.read_csv(base_name+'.gct.gz', sep='\t', skiprows=2, index_col=0)
    3min 32s ± 2.31 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # hdf5 (no compression) 2.5G
    %timeit ref_df = pd.read_hdf(base_name+'.hdf')
    1.17 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # hdf5 (zlib, level 6) 408M
    %timeit ref_df = pd.read_hdf(base_name+'.hdf_comp6')
    12.2 s ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # parquet (snappy compression) 1.1G
    %timeit ref_df = pd.read_parquet(base_name+'.parquet')
    6.2 s ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # feather (no compression) 2.5G
    %timeit ref_df = feather.read_dataframe(base_name+'.feather').set_index('Name')
    3.09 s ± 26.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
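To repeat these timings outside IPython, where %timeit is not available, something like the following could be used. It is a rough sketch built on the standard-library `timeit` module; `time_call` is a hypothetical helper name, and its defaults (7 runs, 1 loop each) simply mirror the settings reported above rather than reproducing %timeit exactly.

```python
import statistics
import timeit

def time_call(func, repeat=7, number=1):
    """Return (mean, std. dev.) of seconds per call over `repeat` runs.

    Each run calls `func` `number` times, loosely mirroring the
    "mean ± std. dev. of 7 runs, 1 loop each" summary printed by %timeit.
    """
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_call = [t / number for t in totals]
    mean = statistics.mean(per_call)
    # The sample standard deviation needs at least two runs.
    stdev = statistics.stdev(per_call) if len(per_call) > 1 else 0.0
    return mean, stdev

# Example with a cheap stand-in workload; for the real benchmark, pass e.g.
# lambda: pd.read_parquet(base_name + '.parquet')
mean, stdev = time_call(lambda: sum(range(100_000)), repeat=7, number=10)
print(f"{mean * 1e3:.3f} ms ± {stdev * 1e3:.3f} ms per loop")
```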
Read times for a single column:
    c = ['GTEX-11XUK-0926-SM-5EQL3']

    # gct.gz
    %timeit ref_df = pd.read_csv(base_name+'.gct.gz', sep='\t', skiprows=2, usecols=['Name']+c, index_col=0)
    26.4 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # hdf5: N/A

    # parquet
    %timeit ref_df = pd.read_parquet(base_name+'.parquet', columns=c)
    195 ms ± 1.92 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    # feather
    %timeit ref_df = feather.read_dataframe(base_name+'.feather', columns=['Name']+c).set_index('Name')
    18.5 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
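For anyone unfamiliar with the GCT layout, here is a toy illustration of what the `skiprows=2` / `usecols` call above is doing. The gene and sample IDs are made up, and `read_gct_columns` is a hypothetical wrapper. It passes `index_col='Name'` by name instead of position, which should select the same column. Because the text format is row-oriented, the parser presumably still has to scan every line even when only one column is kept, which would be consistent with the gct.gz timing being so much slower than the columnar formats.

```python
import io

import pandas as pd

def read_gct_columns(path_or_buffer, columns):
    """Read only the 'Name' index and the requested sample columns from a GCT file."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        skiprows=2,  # skip the '#1.2' version line and the dimensions line
        usecols=["Name"] + list(columns),
        index_col="Name",
    )

# Toy GCT content: version line, (rows, columns) line, then the data table.
toy_gct = (
    "#1.2\n"
    "3\t2\n"
    "Name\tDescription\tSAMPLE-A\tSAMPLE-B\n"
    "GENE1\tdesc1\t1\t2\n"
    "GENE2\tdesc2\t3\t4\n"
    "GENE3\tdesc3\t5\t6\n"
)

subset = read_gct_columns(io.StringIO(toy_gct), ["SAMPLE-B"])
print(subset)
```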
For some useful information on differences between feather and parquet, see https://stackoverflow.com/a/48097717