HumanCellAtlas / table-testing

requirements, examples, and tests for expression matrix file formats
MIT License

Benchmarks with GTEx data #2

Open francois-a opened 6 years ago

francois-a commented 6 years ago

Here are a few benchmarks (in Python, timed with IPython's %timeit) based on the GTEx V7 gene-level read counts matrix, which is available from the GTEx Portal (GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads.gct.gz).
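For anyone who wants to rerun these timings outside IPython, here is a minimal sketch of a %timeit-style harness built on the standard library. The helper name `time_call` and the stand-in workload are mine, not part of the original benchmarks, and the use of the population standard deviation is my understanding of what %timeit reports rather than something verified here.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=1):
    """Return (mean, std. dev.) of seconds per call over `repeat` runs of `number` calls each."""
    # timeit.repeat returns the total time for `number` calls, once per run.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [t / number for t in totals]
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


# Stand-in workload (hypothetical); swap in e.g. lambda: pd.read_parquet(path) for a real run.
mean_s, std_s = time_call(lambda: sum(range(100_000)))
print(f"{mean_s * 1e3:.3f} ms +- {std_s * 1e3:.3f} ms per loop (7 runs, 1 loop each)")
```

With `number=1` each run does a single read, which matches the "7 runs, 1 loop each" setup used for the full-table timings.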

Conversion to other formats:

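import pandas as pd
import feather
base_name = 'GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads'
# load the reference table; skiprows=2 skips the GCT version and dimension lines
ref_df = pd.read_csv(base_name+'.gct.gz', sep='\t', skiprows=2, index_col=0)
ref_df.to_parquet(base_name+'.parquet')
# feather does not store the index, so move it into a regular column first
feather.write_dataframe(ref_df.reset_index(), base_name+'.feather')
ref_df.to_hdf(base_name+'.hdf_comp6', key='counts_df', complevel=6)
ref_df.to_hdf(base_name+'.hdf', key='counts_df')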

Read times for the full table (56202 rows x 11689 columns):

# gct.gz 496M
%timeit ref_df = pd.read_csv(base_name+'.gct.gz', sep='\t', skiprows=2, index_col=0)
3min 32s +- 2.31 s per loop (mean +- std. dev. of 7 runs, 1 loop each)

# hdf5 (no compression) 2.5G
%timeit ref_df = pd.read_hdf(base_name+'.hdf')
1.17 s +- 27.3 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

# hdf5 (zlib, level 6) 408M
%timeit ref_df = pd.read_hdf(base_name+'.hdf_comp6')
12.2 s +- 12.9 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

# parquet (snappy compression) 1.1G
%timeit ref_df = pd.read_parquet(base_name+'.parquet')
6.2 s +- 21.3 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

# feather (no compression) 2.5G
%timeit ref_df = feather.read_dataframe(base_name+'.feather').set_index('Name')
3.09 s +- 26.8 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
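To make the full-table numbers above easier to compare, here is a small sketch that expresses each format's mean read time and file size relative to gct.gz. It only rearranges the figures reported above; the one assumption is that sizes given in G convert at 1G = 1024M.

```python
# (size in M, mean full-table read time in seconds), copied from the figures above.
# Sizes given in G are converted assuming 1G = 1024M (an assumption).
results = {
    "gct.gz": (496, 3 * 60 + 32),
    "hdf5 (no compression)": (2.5 * 1024, 1.17),
    "hdf5 (zlib, level 6)": (408, 12.2),
    "parquet (snappy)": (1.1 * 1024, 6.2),
    "feather (no compression)": (2.5 * 1024, 3.09),
}

base_size, base_time = results["gct.gz"]

# For each format: (read speedup vs gct.gz, file size as a multiple of gct.gz).
summary = {
    name: (base_time / seconds, size / base_size)
    for name, (size, seconds) in results.items()
}

for name, (speedup, size_ratio) in summary.items():
    print(f"{name:26s} {speedup:7.1f}x faster read, {size_ratio:5.2f}x the size of gct.gz")
```

These ratios come from a single machine and a single dataset, so they are best read as rough orders of magnitude.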

Read times for a single column:

c = ['GTEX-11XUK-0926-SM-5EQL3']

# gct.gz
%timeit ref_df = pd.read_csv(base_name+'.gct.gz', sep='\t', skiprows=2, usecols=['Name']+c, index_col=0)
26.4 s +- 4.16 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

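# hdf5: N/A (the default 'fixed' format written by to_hdf above can only be read in its entirety; it does not support selecting columns)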

# parquet
%timeit ref_df = pd.read_parquet(base_name+'.parquet', columns=c)
195 ms +- 1.92 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

# feather
%timeit ref_df = feather.read_dataframe(base_name+'.feather', columns=['Name']+c).set_index('Name')
18.5 ms +- 136 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
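One likely reason the gct.gz single-column read is so much slower than parquet or feather is that a gzip-compressed text table has to be decompressed and split row by row even when only one column is kept. The sketch below illustrates this with the standard library on a tiny in-memory GCT file. The gene and sample names are made up, and `read_one_column` is a hypothetical helper, not how pandas implements `usecols`.

```python
import csv
import gzip
import io

# Tiny GCT v1.2 file built in memory; names and values are placeholders, not GTEx data.
raw = (
    "#1.2\n"      # line 1: format version
    "2\t3\n"      # line 2: number of rows, number of sample columns
    "Name\tDescription\tSAMPLE_1\tSAMPLE_2\tSAMPLE_3\n"
    "GENE_A\tdesc_a\t0\t5\t12\n"
    "GENE_B\tdesc_b\t7\t0\t3\n"
)
compressed = gzip.compress(raw.encode("utf-8"))


def read_one_column(gz_bytes, sample):
    """Return ({gene: count}, rows_scanned) for one sample column."""
    rows_scanned = 0
    column = {}
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as handle:
        handle.readline()  # skip the version line
        handle.readline()  # skip the dimensions line (hence skiprows=2 in read_csv)
        reader = csv.reader(handle, delimiter="\t")
        idx = next(reader).index(sample)
        # Every row is decompressed and split, even though one field is kept.
        for row in reader:
            rows_scanned += 1
            column[row[0]] = int(row[idx])
    return column, rows_scanned


column, rows_scanned = read_one_column(compressed, "SAMPLE_2")
print(column, rows_scanned)
```

Column-oriented formats such as parquet and feather store each column's values together, so a reader can go straight to the requested column instead of scanning every row.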

For some useful information on differences between feather and parquet, see https://stackoverflow.com/a/48097717