apriha / snps

tools for reading, writing, merging, and remapping SNPs
BSD 3-Clause "New" or "Revised" License

Optimize normalized snps dataframe dtypes #108

Open apriha opened 3 years ago

apriha commented 3 years ago

Update the dtype of rsid, chrom, and genotype columns to be pandas.StringDtype as recommended here.

Also require pandas>1.0.0.
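
For reference, a minimal sketch of the proposed conversion on a toy frame. The column names follow the normalized snps dataframe described in this issue; the rows and the variable names are purely illustrative and not taken from the snps code base.

```python
import pandas as pd

# Toy stand-in for a normalized snps dataframe (values are illustrative only).
df = pd.DataFrame(
    {
        "rsid": ["rs3094315", "rs12562034", "rs3934834"],
        "chrom": ["1", "1", "1"],
        "pos": [752566, 768448, 1005806],
        "genotype": ["AA", "AG", "CC"],
    }
)

# Convert the text columns to the dedicated string extension dtype.
string_cols = ["rsid", "chrom", "genotype"]
df = df.astype({col: pd.StringDtype() for col in string_cols})

print(df.dtypes)
```

Passing a dict to astype leaves pos untouched, so only the three text columns change dtype.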

afaulconbridge commented 3 years ago

Have you thought about using CategoricalDtype for chrom and genotype? See here

apriha commented 3 years ago

That's a great idea and will really help reduce memory usage for those columns.
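
A rough way to see the effect is to compare memory_usage(deep=True) for an object column and its categorical equivalent. This is only a sketch on synthetic data; the chromosome labels and the row count are made up, and real savings depend on the file.

```python
import pandas as pd

n = 100_000  # arbitrary size for the comparison

# A column with few distinct values, stored as Python string objects.
chrom_obj = pd.Series(["1", "2", "X", "MT"] * (n // 4), dtype=object)

# The same values stored as small integer codes plus a category lookup table.
chrom_cat = chrom_obj.astype("category")

obj_bytes = chrom_obj.memory_usage(deep=True)
cat_bytes = chrom_cat.memory_usage(deep=True)

print(f"object:   {obj_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
```

The saving comes from the low cardinality of chrom and genotype: each row stores a small integer code instead of a pointer to a string.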

And compared to object, it looks like StringDtype for the rsid column will also use less memory.

apriha commented 3 years ago

Note that in a quick test with one of the example files, s._snps.index = s._snps.index.astype(pd.StringDtype()) reduces memory usage by a factor of ~2.5 (very desirable). However, just using .loc with an rsid label coerces the index back to object dtype (e.g., s.snps.loc["rs3094315"]).

It seems that to maintain the rsid column as pd.StringDtype(), either another method would have to be used to filter SNPs, e.g., s.snps.loc[s.snps.index == "rs3094315"] (less convenient), or astype would have to be called after a .loc to convert the dtype back to pd.StringDtype() (which temporarily uses more memory while the dtype is object).
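
To make the first alternative concrete, here is a small sketch of filtering with a boolean mask on the index instead of a label lookup. The frame is a toy stand-in for s.snps, and whether the index actually holds pd.StringDtype() (rather than falling back to object) may depend on the pandas version, so the example only prints the dtype.

```python
import pandas as pd

# Toy stand-in for s.snps, indexed by rsid (values are illustrative only).
snps = pd.DataFrame(
    {
        "chrom": ["1", "1", "1"],
        "pos": [752566, 768448, 1005806],
        "genotype": ["AA", "AG", "CC"],
    },
    index=pd.Index(["rs3094315", "rs12562034", "rs3934834"], name="rsid"),
)
snps.index = snps.index.astype(pd.StringDtype())

# Boolean-mask filter: compares every label instead of doing a label lookup.
hit = snps.loc[snps.index == "rs3094315"]

print(hit)
print(snps.index.dtype)
```

The mask-based form always returns a dataframe (here with one row), whereas s.snps.loc["rs3094315"] returns a Series for a unique label, so callers would need to adjust accordingly.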

So, the following dtypes seem like a good trade-off between memory and convenience:

| Column | pandas dtype |
| --- | --- |
| rsid | object |
| chrom | pd.CategoricalDtype() (ordered after sorting chroms) |
| pos | pd.UInt32Dtype() |
| genotype | pd.CategoricalDtype() |
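
A sketch of how these dtypes could be applied, assuming a toy frame indexed by rsid. The rsids, positions, and the chrom_order list are hypothetical and only serve to illustrate the ordered categorical; they are not the ordering logic used by snps.

```python
import pandas as pd

# Toy data; rsids and positions are made up for illustration.
df = pd.DataFrame(
    {
        "rsid": ["rs1", "rs2", "rs3", "rs4"],
        "chrom": ["X", "1", "MT", "2"],
        "pos": [100, 752566, 50, 200],
        "genotype": ["AA", "AG", "CC", "AG"],
    }
).set_index("rsid")

# Hypothetical chromosome ordering: autosomes, then X, Y, PAR, MT.
chrom_order = [str(i) for i in range(1, 23)] + ["X", "Y", "PAR", "MT"]
observed = set(df["chrom"])
present = [c for c in chrom_order if c in observed]

df = df.astype(
    {
        "chrom": pd.CategoricalDtype(categories=present, ordered=True),
        "pos": pd.UInt32Dtype(),
        "genotype": pd.CategoricalDtype(),
    }
)

# With an ordered categorical, sorting follows chromosome order, not string order.
df = df.sort_values(["chrom", "pos"])
print(df)
print(df.dtypes)
```

The index (rsid) is left as object here, matching the table above.
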
apriha commented 3 years ago

Upon further investigation, it looks like object and pd.StringDtype() use the same amount of memory. Resetting the index dtype as above actually just freed the memory used by a hash table, which was generated when label-based lookups were performed on the rsid index internal to snps (e.g., to determine the build). See this issue for an explanation of the hash table behavior: https://github.com/pandas-dev/pandas/issues/31197 .

So, I think to be explicit, rsid should be pd.StringDtype() after all.

The pandas issue provides ideas on how to prevent the hash table from being generated (e.g., only performing boolean indexing or not using rsid as the index internal to snps).
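
As a sketch of the second idea, rsid could be kept as a regular pd.StringDtype() column with a default integer index, and lookups done with a boolean mask. This is only an illustration on toy data, not how snps currently stores its dataframe; the variable names are hypothetical.

```python
import pandas as pd

# Toy frame with rsid as an ordinary column and a default RangeIndex.
snps = pd.DataFrame(
    {
        "rsid": ["rs3094315", "rs12562034", "rs3934834"],
        "chrom": ["1", "1", "1"],
        "pos": [752566, 768448, 1005806],
        "genotype": ["AA", "AG", "CC"],
    }
).astype({"rsid": pd.StringDtype()})

# Boolean indexing on the column; no label lookup on an rsid index is involved.
mask = snps["rsid"] == "rs3094315"
hit = snps.loc[mask]

print(hit)
```

The trade-off is that each lookup scans the whole column, so this favors memory over lookup speed when many individual rsids are queried.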