Optimize normalized snps dataframe dtypes

apriha commented 3 years ago

Update the dtype of rsid, chrom, and genotype columns to be pandas.StringDtype as recommended here.

Also require pandas>1.0.0.

afaulconbridge commented 3 years ago

Have you thought about using CategoricalDtype for chrom and genotype? See here.

apriha commented 3 years ago

That's a great idea and will really help reduce memory usage for those columns.

And compared to object, it looks like StringDtype for the rsid column will also use less memory.
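As a rough illustration of the categorical saving (a sketch on synthetic data, not a measurement from snps; the genotype values and row count are made up), memory usage can be compared with memory_usage(deep=True):

```python
import pandas as pd

# Synthetic genotype column: few distinct values repeated many times,
# which is the case where a categorical dtype pays off.
n = 100_000
genotypes = pd.Series(["AA", "AG", "GG", "CT"] * (n // 4))

as_object = genotypes.astype(object)
as_category = genotypes.astype("category")

# deep=True counts the Python string objects, not just the pointers.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The categorical version stores one small integer code per row plus a single copy of each distinct value, so the gain depends on how few distinct values the column has.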

apriha commented 3 years ago

Note that in a quick test with one of the example files, s._snps.index = s._snps.index.astype(pd.StringDtype()) reduces memory usage by ~2.5 times (very desirable). However, just using .loc with an rsid label coerces the index back to object dtype (e.g., s.snps.loc["rs3094315"]).

It seems that keeping the rsid index as pd.StringDtype() would require one of two things: filtering SNPs another way (e.g., s.snps.loc[s.snps.index == "rs3094315"]), which is less convenient, or calling astype after each .loc to convert the dtype back to pd.StringDtype(), which temporarily uses more memory while the dtype is object.
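The two options can be sketched on a small stand-in frame (the data below is illustrative, and whether .loc changes the index dtype may depend on the pandas version, so the re-cast is shown as a defensive step rather than a requirement):

```python
import pandas as pd

# Small stand-in for the normalized snps dataframe (values are illustrative).
snps = pd.DataFrame(
    {
        "chrom": ["1", "1", "1"],
        "pos": [752566, 768448, 1005806],
        "genotype": ["AA", "AG", "CC"],
    },
    index=pd.Index(["rs3094315", "rs12562034", "rs3934834"], name="rsid"),
)
snps.index = snps.index.astype(pd.StringDtype())

# Option 1: filter with a boolean mask instead of a label lookup.
by_mask = snps.loc[snps.index == "rs3094315"]

# Option 2: use a label lookup (a list keeps the result a DataFrame),
# then cast the index back to StringDtype explicitly.
by_label = snps.loc[["rs3094315"]]
by_label.index = by_label.index.astype(pd.StringDtype())

print(by_mask.index.dtype, by_label.index.dtype)
```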

So, the following dtypes seem like a good trade-off between memory and convenience:

Column	pandas dtype
rsid	object
chrom	pd.CategoricalDtype() (ordered after sorting chroms)
pos	pd.UInt32Dtype()
genotype	pd.CategoricalDtype()
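A minimal sketch of applying the dtypes in this table, assuming a small hand-made frame; chrom_order is a hypothetical stand-in for the sorted chromosome list that snps would compute:

```python
import pandas as pd

# Illustrative data only; rsid stays as the (object) index.
df = pd.DataFrame(
    {
        "rsid": ["rs3094315", "rs12562034", "rs3934834"],
        "chrom": ["1", "1", "X"],
        "pos": [752566, 768448, 1005806],
        "genotype": ["AA", "AG", "CC"],
    }
).set_index("rsid")

# Hypothetical ordering; in practice this would come from sorting the chroms.
chrom_order = ["1", "X"]

df["chrom"] = df["chrom"].astype(
    pd.CategoricalDtype(categories=chrom_order, ordered=True)
)
df["pos"] = df["pos"].astype(pd.UInt32Dtype())
df["genotype"] = df["genotype"].astype(pd.CategoricalDtype())

print(df.dtypes)
```

Making chrom an ordered categorical means sorting by chrom follows the category order rather than plain string order.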

apriha commented 3 years ago

Upon further investigation, it looks like object and pd.StringDtype() use the same amount of memory. Resetting the index dtype as above actually just freed the memory used by a hash table, which was generated when label-based lookups were performed on the rsid index internally within snps (e.g., to determine the build). See this issue for an explanation of the hash table behavior: https://github.com/pandas-dev/pandas/issues/31197

So, I think to be explicit, rsid should be pd.StringDtype() after all.

The pandas issue provides ideas on how to prevent the hash table from being generated (e.g., only performing boolean indexing or not using rsid as the index internal to snps).
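A rough way to observe this, assuming synthetic rsids and that Index.memory_usage includes the lookup engine's size (which is how pandas reports it as far as I know), is to measure the index before and after a label lookup. The second half sketches the alternative of keeping rsid as an ordinary column and filtering with a boolean mask:

```python
import pandas as pd

# Synthetic rsids; the count is arbitrary.
rsids = [f"rs{i}" for i in range(100_000)]
snps = pd.DataFrame(
    {"genotype": ["AA"] * len(rsids)},
    index=pd.Index(rsids, name="rsid"),
)

before = snps.index.memory_usage(deep=True)
_ = snps.loc["rs500"]  # label-based lookup on the index
after_lookup = snps.index.memory_usage(deep=True)

print(f"index before lookup: {before} bytes")
print(f"index after lookup:  {after_lookup} bytes")

# Alternative: keep rsid as a regular column and use boolean indexing,
# so no label lookup on an rsid index is needed.
flat = snps.reset_index()
row = flat[flat["rsid"] == "rs500"]
```

The exact numbers will vary by pandas version and platform, so the comparison is only meant to show whether the lookup adds to the index's reported size.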

apriha / snps

Optimize normalized snps dataframe dtypes #108