Munging Hang-up - Scalability issue with large files and pandas_plink conversion

GenoML / genoml2

GenoML (genoml2) is an open source Python package. It is an automated machine learning (autoML) platform for genomics data

Apache License 2.0


Munging Hang-up - Scalability issue with large files and pandas_plink conversion #31

Open m-makarious opened 2 years ago

m-makarious commented 2 years ago

https://github.com/GenoML/genoml2/blob/8040f2b1b460cc6085527e5fd65963518459cd11/genoml/preprocessing/munging.py#L169

After some troubleshooting, it seems that this line, which operates on data loaded through the pandas_plink package, is slow and uses far more memory than the rest of the munging process (benchmarks show roughly 10-20x, if not more, compared with all previous steps):

g_pruned.values = g_pruned.values.astype('int')

This is a problem because jobs on large datasets can be killed prematurely unless they are given a generous memory and time allocation, which is not feasible on most local computers.
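To illustrate why the target dtype matters for memory, here is a minimal sketch using plain NumPy. The array below is a hypothetical stand-in for the genotype matrix, not the object pandas_plink returns. It assumes genotypes are coded 0/1/2 with no missing values, since NaN cannot be represented in an integer dtype. On most 64-bit Linux and macOS builds `astype('int')` yields a 64-bit integer, so the sketch uses `np.int64` explicitly for comparison.

```python
import numpy as np

# Hypothetical stand-in for a genotype matrix (samples x variants),
# coded 0/1/2 and stored as floats. Assumes no missing (NaN) calls.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 1_000)).astype(np.float32)

# 64-bit integers: 8 bytes per genotype call.
as_int64 = genotypes.astype(np.int64)

# 8-bit integers: 1 byte per call, still enough to hold 0, 1 and 2.
as_int8 = genotypes.astype(np.int8)

for name, arr in [("float32", genotypes), ("int64", as_int64), ("int8", as_int8)]:
    print(f"{name:8s} {arr.nbytes / 1e6:.2f} MB")
```

Whether a narrower dtype is safe for GenoML depends on how later steps use the values, so this is a direction to test rather than a confirmed fix.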

m-makarious commented 2 years ago

I am still looking for a faster way to do this conversion. So far I have tried converting explicitly to a NumPy array and building boolean index arrays to speed up the process:

import numpy as np

g_pruned_np = g_pruned.to_numpy()
g_pruned_np2 = g_pruned_np.astype(dtype=np.int32)
two_idx = (g_pruned_np2 == 2)  # boolean mask of entries equal to 2
# etc...

I have also tried setting copy=False to avoid making a copy of the array and to limit memory use.

However, this did not noticeably speed up the process. It still needs looking into.
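One likely reason copy=False made no difference: in NumPy, `astype(..., copy=False)` only returns the original array when the requested dtype already matches. When the dtype changes, a new buffer has to be allocated regardless. The sketch below shows this on a small hypothetical array; it does not use pandas_plink or the actual `g_pruned` object.

```python
import numpy as np

# Small hypothetical float array standing in for genotype data.
floats = np.array([[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]], dtype=np.float32)

# Same dtype requested: copy=False lets NumPy hand back the same buffer.
same_dtype = floats.astype(np.float32, copy=False)

# Different dtype requested: a new array is allocated despite copy=False.
new_dtype = floats.astype(np.int32, copy=False)

reused_buffer = np.shares_memory(floats, same_dtype)
allocated_new = not np.shares_memory(floats, new_dtype)

print("same dtype reuses memory:", reused_buffer)
print("dtype change allocates new memory:", allocated_new)
```

So a float-to-integer conversion always needs memory for the output array; the main lever is the size of the output dtype, not the copy flag.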