Open m-makarious opened 2 years ago
I am still looking for a way to speed this up. So far I have tried converting explicitly to a NumPy array and building arrays of booleans:
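import numpy as np
g_pruned_np = g_pruned.to_numpy()  # convert to a plain NumPy array
g_pruned_np2 = g_pruned_np.astype(dtype=np.int32)  # cast the values to 32-bit integers
two_idx = (g_pruned_np2 == 2)  # boolean mask of entries equal to 2
# etc...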
I have also tried setting copy=False to avoid making a copy of the array and to limit memory use.
However, this did not noticeably speed up the process, so it still needs some looking into.
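One possible reason copy=False made little difference, assuming it was passed to `astype`: NumPy only skips the copy when the requested dtype already matches, so a float-to-int32 cast still allocates a new array. The sketch below illustrates this on a small made-up array (`g` is a hypothetical stand-in, not the real data) and also shows that the boolean mask can be built from the original array without casting first.

```python
import numpy as np

# Hypothetical stand-in for the pruned matrix (values 0, 1, 2); not real data.
g = np.array([[0.0, 1.0, 2.0], [2.0, 2.0, 0.0]], dtype=np.float64)

# The dtype changes (float64 -> int32), so NumPy has to allocate a new array
# even though copy=False was requested.
g_int = g.astype(np.int32, copy=False)
shares_after_cast = np.shares_memory(g, g_int)

# The dtype already matches, so copy=False hands back the original array.
g_same = g.astype(np.float64, copy=False)
returns_same_object = g_same is g

# Comparing the original array directly skips the integer cast entirely;
# the resulting boolean mask uses one byte per element.
two_idx = (g == 2)

print(shares_after_cast, returns_same_object, int(two_idx.sum()))
```

Whether skipping the cast helps in the actual munging code depends on what the later steps need, so treat this as something to try rather than a confirmed fix.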
After some troubleshooting, it appears that, in combination with the pandas_plink package, the following line is slow and uses far more memory than the rest of the munging process (benchmarks show roughly 10-20x, if not more, compared with all previous steps):
https://github.com/GenoML/genoml2/blob/8040f2b1b460cc6085527e5fd65963518459cd11/genoml/preprocessing/munging.py#L169
This is a problem because jobs on large datasets may be killed prematurely unless they are given generous memory and time allocations, which is not feasible on most local computers.
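For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of timing a single step and recording its peak memory using only the standard library. `profile_step` and `build_list` are hypothetical names for illustration and are not part of genoml; the real munging step would be passed in place of `build_list`.

```python
import time
import tracemalloc


def profile_step(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


# Hypothetical stand-in for a munging step.
def build_list(n):
    return list(range(n))


result, elapsed, peak = profile_step(build_list, 100_000)
print(f"{elapsed:.4f} s, peak {peak / 1e6:.1f} MB")
```

Note that tracemalloc only sees allocations made through Python's memory allocators, so memory held by native libraries may be undercounted, and tracing itself adds some overhead to the timing.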