AI-sandbox / gnomix

A fast, scalable, and accurate local ancestry method.

Insane memory usage #32

Open BEFH opened 2 years ago

BEFH commented 2 years ago

I have just had a gnomix run die after attempting to use more than 1.4 TB of memory. Yes, terabytes. These are unimputed GSA microarray data phased with Eagle. I am fitting the model myself using the suggested microarray configuration, and for now I am only calculating local ancestry on chromosome 17. Based on the marker overlap with the reference, even after generating the local model, it looks like I will need to either filter the reference before model generation or impute the data.
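For context, a minimal sketch of the kind of overlap check I mean, assuming both the phased query and the reference are bgzipped VCFs (file names are placeholders):

```python
# Rough check of marker overlap between a phased microarray query VCF and a
# reference VCF, to decide whether filtering the reference (or imputing the
# query) is warranted. File paths are placeholders.
import gzip

def vcf_sites(path):
    """Collect CHROM:POS:REF:ALT keys from a (bgzipped) VCF."""
    sites = set()
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            chrom, pos, _vid, ref, alt = line.split("\t", 5)[:5]
            sites.add(f"{chrom}:{pos}:{ref}:{alt}")
    return sites

query = vcf_sites("query_chr17.vcf.gz")          # phased GSA array data
reference = vcf_sites("reference_chr17.vcf.gz")  # reference panel
shared = query & reference

print(f"query sites:     {len(query)}")
print(f"reference sites: {len(reference)}")
print(f"shared:          {len(shared)} "
      f"({100 * len(shared) / max(len(query), 1):.1f}% of query)")
```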

I suspect the issue is partly sample size: that cohort has 31,705 samples. I am also running it on a GDA cohort (10,859 samples) and another cohort of 13 samples, and it did not die on the smallest cohort. I have a couple of questions about how to optimize this:

Firstly, it appears that model generation uses only the reference dataset and not the query samples it will be applied to. I wrote a script to compare the models generated with different query datasets, and they appear to be identical. Is that correct? I ran without calibration; does it also hold when calibration is enabled?
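Roughly along the lines of the script mentioned above, here is a crude version of that check. The paths are hypothetical; gnomix writes its trained model as a pickle, but the exact location depends on the run setup.

```python
# If two runs saved byte-identical model pickles, the query cohort clearly did
# not influence training. Paths below are assumed, not gnomix-guaranteed.
import hashlib

def file_digest(path):
    """SHA-256 of the raw model file."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

model_a = "run_cohort_A/models/model_chm_17.pkl"  # assumed output paths
model_b = "run_cohort_B/models/model_chm_17.pkl"

if file_digest(model_a) == file_digest(model_b):
    print("Byte-identical models: training ignored the query cohort.")
else:
    # Differing bytes are not conclusive (pickling details can vary between
    # runs), so an attribute-by-attribute comparison would be needed.
    print("Model files differ at the byte level; inspect attributes to be sure.")
```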

Secondly, is there any problem with generating the model once and then applying it to each of the different datasets? And do you have a recommendation for a minimal dataset to use for model generation so that it runs as quickly as possible?

guidebortoli commented 2 years ago

Hi @BEFH... Have you been able to work around this? I am having the same problem running 888 samples and training my own model with 32 reference samples... The memory usage at the time of the crash is around 132 GB...

weekend37 commented 2 years ago

Hey @BEFH, sounds like you have a lot of data on your hands. Nice!

Are you attempting to use one of our trained models or are you training your own?

guidebortoli commented 2 years ago

@weekend37 Hi, do you know how I can circumvent this issue? Thanks

BEFH commented 2 years ago

@weekend37, I'm training my own. It's microarray data, so I need to filter the reference to get good overlap. Doing that seems to help, along with splitting the target dataset into multiple batches of samples. I just want to be sure that doing this does not hurt the quality of the reference or the reliability of the accuracy estimates.
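In case it helps others, a minimal sketch of the batching step: split the cohort's sample IDs into fixed-size chunks and write one sample list per chunk, so each chunk can be subset from the phased VCF (e.g. with a tool like bcftools) and run through inference separately. The input file of sample IDs and the batch size are placeholders.

```python
# Split a cohort's sample IDs into per-batch lists for separate inference runs.
from pathlib import Path

BATCH_SIZE = 2000  # tune to available memory

samples = Path("gsa_cohort_samples.txt").read_text().split()

for i in range(0, len(samples), BATCH_SIZE):
    batch = samples[i : i + BATCH_SIZE]
    out = Path(f"gsa_batch_{i // BATCH_SIZE:03d}.txt")
    out.write_text("\n".join(batch) + "\n")
    print(f"{out}: {len(batch)} samples")
```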

arvind0422 commented 2 years ago

Hi @BEFH, would you be able to share the config.yaml file you are using (if you are using a custom config)? If not, please let us know which default config you are using. We can explore some config options that lower your memory requirements. Best, Arvind

guidebortoli commented 2 years ago

In my case, I ended up pruning my data from about 1.5 million markers down to 300k… and it worked…
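If it is useful to anyone, here is a rough illustration of that thinning step in Python: evenly downsample a VCF's sites to a target count and write a CHROM/POS list that a VCF tool can use to subset the data. LD-aware pruning (e.g. with plink) is usually preferable; file names here are placeholders.

```python
# Evenly thin a dense VCF down to ~300k sites and write a keep-list.
import gzip

TARGET = 300_000

# Collect (chrom, pos) for every record in the full ~1.5M-marker VCF.
sites = []
with gzip.open("full_1500k_markers.vcf.gz", "rt") as fh:
    for line in fh:
        if line.startswith("#"):
            continue
        chrom, pos = line.split("\t", 2)[:2]
        sites.append((chrom, pos))

# Keep every k-th site so the retained markers stay spread across the genome.
step = max(1, len(sites) // TARGET)
kept = sites[::step][:TARGET]

with open("kept_sites.tsv", "w") as out:
    for chrom, pos in kept:
        out.write(f"{chrom}\t{pos}\n")

print(f"kept {len(kept)} of {len(sites)} sites (every {step}th)")
```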