AI-sandbox / XGMix

Does XGMix work with imputed data? #7

Closed gglab21 closed 3 years ago

gglab21 commented 3 years ago

Hello,

Does XGMix work with imputed data as the query file? I am currently getting: 'initializing apriori reference subpop across CRF... Failed allocating memory at load-input.cpp:680 (8.2 Mb)'. Could this be caused by the dosage-formatted VCF file?

Any help would be appreciated. Thanks!

weekend37 commented 3 years ago

Hi there!

It sounds a bit like you were running RFMix instead of XGMix, is that possible? There are no CRFs or .cpp files in this repository.

To answer the question, XGMix shouldn't have any issues with processing imputed data. However, since imputation is usually done without leveraging any ancestry information, too much of it can obscure the signal and lead to worse inference.

gglab21 commented 3 years ago

Sorry, the error code is from RFMix. I went deep down a rabbit hole trying to find a resolution to my problem and confused the terminals. From my understanding, when training a model from scratch XGMix uses RFMix's simulation algorithm to generate the training data. During the simulation step, XGMix is 'Killed' with no other error message. I was running RFMix to help debug. Here is the output from XGMix:

Launching XGMix in train mode...
Reading sample maps and splitting in train/val...
Running simulation...
Fast admix...
File read: 1100291 SNPs for 2504 individuals
Building founders
Simulating...
Simulating generation 2
Simulating generation 4
Simulating generation 6
Simulating generation 8
Simulating generation 12
Simulating generation 16
Killed

Essentially, I am attempting to train a model using the GRCh38 build of the 1000G data. I did not see this build among the models you provided. Do you plan on releasing trained models for this build?

Thanks

weekend37 commented 3 years ago

Hi again,

Yes, thanks. This makes more sense. It most certainly is a memory error. May I ask if you're running this on a local machine?

You can try changing the following parameters in config.py:

r_admixed = 3.0                   # (instead of 10.0)
generations = [0, 4, 8, 16, 32]   # (instead of [0, 2, 4, 6, 8, 12, 16, 24, 32, 48])

That should reduce the memory requirement by a factor of ~7 without hurting the performance by much. Let’s revisit if that does not work.
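As a rough check of the ~7x figure, here is a back-of-the-envelope sketch. It assumes the volume of simulated data scales with r_admixed times the number of entries in generations; that scaling is an assumption for illustration, not something confirmed from the XGMix code, and the helper name relative_sim_size is hypothetical.

```python
# Back-of-the-envelope estimate only.
# Assumption: simulated data volume is proportional to
# r_admixed * number of generations. relative_sim_size is a
# hypothetical helper, not part of XGMix.

def relative_sim_size(r_admixed, generations):
    """Return a unitless proxy for the size of the simulated training set."""
    return r_admixed * len(generations)

# Values quoted in the comment above as the defaults.
default = relative_sim_size(10.0, [0, 2, 4, 6, 8, 12, 16, 24, 32, 48])
# Suggested reduced settings.
reduced = relative_sim_size(3.0, [0, 4, 8, 16, 32])

reduction_factor = default / reduced
print(f"estimated reduction: {reduction_factor:.1f}x")
```

Under that assumption the ratio is 100 / 15, which is consistent with the "factor of ~7" quoted above.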

Regarding the trained models: yes, we will eventually publish trained models for build 38. The exact date is to be determined, but it will be before summer, and it helps to know that there is demand.

gglab21 commented 3 years ago

No, I am running this on an SGE cluster with around 256 GB of memory per node, so memory normally isn't an issue. chr22 ran successfully with the new parameters in the config file, but all other chromosomes still fail:

Traceback (most recent call last):
  File "XGMIX.py", line 368, in <module>
    instance_name=instance_name, mode_filter_size=mode_filter_size, smooth_depth=smooth_depth)
  File "XGMIX.py", line 249, in main
    args.reference_file, args.genetic_map_file, num_outs_per_gen, generations)
  File "/XGMix/Admixture/fast_admix.py", line 47, in main_admixture_fast
    random_seed=94305,verbose=verbose)
  File "/XGMix/pyadmix/admix.py", line 37, in simulate
    sample_map_data)
  File "/XGMix/pyadmix/utils.py", line 162, in build_founders
    paternal["anc"] = np.array([i[1]["population_code"]]*chm_length_snps)
MemoryError: Unable to allocate array with shape (7074549,) and data type int64

Do you have any other suggestions? The query file is already filtered down to a MAF>0.005 and R2>0.3. Does it need to be more stringent? Any help is appreciated.

Thanks
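One thing that may be worth ruling out: on a shared cluster, a job can be capped well below the node's physical memory, which would be one possible explanation for a small allocation failing on a 256 GB node. This is only a guess about the setup described above. The sketch below uses Python's standard resource module (Unix only) to print the address-space limit the process actually sees; describe_limit is a hypothetical helper name.

```python
import resource

def describe_limit(value):
    """Format a limit in bytes as GiB, or 'unlimited' if no cap is set."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**30:.1f} GiB"

# RLIMIT_AS is the maximum address space (virtual memory) of the process.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("address-space limit (soft):", describe_limit(soft))
print("address-space limit (hard):", describe_limit(hard))
```

Running this inside the same job script as the training run would show whether a per-job cap applies; if both values are unlimited, the cause lies elsewhere.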

weekend37 commented 3 years ago

Hi again - thanks for the reply and sorry for my late one.

I don't think the problem is with your filtering. We've been working on reducing the memory need with the fast simulation and we'll get back to you once we've updated the code. Shouldn't be long until we do so.
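For scale, the array in the traceback above can be sized directly. The sketch below only does the arithmetic with NumPy and shows one generic way to build the same kind of constant array more compactly. It is not the maintainers' fix, and the population_code value is made up for illustration.

```python
import numpy as np

# Array length reported in the MemoryError above.
chm_length_snps = 7_074_549

# One int64 label per SNP costs 8 bytes; int8 costs 1 byte.
bytes_int64 = chm_length_snps * np.dtype(np.int64).itemsize
bytes_int8 = chm_length_snps * np.dtype(np.int8).itemsize
print(f"int64: {bytes_int64 / 1e6:.1f} MB, int8: {bytes_int8 / 1e6:.1f} MB")

# Illustration on a small array: np.full avoids building a temporary
# Python list first, and int8 suffices when the population codes fit
# in the range -128..127 (an assumption about the codes in use).
population_code = 3  # hypothetical value
anc = np.full(1000, population_code, dtype=np.int8)
print(anc.dtype, anc.nbytes)
```

A single array of this length is only tens of megabytes, so the failure most likely reflects memory already consumed elsewhere in the run rather than this one allocation.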

guidebortoli commented 3 years ago

Hi @gglab21, I'm also trying to train a model from scratch with the GRCh38 build, but for the HGDP data. To save time, I'm testing on chr22. Have you found a reliable genetic map for the GRCh38/hg38 build? Thanks

gglab21 commented 3 years ago

Hi @guidebortoli, I used the genetic map files that are used for Eagle phasing by Broad and the Michigan Imputation Server. The links below should get you where you need to go:
https://imputationserver.readthedocs.io/en/latest/create-reference-panels/
https://data.broadinstitute.org/alkesgroup/Eagle/downloads/tables/

If you also need the centromeres, I pulled them from UCSC: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/centromere.txt.gz
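A minimal sketch for turning that file into one span per chromosome. It assumes the usual UCSC table layout of tab-separated columns bin, chrom, chromStart, chromEnd, name; check this against the file you download. The function names and the sample rows are made up for illustration.

```python
import gzip

def centromere_spans(lines):
    """Collapse centromere rows into one (start, end) span per chromosome.

    Assumes tab-separated columns: bin, chrom, chromStart, chromEnd, name.
    """
    spans = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        chrom, start, end = fields[1], int(fields[2]), int(fields[3])
        lo, hi = spans.get(chrom, (start, end))
        spans[chrom] = (min(lo, start), max(hi, end))
    return spans

def load_centromeres(path):
    """Read a gzipped table from disk and return the per-chromosome spans."""
    with gzip.open(path, "rt") as handle:
        return centromere_spans(handle)

# Made-up rows, only to show the expected shape of the input.
sample = [
    "1\tchrA\t100\t200\tGJ000001.1\n",
    "2\tchrA\t200\t350\tGJ000002.1\n",
    "3\tchrB\t50\t80\tGJ000003.1\n",
]
spans = centromere_spans(sample)
print(spans)
```

Calling load_centromeres on the downloaded file would give a dictionary keyed by chromosome name, which can then be used to mask or split windows around each centromere.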