Atkinson-Lab / Tractor

Scripts for implementing the Tractor pipeline
MIT License
44 stars 5 forks source link

NaNs for some variants in output #29

Closed samreenzafer closed 6 months ago

samreenzafer commented 7 months ago

I'm running a 3-ancestry tractor and I get NaN for some variants, and I've noticed its only for those variants, which have 1 or more populations with AF_ancX = 0

1) Can you explain if tractor is actually choosing not to calculate P-values for such positions, and why. 2) Can tractor not treat such positions as a regular globalized GWAS association?

I verified that when I run tractor on the same input, but only give 1 population hap and dosage files as input, I get Pvalue results for all those positions. This would mean that tractor is not doing a local ancestry aware GWAS when only 1 population dosages are provided, is it?

Example output: 3- populations based tractor output. anc0 = AFR, anc1 = EUR and anc2= AMR (My cohort is dominant in Afr and Eur) CHROM POS ID REF ALT AF_anc0 AF_anc1 AF_anc2 LAprop_anc0 LAprop_anc1 LAprop_anc2 LAeff_anc0 LAeff_anc1 LApval_anc0 LApval_anc1 Geff_anc0 Geff_anc1 Geff_anc2 Gpval_anc0 Gpval_anc1 Gpval_anc2 8 205821 . C T 0.03498 0.00047 0.00094 0.41455 0.53202 0.05342 0.5867933365769665 -0.275980166234514 0.0012209901508823107 0.15753876256506594 0.04648066093110761 -17.450806869947634 -18.175099461425173 0.863883930882097 0.9989229240723104 0.9995039199153428 8 206716 . G C 0.01773 0.00218 0.0066 0.41455 0.53202 0.05342 0.5843741523379564 -0.28597165650323303 0.001261827582320208 0.1432488601362374 0.055573211202005056 1.330571431049564 -17.54550391111194 0.8882255510926501 0.19749680296697525 0.9988180297161879 8 206747 . C G 0.01713 0.00218 0.0066 0.41455 0.53202 0.05342 0.5837735937670595 -0.2860281420231862 0.001276367249093116 0.14317199618539617 0.09385740867275061 1.3307863458279647 -17.54529555393314 0.8125228667403726 0.1974232992189573 0.9988180556108245 8 207242 . G A 0.04616 0.00019 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan 8 208175 . T G 0.06486 9e-05 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan 8 208227 . A C 0.03462 0.00047 0.00094 0.41455 0.53202 0.05342 0.58649948554488 -0.275949037607527 0.0012281548283880866 0.1575858179938755 0.05462753081609236 -17.450673605916016 -18.17479306605417 0.8410591302452655 0.9989229409247578 0.9995039282782405 8 209006 . G C 0.0611 0.00028 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan 8 211365 . G A 0.0345 0.00047 0.00094 0.41455 0.53202 0.05342 0.5864994855448779 -0.27594903760753037 0.0012281548283881785 0.1575858179938706 0.05462753081608959 -17.450673605916016 -18.174793066054686 0.8410591302452743 0.9989229409247578 0.9995039282782406 8 211559 . A T 0.03547 0.00558 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan 8 211635 . A C 0.00255 0.1308 0.03393 0.41455 0.53202 0.05342 0.5633672510332808 -0.27457186750119905 0.001879978203044346 0.16478825079831816 -21.427942353297524 -0.2360630636660421 -20.542693693652247 0.9994105894895378 0.4225715386127862 0.9992785348646407 8 211831 . A G 0.03437 0.00047 0.00094 0.41455 0.53202 0.05342 0.5863420974023827 -0.27594723308428054 0.0012318491446336143 0.1575886887838127 0.05907664030957955 -17.450584915698936 -18.174632779883186 0.8283067215504775 0.9989229507441875 0.9995039326531734 8 212134 . T C 0.06595 9e-05 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan 8 212431 . G A 0.01628 0.0 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan 8 212585 . C T 0.03243 0.00047 0.00094 0.41455 0.53202 0.05342 0.586727459572817 -0.27604712744082566 0.0012228544583496716 0.15743135595710156 0.04933957840521465 -17.450826454128542 -18.175186862413863 0.8606224880774684 0.9989229226291162 0.999503917529778 8 212596 . G A 0.03207 0.00047 0.00094 0.41455 0.53202 0.05342 0.5862340972984453 -0.27606973908617627 0.0012343370077641566 0.1573955498682199 0.0635453815950705 -17.450556446357133 -18.174733095827484 0.8210885360654389 0.9989229521486803 0.9995039299150978 8 212796 . G C 0.0351 0.00047 0.00094 0.41455 0.53202 0.05342 0.5872028540737119 -0.27596227263118217 0.0012117519959782528 0.15756547160577028 0.035074669225241245 -17.45105960063877 -18.175506282013092 0.8975416557249976 0.9989228975956651 0.999503908811391 8 213962 . T G 0.01385 0.00019 0.00094 0.41455 0.53202 0.05342 0.5818012611002261 -0.2746938967833825 0.0013349880275129622 0.1596021880992482 0.4904680546030868 -17.893114903301132 -18.167808800088824 0.18618597007533477 0.9993093215293589 0.9995041189102106 8 213988 . C G 0.02769 9e-05 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan 8 214676 . C T 0.06413 9e-05 0.0 0.41455 0.53202 0.05342 nan nan nan nan nan nan nan nan nan nan 8 281474 . A G 0.0 0.00151 0.00094 0.41322 0.53319 0.05359 nan nan nan nan nan nan nan nan nan nan 8 294589 . A C 0.0 0.02304 0.00468 0.41298 0.53324 0.05378 nan nan nan nan nan nan nan nan nan nan

Running 1 population based tractor and showing same positions above. Anc0 = Afr CHROM POS ID REF ALT AF_anc0 LAprop_anc0 Geff_anc0 Gpval_anc0 8 205821 . C T 0.03498 1.0 0.6412522731228197 0.016018128393202897 8 206716 . G C 0.01773 1.0 0.6131118431244662 0.11886720160894383 8 206747 . C G 0.01713 1.0 0.6453075687149562 0.10098343499619976 8 207242 . G A 0.04616 1.0 0.12872366437608104 0.6611541190663489 8 208175 . T G 0.06486 1.0 0.9339493605726387 4.185585197316541e-07 8 208227 . A C 0.03462 1.0 0.6622026139834044 0.013408363791798283 8 209006 . G C 0.0611 1.0 0.3743355007576348 0.10392807669612138 8 211365 . G A 0.0345 1.0 0.6622026139834056 0.013408363791798042 8 211559 . A T 0.03547 1.0 0.6668868413362581 0.00735160056366909 8 211635 . A C 0.00255 1.0 -20.060908661749846 0.9990991633464656 8 211831 . A G 0.03437 1.0 0.6659527732795265 0.012891117266950454 8 212134 . T C 0.06595 1.0 0.9099151786072148 7.449635967596704e-07 8 212431 . G A 0.01628 1.0 1.273847584139422 1.1416550191617484e-05 8 212585 . C T 0.03243 1.0 0.6537431021225084 0.018000536254355784 8 212596 . G A 0.03207 1.0 0.6657178634342557 0.015993569517582372 8 212796 . G C 0.0351 1.0 0.6436369321530563 0.01624838163305082 8 213962 . T G 0.01385 1.0 1.0937502687294598 0.0028722620116477713 8 213988 . C G 0.02769 1.0 0.1295030868752873 0.7353893286125512 8 214676 . C T 0.06413 1.0 0.9386673906669657 3.2801637359264637e-07 8 281474 . A G 0.0 1.0 nan nan 8 294589 . A C 0.0 1.0 nan nan I'm also attaching these outputs as files here, since they may not render properly. github_issue_output_1pop.txt github_issue_output_3pops.txt

Can you help me understand, and guide how I could combine to get best of both results?

Thank You.

JasonTan-code commented 7 months ago

Hi, Yes, this is actually expected. For 3 way admixture, Tractor regress the phenotype via: lm(y ~ LA0 + LA1 + G0 + G1 + G2 + covariates), where LA represents local ancestries and G represents ancestry-specific risk allele counts. For some variants, we may have 0 AF for certain ancestries (e.g. no risk alleles appeared in anc0; in other words, G0 contains all zeros). This can cause the regression fail, and Tractor will yield a NA output.

You tried to only pass anc0.dosage.txt and anc0.hapcount.txt to tractor. This is essentially asking Tractor to compute lm(y ~ LA0 + G0 + covariates). This regression ignores the genetic contributions from anc1 and anc2 and can have a different interpretation to the original Tractor. I generally wouldn’t recommend this approach.

I noticed that you are probably running the Python implementation of Tractor. More recently, we have re-implemented Tractor in R, and it handles NAs more carefully. You may want to try this new implementation, and you should be able to find less NAs.

Hope that helps, Taotao

samreenzafer commented 7 months ago

Hi, Yes, this is actually expected. For 3 way admixture, Tractor regress the phenotype via: lm(y ~ LA0 + LA1 + G0 + G1 + G2 + covariates), where LA represents local ancestries and G represents ancestry-specific risk allele counts. For some variants, we may have 0 AF for certain ancestries (e.g. no risk alleles appeared in anc0; in other words, G0 contains all zeros). This can cause the regression fail, and Tractor will yield a NA output.

You tried to only pass anc0.dosage.txt and anc0.hapcount.txt to tractor. This is essentially asking Tractor to compute lm(y ~ LA0 + G0 + covariates). This regression ignores the genetic contributions from anc1 and anc2 and can have a different interpretation to the original Tractor. I generally wouldn’t recommend this approach.

I noticed that you are probably running the Python implementation of Tractor. More recently, we have re-implemented Tractor in R, and it handles NAs more carefully. You may want to try this new implementation, and you should be able to find less NAs.

Hope that helps, Taotao

Thank you for your reply. I will try out the R version of tractor.

But I still want to ask that if for a disease there is an effect of variants (from multiple ancestries), and variants dominant in 1 ancestry alone, what type of analyses would I use? Would tractor alone not be helpful in tackling this scenario. Or would I need to find a way to combine and then interpret associations from logistic associations (example plink/1.09 done on African subset of samples, Eurpoean subset of samples and Hipanic subset of samples) + tractor (where I have them all in 1 cohort). And before doing this, I'd have to first go through tractor output to find which variants are dominant in any one population.

I wonder if I'm getting the approach right?

JasonTan-code commented 7 months ago

I think you can still apply Tractor, even the AF for some ancestries are 0. The latest R implementation can try its best to compute the effect sizes and p values. However, in the case of AF = 1 or AF = 0 for one or more ancestries, linear/logistic model wouldn’t work for certain ancestries. This R implementation will drop these redundant variables. For instance, if you have two-way admixture, Tractor is essentially running lm(phe ~ LA0 + G0 + G1 + cov).

If G0 is 0 for all samples (AF for anc0 is 0), then R will drop G0, and compute lm(phe ~ LA0 + G1 + cov) for this variant.

If G0 = LA0 for all samples (AF for anc0 is 1), then R will drop G0, and compute lm(phe ~ LA0 + G1 + cov) for this variant.

I hope this clarifies.

samreenzafer commented 6 months ago

I tried the R version and it worked as you suggested. It gave NAs for the populations LA_pop that have 0 AF, but gave a plausible P-value for the population driving the variants.

The only drawback of the R implementation is that it is very memory intensive. It loads all the hap and dosage files for each input population, into memory first and then begins computation. So for example, I had 5 populations for chromosome 1 (the largest) and each hap and dosage file is ~20Gb. My script needed to be allocated 2052 = 200Gb memory for running it.

Would appreciate if you can try to make the R implementation less memory intensive, and also provide the Expected memroy usage on the Readme. I figure out the max memory usage only after multiple failures of my job.

Thank You again for replying.

JasonTan-code commented 6 months ago

Hi,

Thanks so much for the suggestion! Yeah, I agree that the R implementation could be further optimized to make it more efficient! We will work on this once we have a chance!

Thanks again for contacting us, Taotao

On Fri, Feb 16, 2024 at 1:02 PM samreenzafer @.***> wrote:

I tried the R version and it worked as you suggested. It gave NAs for the populations LA_pop that have 0 AF, but gave a plausible P-value for the population driving the variants.

The only drawback of the R implementation is that it is very memory intensive. It loads all the hap and dosage files for each input population, into memory first and then begins computation. So for example, I had 5 populations for chromosome 1 (the largest) and each hap and dosage file is ~20Gb. My script needed to be allocated 2052 = 200Gb memory for running it.

Would appreciate if you can try to make the R implementation less memory intensive, and also provide the Expected memroy usage on the Readme. I figure out the max memory usage only after multiple failures of my job.

Thank You again for replying.

— Reply to this email directly, view it on GitHub https://github.com/Atkinson-Lab/Tractor/issues/29#issuecomment-1949130975, or unsubscribe https://github.com/notifications/unsubscribe-auth/AOARA4COH5UOPIW2KZ7GRWDYT6UMRAVCNFSM6AAAAABCM4IUICVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNBZGEZTAOJXGU . You are receiving this because you modified the open/close state.Message ID: @.***>