AlexTISYoung / snipar

Imputation of parental genotypes, inference of sibling IBD segments, family based GWAS, and polygenic score analyses.
MIT License
24 stars 5 forks source link

Issue with Calculated Scores and Effects in snipar #31

Closed PandaPowell closed 1 year ago

PandaPowell commented 1 year ago

Hi there,

First of all, I'd like to express my appreciation for creating the snipar tool – it's been incredibly valuable in my research.

However, I've encountered some discrepancies when using snipar to calculate PGIs using the pgs.py script. Specifically, I've noticed discrepancies in the calculated scores and direct/indirect effects when working with non-imputed complete trios compared to results from PLINK and PRSice2 using the same input files.

When using snipar's pgs.py script to compute scores and effects for fathers and mothers within complete trios, I've observed unusual results, particularly with fathers having significantly larger indirect effects compared to mothers. In contrast, when constructing PGIs using PLINK and PRSice2 with identical input files, the resulting scores and direct/indirect effects seem to align more sensibly.

I am currently looking into solutions but was hoping you could point me in the right direction. Is it expected to observe such variations between snipar and PLINK/PRSice2?

I appreciate the help.

Below is an excerpt of the code I've been using as well as the results.

/opt/software/versions/Python-3.8.3/bin/python3.8 ~/bin/pgs.py ./PGS/nonImp_direct \ --bed /home/r055442/GenR_trio_duos/European/Unphased_plink/EU_trio_duos_qc \ --weights TB_snps_GWS.txt \ --pedigree king_pedigree.txt \ --SNP SNP \ --beta_col b \ --A1 A1 \ --A2 A2 \ --no_am_adj

snipar results:

intercept -0.3184900915503549 0.4213717452056926 AGECHILD13 0.6187514297850482 0.065049096367718 GENDER 0.5119690037751246 0.0456294872363309 PC1 58.295617419578875 89.89010678220191 PC2 14.199614550388121 16.506109847826696 PC3 -11.658331730011618 16.342479733621758 PC4 -3.4131219226976306 10.229518152133148 PC5 6.125962910592555 9.75580559131821 PC6 2.0867224608324513 14.909449205213575 PC7 -1.9778547881605764 4.020370424302356 PC8 1.0566154135484203 3.0492383101905407 PC9 -0.10356116267358671 2.0601000784087495 PC10 4.670396739910904 7.887381787071474 proband 0.06989054168799796 0.03345105140816477 paternal 0.08266941781371662 0.02959574046976095 maternal -0.019148471231257614 0.029618742638294098

PRSice/Plink scores results:

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) -11.78773 13.95345 -0.845 0.3984
proband 0.04540 0.03235 1.403 0.1607
father 0.06621 0.02881 2.298 0.0217 mother 0.05871 0.02868 2.047 0.0408 I(AGECHILD13^2) 0.00832 0.07041 0.118 0.9059
AGECHILD13 0.66761 1.96951 0.339 0.7347
GENDER 3.01988 1.94918 1.549 0.1215
PC1 88.17446 94.16606 0.936 0.3492
PC2 31.25962 18.75293 1.667 0.0957 . PC3 -13.27491 17.16911 -0.773 0.4395
PC4 -2.44231 12.83195 -0.190 0.8491
PC5 2.60604 11.24240 0.232 0.8167
PC6 -2.90147 15.41586 -0.188 0.8507
PC7 4.32752 5.38331 0.804 0.4216
PC8 4.18743 3.97015 1.055 0.2917
PC9 -1.04109 2.17161 -0.479 0.6317
PC10 2.63566 8.00310 0.329 0.7419
AGECHILD13:GENDER -0.18463 0.14344 -1.287 0.1982

AlexTISYoung commented 1 year ago

Hi Sam,

Thanks for the email. I haven't noticed this issue before. My guess is that there might be some issue in the way snipar is reading your pedigree file/how your pedigree file is specified.

Your results look to me like the maternal PGI is not matched properly to the proband and paternal PGI since the coefficient on the maternal PGI is not statistically significant. I would suggest checking the correlation between proband, paternal, and maternal PGIs as output in the pgs.txt file. The correlation of the proband and maternal PGI should be at least 0.5, and if it isn't, then something has gone wrong in matching the mothers' genotypes to the right family.

Alex.

On Thu, 31 Aug 2023 at 14:28, Sam Ghatan @.***> wrote:

Hi there,

First of all, I'd like to express my appreciation for creating the snipar tool – it's been incredibly valuable in my research.

However, I've encountered some discrepancies when using snipar to calculate PGIs using the pgs.py script. Specifically, I've noticed discrepancies in the calculated scores and direct/indirect effects when working with non-imputed complete trios compared to results from PLINK and PRSice2 using the same input files.

When using snipar's pgs.py script to compute scores and effects for fathers and mothers within complete trios, I've observed unusual results, particularly with fathers having significantly larger indirect effects compared to mothers. In contrast, when constructing PGIs using PLINK and PRSice2 with identical input files, the resulting scores and direct/indirect effects seem to align more sensibly.

I am currently looking into solutions but was hoping you could point me in the right direction. Is it expected to observe such variations between snipar and PLINK/PRSice2?

I appreciate the help.

Below is an excerpt of the code I've been using as well as the results.

/opt/software/versions/Python-3.8.3/bin/python3.8 ~/bin/pgs.py ./PGS/nonImp_direct --bed /home/r055442/GenR_trio_duos/European/Unphased_plink/EU_trio_duos_qc --weights TB_snps_GWS.txt --pedigree king_pedigree.txt --SNP SNP --beta_col b --A1 A1 --A2 A2 --no_am_adj

snipar results:

intercept -0.3184900915503549 0.4213717452056926 AGECHILD13 0.6187514297850482 0.065049096367718 GENDER 0.5119690037751246 0.0456294872363309 PC1 58.295617419578875 89.89010678220191 PC2 14.199614550388121 16.506109847826696 PC3 -11.658331730011618 16.342479733621758 PC4 -3.4131219226976306 10.229518152133148 PC5 6.125962910592555 9.75580559131821 PC6 2.0867224608324513 14.909449205213575 PC7 -1.9778547881605764 4.020370424302356 PC8 1.0566154135484203 3.0492383101905407 PC9 -0.10356116267358671 2.0601000784087495 PC10 4.670396739910904 7.887381787071474 proband 0.06989054168799796 0.03345105140816477 paternal 0.08266941781371662 0.02959574046976095 maternal -0.019148471231257614 0.029618742638294098

PRSice/Plink scores results:

Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -11.78773 13.95345 -0.845 0.3984 proband 0.04540 0.03235 1.403 0.1607 father 0.06621 0.02881 2.298 0.0217 mother 0.05871 0.02868 2.047 0.0408 I(AGECHILD13^2) 0.00832 0.07041 0.118 0.9059 AGECHILD13 0.66761 1.96951 0.339 0.7347 GENDER 3.01988 1.94918 1.549 0.1215 PC1 88.17446 94.16606 0.936 0.3492 PC2 31.25962 18.75293 1.667 0.0957 . PC3 -13.27491 17.16911 -0.773 0.4395 PC4 -2.44231 12.83195 -0.190 0.8491 PC5 2.60604 11.24240 0.232 0.8167 PC6 -2.90147 15.41586 -0.188 0.8507 PC7 4.32752 5.38331 0.804 0.4216 PC8 4.18743 3.97015 1.055 0.2917 PC9 -1.04109 2.17161 -0.479 0.6317 PC10 2.63566 8.00310 0.329 0.7419 AGECHILD13:GENDER -0.18463 0.14344 -1.287 0.1982

— Reply to this email directly, view it on GitHub https://github.com/AlexTISYoung/snipar/issues/31, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQQS6OV54YF7R7FBRAC7ZLXYB7PLANCNFSM6AAAAAA4F72X3M . You are receiving this because you are subscribed to this thread.Message ID: @.***>

PandaPowell commented 1 year ago

Hey Alex, You were right. The pedigree file was incorrectly specified.

Thanks for the quick response!

PandaPowell commented 1 year ago

Hey Alex,

I have a follow up question I hope you would be so kind as to help me with.

I seem to get disparate results between using complete trios and imputed parental alleles. In summary the maternal score seems to lose effect and become non-significant. I am finding it difficult to determine what the cause is.

I will add that roughly 80% of the missing parental are fathers.

Both commands and files are the same with just the addition/subtraction of the -bpg command. Which leads me to believe the pedigrees or agesex and kingship files are correct.

My code:

/opt/software/versions/Python-3.8.3/bin/python3.8 ~/bin/pgs.py ./PGS/${cohort[$j]}/74SNPs${cohort[$j]} \ --pgs ./PGS/combined.pgs.txt \ --phenofile ../tb_sk_snipar_pheno.txt \ --phen_index $j \ --covar ../covfile${cohort[$j]}.txt \ --bpg \ --scale_phen \ --no_am_adj \ --scale_pgs

Only both parents genotyped results:

intercept   -0.5096473684109325     0.21848901984306376 AGECHILD13  0.6009024482576423      0.06755234099267132 GENDER      0.5118769210627334      0.04635395510843317 PC1   104.50736152915553      47.23623549829467 PC2   21.584501453724624      14.141897467290631 PC3   -22.92774192463032      12.97208254125388 PC4   -1.5126741623208027     12.645139717045987 PC5   5.758154848297591 10.391335225405822 PC6   -10.307739960014576     14.053119544351716 PC7   3.207289575517823 5.845930405970802 PC8   3.526201222854244 4.2015694593975565 PC9   -0.8213476799009531     2.234932441022283 PC10  0.44738540962097556     7.985793682523287 proband     0.054340541478601484    0.031372804533886675 paternal    0.060206231848136935    0.028243075099568762 maternal    0.054678893998546405    0.02845820605734315

with imputed parental alleles

intercept   -0.19719780099064688    0.15532277054188862 AGECHILD13  0.6576038120675906      0.05731665083126726 GENDER      0.502235641480488 0.03803410695884392 PC1   30.086762032882053      33.92195398067472 PC2   9.44831381506459  11.02789868388265 PC3   -4.989209082047889      9.369736980972913 PC4   -4.319428162980448      10.020464753319217 PC5   -2.662836357898456      7.9462276679112085 PC6   -8.07838337846003 10.976627066302749 PC7   2.566663152958617 4.55963060367063 PC8   1.1925766359629262      3.272008732415262 PC9   0.46399524496653277     1.827725043719802 PC10  -5.050701199391118      6.626905387398896 proband     0.08149033029452783     0.02690151139278616 paternal    0.059688244168540724    0.0259605586337371 maternal    0.027848242297308926    0.023930370869140738

Correlations between scores:

Imputed score (one parent missing): proband-paternal: 0.64 proband-maternal:0.53 maternal-paternal:0.014

Non-imputed (both parents genotyped): proband-paternal: 0.48 proband-maternal:0.49 maternal-paternal:-0.012

Any nudges in the right direction are appreciated.

Sam

AlexTISYoung commented 1 year ago

It's not obvious to me that there is an issue here. Are the samples independent? If so, it looks to me like the difference in maternal effect estimates is not anywhere near statistically significant.

On Tue, 5 Sept 2023 at 13:57, Sam Ghatan @.***> wrote:

Hey Alex,

I have a follow up question I hope you would be so kind as to help me with.

I seem to get disparate results between using complete trios and imputed parental alleles. In summary the maternal score seems to lose effect and become non-significant. I am finding it difficult to determine what the cause is.

I will add that roughly 80% of the missing parental are fathers.

Both commands and files are the same with just the addition/subtraction of the -bpg command. Which leads me to believe the pedigrees or agesex and kingship files are correct.

My code:

/opt/software/versions/Python-3.8.3/bin/python3.8 ~/bin/pgs.py ./PGS/${cohort[$j]}/74SNPs${cohort[$j]} --pgs ./PGS/combined.pgs.txt --phenofile ../tb_sk_snipar_pheno.txt --phen_index $j --covar ../covfile${cohort[$j]}.txt --bpg --scale_phen --no_am_adj --scale_pgs

Only both parents genotyped results:

intercept -0.5096473684109325 0.21848901984306376 AGECHILD13 0.6009024482576423 0.06755234099267132 GENDER 0.5118769210627334 0.04635395510843317 PC1 104.50736152915553 47.23623549829467 PC2 21.584501453724624 14.141897467290631 PC3 -22.92774192463032 12.97208254125388 PC4 -1.5126741623208027 12.645139717045987 PC5 5.758154848297591 10.391335225405822 PC6 -10.307739960014576 14.053119544351716 PC7 3.207289575517823 5.845930405970802 PC8 3.526201222854244 4.2015694593975565 PC9 -0.8213476799009531 2.234932441022283 PC10 0.44738540962097556 7.985793682523287 proband 0.054340541478601484 0.031372804533886675 paternal 0.060206231848136935 0.028243075099568762 maternal 0.054678893998546405 0.02845820605734315

with imputed parental alleles

intercept -0.19719780099064688 0.15532277054188862 AGECHILD13 0.6576038120675906 0.05731665083126726 GENDER 0.502235641480488 0.03803410695884392 PC1 30.086762032882053 33.92195398067472 PC2 9.44831381506459 11.02789868388265 PC3 -4.989209082047889 9.369736980972913 PC4 -4.319428162980448 10.020464753319217 PC5 -2.662836357898456 7.9462276679112085 PC6 -8.07838337846003 10.976627066302749 PC7 2.566663152958617 4.55963060367063 PC8 1.1925766359629262 3.272008732415262 PC9 0.46399524496653277 1.827725043719802 PC10 -5.050701199391118 6.626905387398896 proband 0.08149033029452783 0.02690151139278616 paternal 0.059688244168540724 0.0259605586337371 maternal 0.027848242297308926 0.023930370869140738

Correlations between scores:

Imputed score (one parent missing): proband-paternal: 0.64 proband-maternal:0.53 maternal-paternal:0.014

Non-imputed (both parents genotyped): proband-paternal: 0.48 proband-maternal:0.49 maternal-paternal:-0.012

Any nudges in the right direction are appreciated.

Sam

— Reply to this email directly, view it on GitHub https://github.com/AlexTISYoung/snipar/issues/31#issuecomment-1706571064, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQQS6JIFIUY6PZUEFSEN3DXY4OSLANCNFSM6AAAAAA4F72X3M . You are receiving this because you commented.Message ID: @.***>

PandaPowell commented 1 year ago

Thanks for the response.

The samples are independent. So you're suggesting that this is normal variance in effect estimates?

What makes me believe there is an issue is the drop in maternal effect estimate. Which goes from boarderline statistically significant to not close to significance, whereas the paternal estimate remains constant. Additionally, this makes the paternal estimate twice as large as the maternal.

Further, if I randomly remove one parent from a bunch of trios and then re-impute their genotypes and re-construct their scores their are discepancies.

I would like to explore potential sources of this problem and work towards resolving it to ensure the reliability of our results.

If you have any insights or suggestions on how to investigate this issue further or any specific aspects of the analysis to scrutinize, it would greatly appreciated.

AlexTISYoung commented 1 year ago

On Fri, 15 Sept 2023 at 07:02, Sam Ghatan @.***> wrote:

Thanks for the response.

The samples are independent. So you're suggesting that this is normal variance in effect estimates?

What makes me believe there is an issue is the drop in maternal effect estimate. Which goes from boarderline statistically significant to not close to significance, whereas the paternal estimate remains constant. Additionally, this makes the paternal estimate twice as large as the maternal.

The difference between significant and not significant is not itself (usually) statistically significant: http://www.stat.columbia.edu/~gelman/research/published/signif4.pdf. It looks to me like totally normal variation in effect size estimate across independent samples.

Further, if I randomly remove one parent from a bunch of trios and then re-impute their genotypes and re-construct their scores their are discepancies.

I would like to explore potential sources of this problem and work towards resolving it to ensure the reliability of our results.

You haven't presented any evidence that there is a problem. Can you give more details on what you think the problem is? If you change from observed to imputed genotypes, then the estimates will change just due to sampling variation. Without the details, I can't assess whether there is an issue or not.

If you have any insights or suggestions on how to investigate this issue further or any specific aspects of the analysis to scrutinize, it would greatly appreciated.

— Reply to this email directly, view it on GitHub https://github.com/AlexTISYoung/snipar/issues/31#issuecomment-1721335027, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABQQS6KNKSGCRYYG5N26CQDX2RNWXANCNFSM6AAAAAA4F72X3M . You are receiving this because you commented.Message ID: @.***>

PandaPowell commented 1 year ago

Good points. Thanks Alex!