-doASSO with gl or beagle input

MichelMoser commented 7 years ago

Hello,

I try to detect association using already computed GL files like beagle.gz or glf.gz . As i have some quite heavy bam-files, i thought i can skip the processing of them (takes about 2 days on 8 cores) and feed the GL files directly to angsd for association calculation.

Unfortunately i must be doing something really wrong as but i dont know exactly what.

I would expect angsd to compute LRT for all the sites in the GL files:

marker  allele1 allele2 Ind0
Peaxi162Scf00000_5300   3       0       0.391864
Peaxi162Scf00000_5365   1       3       0.748007
Peaxi162Scf00000_5367   3       2       0.748007
Peaxi162Scf00000_8465   0       3       0.067234
Peaxi162Scf00000_9674   1       3       0.799785
Peaxi162Scf00000_9729   0       3       0.000272
Peaxi162Scf00000_10309  1       3       0.333333
Peaxi162Scf00000_10342  3       1       0.666224
Peaxi162Scf00000_11253  2       0       0.526842

Instead i get output for almost every base in the genome with LRT values of either NAN or -0.00000 (which i would expect for non-informative sites).

command used:

./angsd -yBin STYLEpheno.ybin -doAsso 1 -out asso -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -beagle gwas_min20_b.beagle.gz -P 8 -fai /data/users/mmoser/Peaxi_genome_v1.6.2.scaffolds.fasta.fai -nInd 22

stderr get huge within minutes printing:

  -> angsd version: 0.917-72-gbebe199 (htslib: 1.4-1-g879855b) build(Mar 15 2017 07:55:45)
        -> SNP-filter using a pvalue: 1.000000e-01 correspond to 2.705543 likelihood units
PRE[0]:gls:(18099639291645254506538595914408908062395263746129366737630871885930377648130683888582123081219592143850665802530280975341806367075488872139570180896128058704912293053770607157607116715995985756433563034728303539185503961088.000000,2767112490107621392238289215098381863946232917995916836554276144231700935603773245422333866395049535367096560149772673013883436730679952728253076183880438906880.000000,2795000724386476046756428915975067809295011262972632342523610104026247682311154041341485209621448795886717686836312260499856077544936580363470207733742566375424.000000) W(inf,inf,inf) sum=-nan
PRE[1]:gls:(0.000000,0.000000,59923796594707289261245125085279715483891488469150419486307847325931360098150797397651098722911263796367847090485210319865543917788582602940653602648721010181560210751106318336.000000) W(0.998001,0.001998,inf) sum=-nan
PRE[2]:gls:(0.000000,0.000000,0.000000) W(0.998001,0.001998,0.000001) sum=-nan
PRE[3]:gls:(0.000000,0.000000,0.000000) W(0.998001,0.001998,0.000001) sum=-nan
PRE[4]:gls:(0.000000,0.000000,0.000000) W(0.998001,0.001998,0.000001) sum=-nan
PRE[5]:gls:(0.000000,0.000000,0.000000) W(0.998001,0.001998,0.000001) sum=-nan
PRE[6]:gls:(0.000000,0.000000,0.000000) W(0.998001,0.001998,0.000001) sum=-nan
PRE[7]:gls:(0.000000,0.000000,0.000000) W(0.998001,0.001998,0.000001) sum=-nan
PRE[8]:gls:(0.000000,0.000000,0.000000) W(0.998001,0.001998,0.000001) sum=-nan
PRE[9]:gls:(0.000000,0.000000,0.000000) W(0.998001,0.001998,0.000001) sum=-nan
PRE[10]:gls:(164227739222643006852379790353624379785784233973920265194389169849318395284453771513303005893840618366064774289023592904147421420454299164264389573923598277574215174550100225309854924800.000000,0.000000,0.000000) W(inf,0.001998,0.000001) sum=-nan
PRE[0]:gls:(18099639291645254506538595914408908062395263746129366737630871885930377648130683888582123081219592143850665802530280975341806367075488872139570180896128058704912293053770607157607116715995985756433563034728303539185503961088.000000,2767112490107621392238289215098381863946232917995916836554276144231700935603773245422333866395049535367096560149772673013883436730679952728253076183880438906880.000000,2795000724386476046756428915975067809295011262972632342523610104026247682311154041341485209621448795886717686836312260499856077544936580363470207733742566375424.000000) W(-nan,-nan,-nan) sum=-nan
PRE[1]:gls:(0.000000,0.000000,59923796594707289261245125085279715483891488469150419486307847325931360098150797397651098722911263796367847090485210319865543917788582602940653602648721010181560210751106318336.000000) W(-nan,-nan,-nan) sum=-nan
PRE[2]:gls:(0.000000,0.000000,0.000000) W(-nan,-nan,-nan) sum=-nan
PRE[3]:gls:(0.000000,0.000000,0.000000) W(-nan,-nan,-nan) sum=-nan
PRE[4]:gls:(0.000000,0.000000,0.000000) W(-nan,-nan,-nan) sum=-nan
PRE[5]:gls:(0.000000,0.000000,0.000000) W(-nan,-nan,-nan) sum=-nan
PRE[6]:gls:(0.000000,0.000000,0.000000) W(-nan,-nan,-nan) sum=-nan
PRE[7]:gls:(0.000000,0.000000,0.000000) W(-nan,-nan,-nan) sum=-nan
PRE[8]:gls:(0.000000,0.000000,0.000000) W(-nan,-nan,-nan) sum=-nan
PRE[9]:gls:(0.000000,0.000000,0.000000) W(-nan,-nan,-nan) sum=-nan
PRE[10]:gls:(164227739222643006852379790353624379785784233973920265194389169849318395284453771513303005893840618366064774289023592904147421420454299164264389573923598277574215174550100225309854924800.000000,0.000000,0.000000) W(-nan,-nan,-nan) sum=-nan
........

the output looks like:

Chromosome      Position        Major   Minor   Frequency       LRT
Peaxi162Scf00000        1       C       A       0.333333        -nan
Peaxi162Scf00000        5       C       A       0.500000        -0.000000
Peaxi162Scf00000        6       C       A       0.250000        -nan
Peaxi162Scf00000        7       G       A       0.250000        -nan
Peaxi162Scf00000        9       C       A       0.500000        -nan
Peaxi162Scf00000        11      C       A       0.500000        -0.000000
Peaxi162Scf00000        12      C       A       0.500000        -0.000000
Peaxi162Scf00000        15      C       A       0.500000        -nan
Peaxi162Scf00000        16      T       A       0.500000        -0.000000
Peaxi162Scf00000        18      C       A       0.250000        -0.000000
Peaxi162Scf00000        20      C       A       0.500000        -nan
Peaxi162Scf00000        21      G       A       0.500000        -nan
Peaxi162Scf00000        23      C       A       0.500000        -0.000000
Peaxi162Scf00000        24      T       A       0.250000        -nan
Peaxi162Scf00000        25      C       A       0.500000        -nan
Peaxi162Scf00000        28      G       A       0.250000        -0.000000
Peaxi162Scf00000        30      C       A       1.000000        -0.000000
Peaxi162Scf00000        33      C       A       0.500000        -nan
Peaxi162Scf00000        35      C       A       0.500000        -0.000000
Peaxi162Scf00000        37      C       A       0.500000        -nan
Peaxi162Scf00000        39      C       A       0.500000        -nan
Peaxi162Scf00000        40      C       A       0.500000        -0.000000
Peaxi162Scf00000        42      G       A       0.250000        -nan
Peaxi162Scf00000        43      C       A       0.500000        -nan
Peaxi162Scf00000        44      T       A       0.500000        -0.000000
Peaxi162Scf00000        46      C       A       0.250000        -0.000000

Small follow-up question:

If i try to use beagle 4.1 for imputations, the like= option is not recognized anymore. Do you know if i have to format the angsd-beagle-file to vcf or is there an alternative command to use?

Sorry for such basic questions about how to use the tool but i could not find information about this anywhere.

Thank you, Michel

aalbrechtsen commented 7 years ago

Dear Michel

Your command "./angsd -yBin STYLEpheno.ybin -doAsso 1 -out asso -doMajorMinor 1 -doMaf 1 -SNP_pval 1e-6 -beagle gwas_min20_b.beagle.gz -P 8 -fai /data/users/mmoser/Peaxi_genome_v1.6.2.scaffolds.fasta.fai -nInd 22"

should give the error " -> Potential problem: Cannot estimate the major and minor based on posterior probabilities"

Are you sure this is the command you used?

ANGSD assumes that the -beagle input is posterior probabilities (such as the probability you get from imputation) and not genotype likelihoods.

-Anders

MichelMoser commented 7 years ago

Hello Anders,

I currently cant reproduce the commands and output but you are right about the warning given the command i posted.

Could you tell me how to run Assocation test withouth the beagle imputation? Because newer versions of beagle only take vcf as input format (to my knowledge). I could write some conversion but to save time it would be easier without.

As -beagle did not work i ran angsd with the bam files as input:

command:

/home/mmoser/angsd/angsd -yBin STYLEpheno.ybin -GL 1 -doAsso 1 -out asso -doMajorMinor 1 -doMaf 1 -bam bam.filelist -P 8 -doPost 1 -fai /data/users/mmoser/Peaxi_genome_v1.6.2.scaffolds.fasta.fai -nInd 22 -SNP_pval 1e-2

output of asso.lrt0.gz

Chromosome      Position        Major   Minor   Frequency       LRT
Peaxi162Scf00000        5300    T       A       0.302125        -0.000000
Peaxi162Scf00000        5365    C       T       0.258312        -0.000000
Peaxi162Scf00000        5366    C       T       0.063891        -0.000000
Peaxi162Scf00000        5367    T       G       0.237802        -0.000000
Peaxi162Scf00000        5990    C       T       0.075368        -0.000000
Peaxi162Scf00000        6014    A       G       0.232140        -0.000000
Peaxi162Scf00000        6022    C       G       0.067783        -0.000000
Peaxi162Scf00000        6036    G       T       0.113906        -0.000000
Peaxi162Scf00000        6066    C       T       0.286320        -0.000000

All the LRT values were -0.000. Does that mean they did not get under the p-value threshold of 0.001? Or is there something wrong with the command?

Thank you, Michel

phenotype_file:

beagle.gz:

marker  allele1 allele2 Ind0    Ind0    Ind0    Ind1    Ind1    Ind1    Ind2    Ind2    Ind2    Ind3    Ind3    Ind3    Ind4    Ind4    Ind4    Ind5    Ind5    Ind5    Ind6    Ind6      Ind6    Ind7    Ind7    Ind7    Ind8    Ind8    Ind8    Ind9    Ind9    Ind9    Ind10   Ind10   Ind10   Ind11   Ind11   Ind11   Ind12   Ind12   Ind12   Ind13   Ind13   Ind13     Ind14   Ind14   Ind14   Ind15   Ind15   Ind15   Ind16   Ind16   Ind16   Ind17   Ind17   Ind17   Ind18   Ind18   Ind18   Ind19   Ind19   Ind19   Ind20   Ind20   Ind20   Ind21     Ind21   Ind21
Peaxi162Scf00000_5300   3       0       0.391864        0.357039        0.251098        0.017502        0.459925        0.522574        0.988413        0.007722        0.003866 0.267768 0.287485        0.444747        0.038356        0.337209        0.624435        0.000000        0.111109        0.888891        0.735745        0.183933        0.080322 0.143434 0.390432        0.466134        0.571042        0.285519        0.143439        0.359192        0.281617        0.359192        0.735745        0.183933        0.080322 0.041756 0.525561        0.432683        0.333333        0.333333        0.333333        0.709983        0.177493        0.112525        0.571042        0.285519        0.143439 0.115131 0.471033        0.413836        0.980943        0.015326        0.003731        0.007155        0.425449        0.567395        0.951439        0.029731        0.018830 0.000102 0.644380        0.355518        0.768288        0.192069        0.039643        0.861854        0.107729        0.030418
Peaxi162Scf00000_5365   1       3       0.748007        0.186999        0.064994        0.054067        0.585038        0.360895        0.960903        0.030027        0.009070 0.034268 0.680486        0.285247        0.379640        0.336747        0.283612        0.000000        0.267206        0.732794        0.926158        0.057883        0.015959 0.980473 0.007659        0.011868        0.689947        0.172484        0.137569        0.002100        0.531195        0.466705        0.901829        0.056362        0.041809 0.040363 0.510833        0.448803        0.655688        0.327841        0.016470        0.604219        0.305482        0.090299        0.276655        0.385051        0.338294 0.934467 0.058402        0.007131        0.474689        0.365449        0.159862        0.006588        0.390028        0.603384        0.720803        0.180198        0.098999 0.008419 0.396898        0.594683        0.333333        0.333333        0.333333        0.526842        0.263419        0.209740

aalbrechtsen commented 7 years ago

Hi Michael

If you want to perform association without imputation then you can either use the test based on allele frequency differences between cases and controls (-doAsso 1) og use the logistic regression model (-doAsso 2).

Examples of how to run then are given on the wiki http://www.popgen.dk/angsd/index.php/Association

Since you only have 22 individuals it will be very hard to find anything. Usually a GWAS has thousands of individuals. A LRT of 0 (as in your output) means a p-value of one. Most likely the site is not polymorphic (you choose a very lenient threshold ) or there is no information in the cases or the control. You have read about the output here: http://www.popgen.dk/angsd/index.php/Association#Output

If you want to do imputation then you can use beagle 3 https://faculty.washington.edu/browning/beagle/beagle.jar https://faculty.washington.edu/browning/beagle/beagle_3.3.2_31Oct11.pdf

-Anders

ANGSD / angsd

-doASSO with gl or beagle input #77