DReichLab / AdmixTools

Tools test whether admixture occurred and more
183 stars 64 forks source link

All D-stat values are zero #57

Open carlos-sarabia opened 5 years ago

carlos-sarabia commented 5 years ago

Hi, I am trying to run a test on D-stat with a single chromosome (chr38) of dog using an output of PLINK. I am only considering autosomes.

I first ran convertf in this .par file: genotypename: DATA/chr38.bed snpname: DATA/chr38.bim indivname: DATA/chr38.pedind outputformat: EIGENSTRAT genotypeoutname: testchr38.geno snpoutname: testchr38.snp indivoutname: testchr38.ind numchrom: 38

Input pedind: 1 speciesO.ind1 0 0 1 1 2 species1.ind1 0 0 2 1 3 species1.ind2 0 0 1 1 4 species1.ind3 0 0 2 1 5 species1.ind4 0 0 2 1 6 species1.ind5 0 0 2 1 7 species1.ind6 0 0 1 1 8 species2.ind1 0 0 1 1 9 species2.ind2 0 0 1 1 ...

Output was: testchr38.geno 000000100000000100000010 111101101010010001000101 011001000110100000001100 000001001010000100011000 001001000101101100011000 000000000000000010000000 000000000000000000000001 200000000000000000000000 001101100000000000000000 ...

testchr38.snp chr38_5 38 0.000000 5 C T chr38_6 38 0.000000 6 G A chr38_7 38 0.000000 7 A T chr38_8 38 0.000000 8 T C chr38_10 38 0.000000 10 C T chr38_38 38 0.000000 38 G C chr38_43 38 0.000000 43 C A chr38_242 38 0.000002 242 A G

testchr38.ind 1:speciesO.ind1 M Control 2:species1.ind1 F Control 3:species1.ind2 M Control 4:species1.ind3 F Control 5:species1.ind4 F Control 6:species1.ind5 F Control 7:species1.ind6 M Control 8:species2.ind1 M Control 9:species2.ind2 M Control ...

I am trying to see admixture from a species2 into a specific population of species1 (pop1) vs another (species1, pop2). I have speciesO as outgroup. species1.ind1 and .ind2 belong to pop1, species1.ind3 ind4 ind5 and ind6 belong to pop2.

I modified the .ind file to work with pops: ind1 M speciesO ind1 F species1.POP1 ind2 M species1.POP1 ind3 F species1.POP2 ind4 F species1.POP2 ind5 F species1.POP2 ind6 M species1.POP2 ind1 M species2 ind2 M species2 ...

And set up a .popfile with pops (test of contribution of species.POP2 to species2 also included): speciesO species2 species1.POP1 species1,POP2 speciesO species1.POP2 species2 species1.POP1

.par file is: genotypename: testchr38.geno snpname: testchr38.snp indivname: testchr38.modified.ind popfilename: popfile.divided.pops numchrom: 38

when I run qpDstat -p dstat.testchr38.par:

THE INPUT PARAMETERS

PARAMETER NAME: VALUE

genotypename: testchr38.geno snpname: testchr38.snp indivname: testchr38.modified.ind popfilename: popfile.divided.groups numchrom: 38

qpDstat version: 755

number of quadruples 3 0 speciesO 1 1 species2 1 2 species1.POP1 4 3 species1.POP2 2 jackknife block size: 0.050 snps: 27956 indivs: 8 number of blocks for jackknife: 1 nrows, ncols: 8 27956 result: speciesO species2 species1.POP1 species1.POP2 0.0000 0.000 0 0 0 result: speciesO species1.POP1 species2 species1.POP2 0.0000 0.000 0 0 0

end of qpDstat: 0.306 seconds cpu 21.373 Mbytes in use


What could be failing?

  1. There are some additional species in the .bed and .bim not considered for ABBA-BABA
  2. I am using 38 chromosomes
  3. Original naming of the individuals when calling .bed and .bim in PLINK were species1.ind1, species1.ind2, species2.ind1, speciesO.ind1 ...

Thanks!

bumblenick commented 5 years ago

The last field of your .ind file should be the population name. You have "Control" Write a script to fix this.

Nick

On Thu, Oct 10, 2019 at 11:47 AM Carlos Sarabia notifications@github.com wrote:

Hi, I am trying to run a test on D-stat with a single chromosome (chr38) of dog using an output of PLINK. I am only considering autosomes.

I first ran convertf in this .par file: genotypename: DATA/chr38.bed snpname: DATA/chr38.bim indivname: DATA/chr38.pedind outputformat: EIGENSTRAT genotypeoutname: testchr38.geno snpoutname: testchr38.snp indivoutname: testchr38.ind numchrom: 38

Input pedind: 1 speciesO.ind1 0 0 1 1 2 species1.ind1 0 0 2 1 3 species1.ind2 0 0 1 1 4 species1.ind3 0 0 2 1 5 species1.ind4 0 0 2 1 6 species1.ind5 0 0 2 1 7 species1.ind6 0 0 1 1 8 species2.ind1 0 0 1 1 9 species2.ind2 0 0 1 1 ...

Output was: testchr38.geno 000000100000000100000010 111101101010010001000101 011001000110100000001100 000001001010000100011000 001001000101101100011000 000000000000000010000000 000000000000000000000001 200000000000000000000000 001101100000000000000000 ...

testchr38.snp chr38_5 38 0.000000 5 C T chr38_6 38 0.000000 6 G A chr38_7 38 0.000000 7 A T chr38_8 38 0.000000 8 T C chr38_10 38 0.000000 10 C T chr38_38 38 0.000000 38 G C chr38_43 38 0.000000 43 C A chr38_242 38 0.000002 242 A G testchr38.ind 1:speciesO.ind1 M Control 2:species1.ind1 F Control 3:species1.ind2 M Control 4:species1.ind3 F Control 5:species1.ind4 F Control 6:species1.ind5 F Control 7:species1.ind6 M Control 8:species2.ind1 M Control 9:species2.ind2 M Control ...

I am trying to see admixture from a species2 into a specific population of species1 (pop1) vs another (species1, pop2). I have speciesO as outgroup. species1.ind1 and .ind2 belong to pop1, species1.ind3 ind4 ind5 and ind6 belong to pop2.

I modified the .ind file to work with pops: ind1 M speciesO ind1 F species1.POP1 ind2 M species1.POP1 ind3 F species1.POP2 ind4 F species1.POP2 ind5 F species1.POP2 ind6 M species1.POP2 ind1 M species2 ind2 M species2 ...

And set up a .popfile with pops (test of contribution of species.POP2 to species2 also included): speciesO species2 species1.POP1 species1,POP2 speciesO species1.POP2 species2 species1.POP1

.par file is: genotypename: testchr38.geno snpname: testchr38.snp indivname: testchr38.modified.ind popfilename: popfile.divided.pops numchrom: 38

when I run qpDstat -p dstat.testchr38.par: THE INPUT PARAMETERS

PARAMETER NAME: VALUE

genotypename: testchr38.geno snpname: testchr38.snp indivname: testchr38.modified.ind popfilename: popfile.divided.groups numchrom: 38 qpDstat version: 755

number of quadruples 3 0 speciesO 1 1 species2 1 2 species1.POP1 4 3 species1.POP2 2 jackknife block size: 0.050 snps: 27956 indivs: 8 number of blocks for jackknife: 1 nrows, ncols: 8 27956 result: speciesO species2 species1.POP1 species1.POP2 0.0000 0.000 0 0 0 result: speciesO species1.POP1 species2 species1.POP2 0.0000 0.000 0 0 0

end of qpDstat: 0.306 seconds cpu 21.373 Mbytes in use


What could be failing?

  1. There are some additional species in the .bed and .bim not considered for ABBA-BABA
  2. I am using 38 chromosomes
  3. Original naming of the individuals when calling .bed and .bim in PLINK were species1.ind1, species1.ind2, species2.ind1, speciesO.ind1 ...

Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/DReichLab/AdmixTools/issues/57?email_source=notifications&email_token=AEE77B7OYUMSUUHS67HI5V3QN5FBFA5CNFSM4I7OZIM2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HQ7E73A, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEE77B4AXGU3FIGA7V7OWNLQN5FBFANCNFSM4I7OZIMQ .

carlos-sarabia commented 5 years ago

Hi, thanks for the quick answer. In fact I had modified the .ind file to have the following information:

ind1 M speciesO ind1 F species1.POP1 ind2 M species1.POP1 ind3 F species1.POP2 ind4 F species1.POP2 ind5 F species1.POP2 ind6 M species1.POP2 ind1 M species2 ind2 M species2 ...

The problem arose after working with this new .ind file.

ariloytynoja commented 3 years ago

I had a similar issue with simulated data. As I was just testing the approach, I only simulated 20e6 sites, giving genetic length of 0.2. I run qpDstat for the data and got all zeros, including the last column specifying the number of sites in the input.

After hours of searching for possible formatting errors, I looked into the source code. I don't remember details but I think that the program requires at least five jackknife blocks to work and thus genetic length (the third column in *.snp file) of at least 0.25. If the genetic length is below that, the computation is skipped and only zeros are outputted.

I'm not sure if this feature is documented somewhere. It would be nice if a warning was given when the data are too short.

bumblenick commented 3 years ago

If there is a problem with inadequate error messages send me the logfile; the present thread is too vague to be actionable.

Nick

On Thu, Feb 18, 2021 at 6:57 AM Ari Löytynoja notifications@github.com wrote:

I had a similar issue with simulated data. As I was just testing the approach, I only simulated 20e6 sites, giving genetic length of 0.2. I run qpDstat for the data and got all zeros, including the last column specifying the number of sites in the input.

After hours of searching for possible formatting errors, I looked into the source code. I don't remember details but I think that the program requires at least five jackknife blocks to work and thus genetic length (the third column in *.snp file) of at least 0.25. If the genetic length is below that, the computation is skipped and only zeros are outputted.

I'm not sure if this feature is documented somewhere. It would be nice if a warning was given when the data are too short.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DReichLab/AdmixTools/issues/57#issuecomment-781292616, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEE77BZ7CBBI2R4NEUTK7LTS7T6C7ANCNFSM4I7OZIMQ .

ariloytynoja commented 3 years ago

Hi Nick, My interpretation may be wrong. I simulated data with msprime and qpDstat failed until I increased the length of the simulated sequence. Below I've analysed a dataset that works (based on a 50 Mbp sequence) and a subset (first 100k SNPs) that fails. When I multiply the genetic distances by three, the subset also works. I concluded that the genetic distances make a difference (through the number of jackknifing blocks).

Regards, Ari

[lx8:test]$ head -100000 data.snp > data.snp2
[lx8:test]$ head -100000 data.geno > data.geno2 

[lx8:test]$ tail -n1 data.snp*
==> data.snp <==
          1:49999975     1        0.500000        49999975 G T

==> data.snp2 <==
          1:18688448     1        0.186884        18688448 C T

[lx8:test]$ wc  data.*
  271697   271697 38309277 data.geno
  100000   100000 14100000 data.geno2
     140      420     1990 data.ind
  271697  1630182 17116911 data.snp
  100000   600000  6300000 data.snp2
  743534  2602299 75828178 total

[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt 
../AdmixTools/bin/qpDstat: parameter file: params.txt
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp
genotypename: data.geno
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
  0                 pop4   20
  1                 pop3   20
  2                 pop0   20
  3                 pop1   20
  4                 pop2   20
  5                 pop6   20
jackknife block size:     0.050
snps: 271697  indivs: 120
number of blocks for block jackknife: 11
nrows, ncols: 120 271697
result:       pop4       pop3       pop0       pop6     -0.1492    -14.794    5427   7329 271697 
result:       pop4       pop3       pop1       pop6     -0.1925    -12.247    5721   8446 271697 
result:       pop4       pop3       pop2       pop6     -0.2105    -14.193    5571   8539 271697 
##end of qpDstat:        8.058 seconds cpu      286.966 Mbytes in use

[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt2 
../AdmixTools/bin/qpDstat: parameter file: params.txt2
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp2
genotypename: data.geno2
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
  0                 pop4   20
  1                 pop3   20
  2                 pop0   20
  3                 pop1   20
  4                 pop2   20
  5                 pop6   20
jackknife block size:     0.050
snps: 100000  indivs: 120
number of blocks for block jackknife: 5
nrows, ncols: 120 100000
result:       pop4       pop3       pop0       pop6      0.0000      0.000       0      0      0 
result:       pop4       pop3       pop1       pop6      0.0000      0.000       0      0      0 
result:       pop4       pop3       pop2       pop6      0.0000      0.000       0      0      0 
##end of qpDstat:        2.971 seconds cpu      105.652 Mbytes in use

[lx8:test]$ awk '{print $1,$2,3*$3,$4,$5,$6}' data.snp2 > data.snp3

[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt3
../AdmixTools/bin/qpDstat: parameter file: params.txt3
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp3
genotypename: data.geno2
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
  0                 pop4   20
  1                 pop3   20
  2                 pop0   20
  3                 pop1   20
  4                 pop2   20
  5                 pop6   20
jackknife block size:     0.050
snps: 100000  indivs: 120
number of blocks for block jackknife: 13
nrows, ncols: 120 100000
result:       pop4       pop3       pop0       pop6     -0.1494     -9.959    1987   2684 100000 
result:       pop4       pop3       pop1       pop6     -0.1954     -6.903    2133   3168 100000 
result:       pop4       pop3       pop2       pop6     -0.2165     -9.015    2075   3220 100000 
##end of qpDstat:        3.003 seconds cpu      105.652 Mbytes in use
bumblenick commented 3 years ago

Ari was right. For qpDstat you need at least 5 blocks with non-zero ABBA/BABA counts. If this is not true the program returns zero counts silently, which is bad. A new release (out real soon now) will give a decent error message.

Nick

On Thu, Feb 18, 2021 at 1:20 PM Ari Löytynoja notifications@github.com wrote:

Hi Nick, My interpretation may be wrong. I simulated data with msprime and qpDstat failed until I increased the length of the simulated sequence. Below I've analysed a dataset that works (based on a 50 Mbp sequence) and a subset (first 100k SNPs) that fails. When I multiply the genetic distances by three, the subset also works. I concluded that the genetic distances make a difference (through the number of jackknifing blocks).

Regards, Ari

[lx8:test]$ head -100000 data.snp > data.snp2 [lx8:test]$ head -100000 data.geno > data.geno2

[lx8:test]$ tail -n1 data.snp* ==> data.snp <== 1:49999975 1 0.500000 49999975 G T

==> data.snp2 <== 1:18688448 1 0.186884 18688448 C T

[lx8:test]$ wc data.* 271697 271697 38309277 data.geno 100000 100000 14100000 data.geno2 140 420 1990 data.ind 271697 1630182 17116911 data.snp 100000 600000 6300000 data.snp2 743534 2602299 75828178 total

[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt ../AdmixTools/bin/qpDstat: parameter file: params.txt

THE INPUT PARAMETERS

PARAMETER NAME: VALUE

indivname: data.ind snpname: data.snp genotypename: data.geno popfilename: listD.txt

qpDstat version: 970

number of quadruples 3 0 pop4 20 1 pop3 20 2 pop0 20 3 pop1 20 4 pop2 20 5 pop6 20 jackknife block size: 0.050 snps: 271697 indivs: 120 number of blocks for block jackknife: 11 nrows, ncols: 120 271697 result: pop4 pop3 pop0 pop6 -0.1492 -14.794 5427 7329 271697 result: pop4 pop3 pop1 pop6 -0.1925 -12.247 5721 8446 271697 result: pop4 pop3 pop2 pop6 -0.2105 -14.193 5571 8539 271697

end of qpDstat: 8.058 seconds cpu 286.966 Mbytes in use

[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt2 ../AdmixTools/bin/qpDstat: parameter file: params.txt2

THE INPUT PARAMETERS

PARAMETER NAME: VALUE

indivname: data.ind snpname: data.snp2 genotypename: data.geno2 popfilename: listD.txt

qpDstat version: 970

number of quadruples 3 0 pop4 20 1 pop3 20 2 pop0 20 3 pop1 20 4 pop2 20 5 pop6 20 jackknife block size: 0.050 snps: 100000 indivs: 120 number of blocks for block jackknife: 5 nrows, ncols: 120 100000 result: pop4 pop3 pop0 pop6 0.0000 0.000 0 0 0 result: pop4 pop3 pop1 pop6 0.0000 0.000 0 0 0 result: pop4 pop3 pop2 pop6 0.0000 0.000 0 0 0

end of qpDstat: 2.971 seconds cpu 105.652 Mbytes in use

[lx8:test]$ awk '{print $1,$2,3*$3,$4,$5,$6}' data.snp2 > data.snp3

[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt3 ../AdmixTools/bin/qpDstat: parameter file: params.txt3

THE INPUT PARAMETERS

PARAMETER NAME: VALUE

indivname: data.ind snpname: data.snp3 genotypename: data.geno2 popfilename: listD.txt

qpDstat version: 970

number of quadruples 3 0 pop4 20 1 pop3 20 2 pop0 20 3 pop1 20 4 pop2 20 5 pop6 20 jackknife block size: 0.050 snps: 100000 indivs: 120 number of blocks for block jackknife: 13 nrows, ncols: 120 100000 result: pop4 pop3 pop0 pop6 -0.1494 -9.959 1987 2684 100000 result: pop4 pop3 pop1 pop6 -0.1954 -6.903 2133 3168 100000 result: pop4 pop3 pop2 pop6 -0.2165 -9.015 2075 3220 100000

end of qpDstat: 3.003 seconds cpu 105.652 Mbytes in use

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DReichLab/AdmixTools/issues/57#issuecomment-781542797, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEE77BYBTMW3MKFRR5NFFB3S7VLAHANCNFSM4I7OZIMQ .