Open carlos-sarabia opened 5 years ago
The last field of your .ind file should be the population name. You have "Control" Write a script to fix this.
Nick
On Thu, Oct 10, 2019 at 11:47 AM Carlos Sarabia notifications@github.com wrote:
Hi, I am trying to run a test on D-stat with a single chromosome (chr38) of dog using an output of PLINK. I am only considering autosomes.
I first ran convertf in this .par file: genotypename: DATA/chr38.bed snpname: DATA/chr38.bim indivname: DATA/chr38.pedind outputformat: EIGENSTRAT genotypeoutname: testchr38.geno snpoutname: testchr38.snp indivoutname: testchr38.ind numchrom: 38
Input pedind: 1 speciesO.ind1 0 0 1 1 2 species1.ind1 0 0 2 1 3 species1.ind2 0 0 1 1 4 species1.ind3 0 0 2 1 5 species1.ind4 0 0 2 1 6 species1.ind5 0 0 2 1 7 species1.ind6 0 0 1 1 8 species2.ind1 0 0 1 1 9 species2.ind2 0 0 1 1 ...
Output was: testchr38.geno 000000100000000100000010 111101101010010001000101 011001000110100000001100 000001001010000100011000 001001000101101100011000 000000000000000010000000 000000000000000000000001 200000000000000000000000 001101100000000000000000 ...
testchr38.snp chr38_5 38 0.000000 5 C T chr38_6 38 0.000000 6 G A chr38_7 38 0.000000 7 A T chr38_8 38 0.000000 8 T C chr38_10 38 0.000000 10 C T chr38_38 38 0.000000 38 G C chr38_43 38 0.000000 43 C A chr38_242 38 0.000002 242 A G testchr38.ind 1:speciesO.ind1 M Control 2:species1.ind1 F Control 3:species1.ind2 M Control 4:species1.ind3 F Control 5:species1.ind4 F Control 6:species1.ind5 F Control 7:species1.ind6 M Control 8:species2.ind1 M Control 9:species2.ind2 M Control ...
I am trying to see admixture from a species2 into a specific population of species1 (pop1) vs another (species1, pop2). I have speciesO as outgroup. species1.ind1 and .ind2 belong to pop1, species1.ind3 ind4 ind5 and ind6 belong to pop2.
I modified the .ind file to work with pops: ind1 M speciesO ind1 F species1.POP1 ind2 M species1.POP1 ind3 F species1.POP2 ind4 F species1.POP2 ind5 F species1.POP2 ind6 M species1.POP2 ind1 M species2 ind2 M species2 ...
And set up a .popfile with pops (test of contribution of species.POP2 to species2 also included): speciesO species2 species1.POP1 species1,POP2 speciesO species1.POP2 species2 species1.POP1
.par file is: genotypename: testchr38.geno snpname: testchr38.snp indivname: testchr38.modified.ind popfilename: popfile.divided.pops numchrom: 38
when I run qpDstat -p dstat.testchr38.par: THE INPUT PARAMETERS
PARAMETER NAME: VALUE
genotypename: testchr38.geno snpname: testchr38.snp indivname: testchr38.modified.ind popfilename: popfile.divided.groups numchrom: 38 qpDstat version: 755
number of quadruples 3 0 speciesO 1 1 species2 1 2 species1.POP1 4 3 species1.POP2 2 jackknife block size: 0.050 snps: 27956 indivs: 8 number of blocks for jackknife: 1 nrows, ncols: 8 27956 result: speciesO species2 species1.POP1 species1.POP2 0.0000 0.000 0 0 0 result: speciesO species1.POP1 species2 species1.POP2 0.0000 0.000 0 0 0
end of qpDstat: 0.306 seconds cpu 21.373 Mbytes in use
What could be failing?
- There are some additional species in the .bed and .bim not considered for ABBA-BABA
- I am using 38 chromosomes
- Original naming of the individuals when calling .bed and .bim in PLINK were species1.ind1, species1.ind2, species2.ind1, speciesO.ind1 ...
Thanks!
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/DReichLab/AdmixTools/issues/57?email_source=notifications&email_token=AEE77B7OYUMSUUHS67HI5V3QN5FBFA5CNFSM4I7OZIM2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HQ7E73A, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEE77B4AXGU3FIGA7V7OWNLQN5FBFANCNFSM4I7OZIMQ .
Hi, thanks for the quick answer. In fact I had modified the .ind file to have the following information:
ind1 M speciesO ind1 F species1.POP1 ind2 M species1.POP1 ind3 F species1.POP2 ind4 F species1.POP2 ind5 F species1.POP2 ind6 M species1.POP2 ind1 M species2 ind2 M species2 ...
The problem arose after working with this new .ind file.
I had a similar issue with simulated data. As I was just testing the approach, I only simulated 20e6 sites, giving genetic length of 0.2. I run qpDstat for the data and got all zeros, including the last column specifying the number of sites in the input.
After hours of searching for possible formatting errors, I looked into the source code. I don't remember details but I think that the program requires at least five jackknife blocks to work and thus genetic length (the third column in *.snp file) of at least 0.25. If the genetic length is below that, the computation is skipped and only zeros are outputted.
I'm not sure if this feature is documented somewhere. It would be nice if a warning was given when the data are too short.
If there is a problem with inadequate error messages send me the logfile; the present thread is too vague to be actionable.
Nick
On Thu, Feb 18, 2021 at 6:57 AM Ari Löytynoja notifications@github.com wrote:
I had a similar issue with simulated data. As I was just testing the approach, I only simulated 20e6 sites, giving genetic length of 0.2. I run qpDstat for the data and got all zeros, including the last column specifying the number of sites in the input.
After hours of searching for possible formatting errors, I looked into the source code. I don't remember details but I think that the program requires at least five jackknife blocks to work and thus genetic length (the third column in *.snp file) of at least 0.25. If the genetic length is below that, the computation is skipped and only zeros are outputted.
I'm not sure if this feature is documented somewhere. It would be nice if a warning was given when the data are too short.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DReichLab/AdmixTools/issues/57#issuecomment-781292616, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEE77BZ7CBBI2R4NEUTK7LTS7T6C7ANCNFSM4I7OZIMQ .
Hi Nick, My interpretation may be wrong. I simulated data with msprime and qpDstat failed until I increased the length of the simulated sequence. Below I've analysed a dataset that works (based on a 50 Mbp sequence) and a subset (first 100k SNPs) that fails. When I multiply the genetic distances by three, the subset also works. I concluded that the genetic distances make a difference (through the number of jackknifing blocks).
Regards, Ari
[lx8:test]$ head -100000 data.snp > data.snp2
[lx8:test]$ head -100000 data.geno > data.geno2
[lx8:test]$ tail -n1 data.snp*
==> data.snp <==
1:49999975 1 0.500000 49999975 G T
==> data.snp2 <==
1:18688448 1 0.186884 18688448 C T
[lx8:test]$ wc data.*
271697 271697 38309277 data.geno
100000 100000 14100000 data.geno2
140 420 1990 data.ind
271697 1630182 17116911 data.snp
100000 600000 6300000 data.snp2
743534 2602299 75828178 total
[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt
../AdmixTools/bin/qpDstat: parameter file: params.txt
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp
genotypename: data.geno
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
0 pop4 20
1 pop3 20
2 pop0 20
3 pop1 20
4 pop2 20
5 pop6 20
jackknife block size: 0.050
snps: 271697 indivs: 120
number of blocks for block jackknife: 11
nrows, ncols: 120 271697
result: pop4 pop3 pop0 pop6 -0.1492 -14.794 5427 7329 271697
result: pop4 pop3 pop1 pop6 -0.1925 -12.247 5721 8446 271697
result: pop4 pop3 pop2 pop6 -0.2105 -14.193 5571 8539 271697
##end of qpDstat: 8.058 seconds cpu 286.966 Mbytes in use
[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt2
../AdmixTools/bin/qpDstat: parameter file: params.txt2
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp2
genotypename: data.geno2
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
0 pop4 20
1 pop3 20
2 pop0 20
3 pop1 20
4 pop2 20
5 pop6 20
jackknife block size: 0.050
snps: 100000 indivs: 120
number of blocks for block jackknife: 5
nrows, ncols: 120 100000
result: pop4 pop3 pop0 pop6 0.0000 0.000 0 0 0
result: pop4 pop3 pop1 pop6 0.0000 0.000 0 0 0
result: pop4 pop3 pop2 pop6 0.0000 0.000 0 0 0
##end of qpDstat: 2.971 seconds cpu 105.652 Mbytes in use
[lx8:test]$ awk '{print $1,$2,3*$3,$4,$5,$6}' data.snp2 > data.snp3
[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt3
../AdmixTools/bin/qpDstat: parameter file: params.txt3
### THE INPUT PARAMETERS
##PARAMETER NAME: VALUE
indivname: data.ind
snpname: data.snp3
genotypename: data.geno2
popfilename: listD.txt
## qpDstat version: 970
number of quadruples 3
0 pop4 20
1 pop3 20
2 pop0 20
3 pop1 20
4 pop2 20
5 pop6 20
jackknife block size: 0.050
snps: 100000 indivs: 120
number of blocks for block jackknife: 13
nrows, ncols: 120 100000
result: pop4 pop3 pop0 pop6 -0.1494 -9.959 1987 2684 100000
result: pop4 pop3 pop1 pop6 -0.1954 -6.903 2133 3168 100000
result: pop4 pop3 pop2 pop6 -0.2165 -9.015 2075 3220 100000
##end of qpDstat: 3.003 seconds cpu 105.652 Mbytes in use
Ari was right. For qpDstat you need at least 5 blocks with non-zero ABBA/BABA counts. If this is not true the program returns zero counts silently, which is bad. A new release (out real soon now) will give a decent error message.
Nick
On Thu, Feb 18, 2021 at 1:20 PM Ari Löytynoja notifications@github.com wrote:
Hi Nick, My interpretation may be wrong. I simulated data with msprime and qpDstat failed until I increased the length of the simulated sequence. Below I've analysed a dataset that works (based on a 50 Mbp sequence) and a subset (first 100k SNPs) that fails. When I multiply the genetic distances by three, the subset also works. I concluded that the genetic distances make a difference (through the number of jackknifing blocks).
Regards, Ari
[lx8:test]$ head -100000 data.snp > data.snp2 [lx8:test]$ head -100000 data.geno > data.geno2
[lx8:test]$ tail -n1 data.snp* ==> data.snp <== 1:49999975 1 0.500000 49999975 G T
==> data.snp2 <== 1:18688448 1 0.186884 18688448 C T
[lx8:test]$ wc data.* 271697 271697 38309277 data.geno 100000 100000 14100000 data.geno2 140 420 1990 data.ind 271697 1630182 17116911 data.snp 100000 600000 6300000 data.snp2 743534 2602299 75828178 total
[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt ../AdmixTools/bin/qpDstat: parameter file: params.txt
THE INPUT PARAMETERS
PARAMETER NAME: VALUE
indivname: data.ind snpname: data.snp genotypename: data.geno popfilename: listD.txt
qpDstat version: 970
number of quadruples 3 0 pop4 20 1 pop3 20 2 pop0 20 3 pop1 20 4 pop2 20 5 pop6 20 jackknife block size: 0.050 snps: 271697 indivs: 120 number of blocks for block jackknife: 11 nrows, ncols: 120 271697 result: pop4 pop3 pop0 pop6 -0.1492 -14.794 5427 7329 271697 result: pop4 pop3 pop1 pop6 -0.1925 -12.247 5721 8446 271697 result: pop4 pop3 pop2 pop6 -0.2105 -14.193 5571 8539 271697
end of qpDstat: 8.058 seconds cpu 286.966 Mbytes in use
[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt2 ../AdmixTools/bin/qpDstat: parameter file: params.txt2
THE INPUT PARAMETERS
PARAMETER NAME: VALUE
indivname: data.ind snpname: data.snp2 genotypename: data.geno2 popfilename: listD.txt
qpDstat version: 970
number of quadruples 3 0 pop4 20 1 pop3 20 2 pop0 20 3 pop1 20 4 pop2 20 5 pop6 20 jackknife block size: 0.050 snps: 100000 indivs: 120 number of blocks for block jackknife: 5 nrows, ncols: 120 100000 result: pop4 pop3 pop0 pop6 0.0000 0.000 0 0 0 result: pop4 pop3 pop1 pop6 0.0000 0.000 0 0 0 result: pop4 pop3 pop2 pop6 0.0000 0.000 0 0 0
end of qpDstat: 2.971 seconds cpu 105.652 Mbytes in use
[lx8:test]$ awk '{print $1,$2,3*$3,$4,$5,$6}' data.snp2 > data.snp3
[lx8:test]$ ../AdmixTools/bin/qpDstat -p params.txt3 ../AdmixTools/bin/qpDstat: parameter file: params.txt3
THE INPUT PARAMETERS
PARAMETER NAME: VALUE
indivname: data.ind snpname: data.snp3 genotypename: data.geno2 popfilename: listD.txt
qpDstat version: 970
number of quadruples 3 0 pop4 20 1 pop3 20 2 pop0 20 3 pop1 20 4 pop2 20 5 pop6 20 jackknife block size: 0.050 snps: 100000 indivs: 120 number of blocks for block jackknife: 13 nrows, ncols: 120 100000 result: pop4 pop3 pop0 pop6 -0.1494 -9.959 1987 2684 100000 result: pop4 pop3 pop1 pop6 -0.1954 -6.903 2133 3168 100000 result: pop4 pop3 pop2 pop6 -0.2165 -9.015 2075 3220 100000
end of qpDstat: 3.003 seconds cpu 105.652 Mbytes in use
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/DReichLab/AdmixTools/issues/57#issuecomment-781542797, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEE77BYBTMW3MKFRR5NFFB3S7VLAHANCNFSM4I7OZIMQ .
Hi, I am trying to run a test on D-stat with a single chromosome (chr38) of dog using an output of PLINK. I am only considering autosomes.
I first ran convertf in this .par file: genotypename: DATA/chr38.bed snpname: DATA/chr38.bim indivname: DATA/chr38.pedind outputformat: EIGENSTRAT genotypeoutname: testchr38.geno snpoutname: testchr38.snp indivoutname: testchr38.ind numchrom: 38
Input pedind: 1 speciesO.ind1 0 0 1 1 2 species1.ind1 0 0 2 1 3 species1.ind2 0 0 1 1 4 species1.ind3 0 0 2 1 5 species1.ind4 0 0 2 1 6 species1.ind5 0 0 2 1 7 species1.ind6 0 0 1 1 8 species2.ind1 0 0 1 1 9 species2.ind2 0 0 1 1 ...
Output was: testchr38.geno 000000100000000100000010 111101101010010001000101 011001000110100000001100 000001001010000100011000 001001000101101100011000 000000000000000010000000 000000000000000000000001 200000000000000000000000 001101100000000000000000 ...
testchr38.snp chr38_5 38 0.000000 5 C T chr38_6 38 0.000000 6 G A chr38_7 38 0.000000 7 A T chr38_8 38 0.000000 8 T C chr38_10 38 0.000000 10 C T chr38_38 38 0.000000 38 G C chr38_43 38 0.000000 43 C A chr38_242 38 0.000002 242 A G
testchr38.ind 1:speciesO.ind1 M Control 2:species1.ind1 F Control 3:species1.ind2 M Control 4:species1.ind3 F Control 5:species1.ind4 F Control 6:species1.ind5 F Control 7:species1.ind6 M Control 8:species2.ind1 M Control 9:species2.ind2 M Control ...
I am trying to see admixture from a species2 into a specific population of species1 (pop1) vs another (species1, pop2). I have speciesO as outgroup. species1.ind1 and .ind2 belong to pop1, species1.ind3 ind4 ind5 and ind6 belong to pop2.
I modified the .ind file to work with pops: ind1 M speciesO ind1 F species1.POP1 ind2 M species1.POP1 ind3 F species1.POP2 ind4 F species1.POP2 ind5 F species1.POP2 ind6 M species1.POP2 ind1 M species2 ind2 M species2 ...
And set up a .popfile with pops (test of contribution of species.POP2 to species2 also included): speciesO species2 species1.POP1 species1,POP2 speciesO species1.POP2 species2 species1.POP1
.par file is: genotypename: testchr38.geno snpname: testchr38.snp indivname: testchr38.modified.ind popfilename: popfile.divided.pops numchrom: 38
when I run qpDstat -p dstat.testchr38.par:
THE INPUT PARAMETERS
PARAMETER NAME: VALUE
genotypename: testchr38.geno snpname: testchr38.snp indivname: testchr38.modified.ind popfilename: popfile.divided.groups numchrom: 38
qpDstat version: 755
number of quadruples 3 0 speciesO 1 1 species2 1 2 species1.POP1 4 3 species1.POP2 2 jackknife block size: 0.050 snps: 27956 indivs: 8 number of blocks for jackknife: 1 nrows, ncols: 8 27956 result: speciesO species2 species1.POP1 species1.POP2 0.0000 0.000 0 0 0 result: speciesO species1.POP1 species2 species1.POP2 0.0000 0.000 0 0 0
end of qpDstat: 0.306 seconds cpu 21.373 Mbytes in use
What could be failing?
Thanks!