DReichLab / EIG

Eigen tools by Nick Patterson and Alkes Price lab
Other
180 stars 60 forks source link

mergeit not working as expected #64

Closed AdrienOliva closed 3 years ago

AdrienOliva commented 3 years ago

Hi, I have noticed that behaviour using different datasets but to debug it, in this example, I'll show you the results of using mergeit on the same datasets and merge it with itself (only changing the name of the individuals).

My datasets are as follow:

.snp file for both datasets

Affx-34462167   1       0.0     567137  X       X
rs3094315       1       0.0     752566  G       A
rs12562034      1       0.0     768448  X       X
rs12124819      1       0.0     776546  A       G
1_777122        1       0.0     777122  A       T
1_832756        1       0.0     832756  T       G
rs28765502      1       0.0     832918  T       C
1_834832        1       0.0     834832  G       C
1_838555        1       0.0     838555  C       A
1_838931        1       0.0     838931  A       C
rs7419119       1       0.0     842013  T       G
1_843249        1       0.0     843249  X       X
1_843251        1       0.0     843251  X       X
1_846543        1       0.0     846543  G       T
rs950122        1       0.0     846864  G       C
1_851204        1       0.0     851204  G       C
1_852443        1       0.0     852443  A       C
1_852964        1       0.0     852964  T       G
1_853954        1       0.0     853954  C       A
1_854978        1       0.0     854978  A       C

.geno file for both datasets

90990
90002
90999
02992
90992
92200
92090
92000
92900
92922
22090
92202
99229
92922
22220
99020
99992
90900
90922
90022

.ind for dataset1

BWA_Goyet       U       BWA
BWA_Malta       U       BWA
BWA_Vestonice   U       BWA
BWA_Tianyuan    U       BWA
BWA_Yana        U       BWA

.ind for dataset2

BWA2_Goyet      U       BWA
BWA2_Malta      U       BWA
BWA2_Vestonice  U       BWA
BWA2_Tianyuan   U       BWA
BWA2_Yana       U       BWA

Then I merge those 2 dataset using this script:

mergeit -p <(echo "geno1: ${in1}.geno
snp1:  ${in1}.snp
ind1:  ${in1}.ind
geno2: ${in2}.geno
snp2:  ${in2}.snp
ind2:  ${in2}.ind
genooutfilename:   ${out}.geno
snpoutfilename:    ${out}.snp
indoutfilename:    ${out}.ind
outputformat:  PACKEDANCESTRYMAP
docheck: YES
strandcheck:  YES
hashcheck: NO")

And here is output .snp file:

rs3094315     1        0.007526          752566 G A
rs12124819     1        0.007765          776546 A G
1_832756     1        0.008328          832756 T G
rs28765502     1        0.008329          832918 T C
1_838555     1        0.008386          838555 C A
1_838931     1        0.008389          838931 A C
rs7419119     1        0.008420          842013 T G
1_846543     1        0.008465          846543 G T
1_852443     1        0.008524          852443 A C
1_852964     1        0.008530          852964 T G
1_853954     1        0.008540          853954 C A
1_854978     1        0.008550          854978 A C

As expected the resulted .snp file does not have the snps with a 'X' allele. However if you look at position '777122' or '834832', they are not merging even thought they should.

I found this behaviour using:

Do you know where this can come from ?

Thanks a lot, Adrien

kristjanmoore commented 3 years ago

See the documentation under CONVERTF/:

strandcheck: If set to YES, then check that the allele strand is the same in both data sets -- if they are different (e.g A/C vs. T/G), then flip genotype data appropriately. (Note that if strandcheck is set to YES, then all A/T and C/G SNPs will be removed because it is impossible to know whether the allele strand is the same in both data sets. On the other hand, if strandcheck is set to NO, then A/T and C/G SNPs will be retained since it is assumed that both data sets are on the same strand.)

bumblenick commented 3 years ago

You are merging with strandcheck: YES and A/Tor C/G snps will be dropped as mergeit cannot determine if the alleles are ion the same strand..

This is a feature not a bug. I

On Mon, Apr 12, 2021 at 8:46 AM Adrien Oliva @.***> wrote:

Hi, I have noticed that behaviour using different datasets but to debug it, in this example, I'll show you the results of using mergeit on the same datasets and merge it with itself (only changing the name of the individuals).

My datasets are as follow:

.snp file for both datasets

Affx-34462167 1 0.0 567137 X X

rs3094315 1 0.0 752566 G A

rs12562034 1 0.0 768448 X X

rs12124819 1 0.0 776546 A G

1_777122 1 0.0 777122 A T

1_832756 1 0.0 832756 T G

rs28765502 1 0.0 832918 T C

1_834832 1 0.0 834832 G C

1_838555 1 0.0 838555 C A

1_838931 1 0.0 838931 A C

rs7419119 1 0.0 842013 T G

1_843249 1 0.0 843249 X X

1_843251 1 0.0 843251 X X

1_846543 1 0.0 846543 G T

rs950122 1 0.0 846864 G C

1_851204 1 0.0 851204 G C

1_852443 1 0.0 852443 A C

1_852964 1 0.0 852964 T G

1_853954 1 0.0 853954 C A

1_854978 1 0.0 854978 A C

.geno file for both datasets

90990

90002

90999

02992

90992

92200

92090

92000

92900

92922

22090

92202

99229

92922

22220

99020

99992

90900

90922

90022

.ind for dataset1

BWA_Goyet U BWA

BWA_Malta U BWA

BWA_Vestonice U BWA

BWA_Tianyuan U BWA

BWA_Yana U BWA

.ind for dataset2

BWA2_Goyet U BWA

BWA2_Malta U BWA

BWA2_Vestonice U BWA

BWA2_Tianyuan U BWA

BWA2_Yana U BWA

Then I merge those 2 dataset using this script:

mergeit -p <(echo "geno1: ${in1}.geno

snp1: ${in1}.snp

ind1: ${in1}.ind

geno2: ${in2}.geno

snp2: ${in2}.snp

ind2: ${in2}.ind

genooutfilename: ${out}.geno

snpoutfilename: ${out}.snp

indoutfilename: ${out}.ind

outputformat: PACKEDANCESTRYMAP

docheck: YES

strandcheck: YES

hashcheck: NO")

And here is output .snp file:

rs3094315 1 0.007526 752566 G A

rs12124819 1 0.007765 776546 A G

1_832756 1 0.008328 832756 T G

rs28765502 1 0.008329 832918 T C

1_838555 1 0.008386 838555 C A

1_838931 1 0.008389 838931 A C

rs7419119 1 0.008420 842013 T G

1_846543 1 0.008465 846543 G T

1_852443 1 0.008524 852443 A C

1_852964 1 0.008530 852964 T G

1_853954 1 0.008540 853954 C A

1_854978 1 0.008550 854978 A C

As expected the resulted .snp file does not have the snps with a 'X' allele. However if you look at position '777122' or '834832', they are not merging even thought they should.

I found this behaviour using:

  • Unix system / Eigensoft installer through Conda / mergeit version “2450”
  • MacOS system / Eigensoft installed through GitHub / mergeit version "2600"

Do you know where this can come from ?

Thanks a lot, Adrien

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/DReichLab/EIG/issues/64, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEE77B5NPOVCNB5GUKBJ75TTILTQ5ANCNFSM42ZHDC6A .