Open dereneaton opened 7 years ago
I am also having a similar issue but for SE RAD data. For some reason, I have heterozygous genotypes in my VCF with base counts for only one nucleotide:
0/1:34:0,34,0,0
The proportion of these weird genotypes is rather small on this particular VCF, but I don't know if this site should be considered homozygous or heterozygous.
I can confirm this, on Single end GBS data. I have found a BLAST hit to a mitochondrial sequence, so I was naturally expecting no heterozygous on this locus. However, since there were some heterozygous in there, I went to the VCF file and found some "legitimate" heterozygous (individuals 11 and 13, for example) and some inconsistently called heterozygous genotypes (the last individual, for example):
18107 30 . C T 13 PASS . GT:DP:CATG 0/0:8:8,0,0,0 0/0:41:41,0,0,0 0/0:105:104,0,0,1 0/0:60:60,0,0,0 1/1:45:0,0,0,45 0/0:9:9,0,0,0 0/0:18:18,0,0,0 0/0:62:62,0,0,0 0/0:39:39,0,0,0 0/0:152:152,0,0,0 1/0:74:27,0,47,0 ./.:0:0,0,0,0 1/0:56:32,0,24,0 1/1:15:15,0,0,0 1/0:66:66,0,0,0 1/1:64:1,0,62,1 0/0:82:82,0,0,0 ./.:0:0,0,0,0 0/0:91:91,0,0,0 0/0:19:0,0,0,19 0/0:12:12,0,0,0 1/0:96:96,0,0,0 0/0:10:10,0,0,0 0/0:37:37,0,0,0 0/0:35:35,0,0,0 0/0:67:67,0,0,0 1/0:10:10,0,0,0 0/0:14:0,0,0,14 0/0:17:0,0,0,17 1/1:25:25,0,0,0 ./.:0:0,0,0,0 ./.:0:0,0,0,0 0/0:54:0,54,0,0 1/1:36:0,0,35,1 0/0:58:58,0,0,0 0/0:44:44,0,0,0 0/0:122:122,0,0,0 0/0:55:55,0,0,0 0/0:16:16,0,0,0 0/0:83:83,0,0,0 0/0:83:0,83,0,0 0/0:36:36,0,0,0 0/0:41:41,0,0,0 0/0:29:29,0,0,0 1/0:64:64,0,0,0 0/0:36:36,0,0,0 1/0:94:0,94,0,0 0/0:10:0,0,0,10 0/0:69:69,0,0,0 0/0:79:79,0,0,0 0/0:53:53,0,0,0 1/0:46:46,0,0,0 0/0:45:45,0,0,0 0/0:130:130,0,0,0 0/0:10:10,0,0,0 0/0:16:16,0,0,0 ./.:0:0,0,0,0 0/0:11:11,0,0,0 0/0:9:9,0,0,0 0/0:43:43,0,0,0 0/0:19:19,0,0,0 0/0:43:43,0,0,0 0/0:16:16,0,0,0 0/0:36:36,0,0,0 0/0:45:45,0,0,0 0/0:90:0,89,1,0 1/0:119:119,0,0,0 0/0:20:20,0,0,0 1/0:91:44,0,47,0 0/0:42:42,0,0,0 1/0:157:106,0,50,1 ./.:0:0,0,0,0 0/0:65:65,0,0,0 ./.:0:0,0,0,0 1/0:91:91,0,0,0 ./.:0:0,0,0,0 1/0:64:64,0,0,0 0/0:42:42,0,0,0 0/0:70:0,70,0,0 0/0:42:42,0,0,0 1/0:85:85,0,0,0
Look at the last individual, for example: It is reported as an heterozygous genotype (1/0), but base counts show 85,0,0,0
.
This is unfortunately occurring on approximately 10% of my genotypes - at least this kind of error. I did not spot any reverse error (heterozygous base counts being reported as homozygous), but they might be around.
For reference, here is the corresponding section from the .loci
file.
Arg16 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Arg19 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Arg4 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGAT---
Arg5 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Bul11 ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Bul16 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Bul37 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Bul5 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Cat15 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Cat3 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Cat6 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Cat8 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATG--
Cor11 ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATG--
Cor12 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Cor14 ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGA----
Cor17 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Cor7 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATG--
Haz10 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Haz16 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Haz2 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Haz24 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Haz9 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Ken11 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Ken6 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Ken9 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Lan10 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Lan11 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Lan14 ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Mon12 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Mon13 ACCAAATGTATAGGCCGCAGGAAACATGGTGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Mon26 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Mon9 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Pug10 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Pug12 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Pug5 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Pug7 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Pug8 -CCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Pug9 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sar15 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGA----
Sar16 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sar23 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sar4 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sar7 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sar9 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sic13 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sic4 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sic5 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sin21 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Sin33 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Sin6 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Taz17 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Taz2 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGAT---
Taz8 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Tol17 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATT---------
Tol27 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Tol3 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tol4 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tol6 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Tos12 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tos15 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tos22 -CCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tos3 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tos7 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTG-----
Tos8 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGAT---
Tun1 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGC-
Tun23 ACCAAATGTATAGGCCGCAGGAAACATGGYGCRGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTA--------
Tun6 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Tun9 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var16 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var17 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var19 -CCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var21 ACCAAATGTATAGGCCGCAGGAAACATGGCGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
Var4 ACCAAATGTATAGGCCGCAGGAAACATGGYGCGGTCAGGCAGGGTGTGCTTGAATATGATGCTGGTATTATTGATGCA
// * - |18106|
The .loci
file also shows the last individual as an heterozygote, which is consisten with the vcf genotype. This means that is the error is in the genotype calling and not the base counts, it's coming downstream of this (Im not sure how to post the hdf5
file here).
This was done on ipyrad v0.5.15
.
I just re-ran using ipyrad
0.6.8 and the issue remains...
I have not been able to reproduce this for either SE RAD or PE ddRAD on real data. @StuntsPT Can you dropbox me the hdf5 file? Or better a chunk of the raw data and the params file you used.
@isaacovercast I sent you a link to the hdf5 file via gitter private chat. This is SE GBS data.
tldr;: Yes the genotype counts are wrong in the vcf file, but the genotypes themselves are correct. Heterozygous sites ARE correct even if the counts are wrong.
Got good news and bad news. The good news is the consens.hdf5 file is intact and looking good. The other good news is this means that all vcf files we've been creating up to now are still ok, in terms of the actual genotypes. The bad news is the output of make_vcf must be fscking something and I'm not super familiar with that code so it might take some time to fix. Still not sure exactly what's happening or why his is doing this, but here's the output for this snp for two individuals, from the vcf output above the "-11" individual has the good catg info and the "16" individual has the homozygous looking catg:
>>> io5["catgs"][37200][-11][34]
array([106, 0, 50, 1], dtype=uint32)
>>> io5["catgs"][37200][16][34]
array([47, 0, 19, 0], dtype=uint32)
>>> io5["seqs"][37200][16][34]
'Y'
>>> io5["seqs"][37200][-11][34]
'Y'
The "bad" individual has the incorrect genotype counts (" [66, 0, 0, 0]") in the neighborhood of the bad base:
>>> io5["catgs"][37200][16][30:40]
array([[ 0, 66, 0, 0],
[ 0, 0, 66, 0],
[ 0, 0, 0, 66],
[ 0, 0, 0, 66],
[47, 0, 19, 0],
[ 0, 0, 1, 65],
[66, 0, 0, 0],
[ 0, 0, 0, 66],
[ 0, 0, 0, 66],
[ 0, 0, 66, 0]], dtype=uint32)
A different individual with the bad base has the hetero genotype counts farther away.
>>> catg[82][34:49]
array([[83, 0, 0, 0],
[83, 0, 0, 0],
[83, 0, 0, 0],
[ 0, 0, 83, 0],
[ 0, 0, 0, 83],
[84, 0, 0, 0],
[84, 0, 0, 0],
[ 0, 0, 84, 0],
[ 0, 0, 0, 84],
[ 0, 84, 0, 0],
[84, 0, 0, 0],
[84, 0, 0, 0],
[ 0, 0, 0, 84],
[84, 0, 0, 0],
[ 0, 27, 0, 57]], dtype=uint32)
Another sample with a bad base (the hetero genotype is base 34, recall:
>>> catg[23][34:49]
array([[96, 0, 0, 0],
[97, 0, 0, 0],
[97, 0, 0, 0],
[ 1, 0, 96, 0],
[ 1, 0, 0, 96],
[97, 0, 0, 0],
[97, 0, 0, 0],
[ 0, 0, 97, 0],
[ 0, 0, 0, 97],
[ 0, 97, 1, 0],
[99, 0, 0, 0],
[99, 0, 0, 0],
[ 0, 0, 0, 99],
[99, 0, 0, 0],
[ 0, 43, 0, 56]], dtype=uint32)
~The .loci writer may be acting up too, since the issue is also revealed there...~ Sleep deprivation led to this. Forget it as it made no sense.
I haven't had a chance to look at this yet, but one idea is that the problem may be related to indels, which are imputed into the catg array after step6 aligning. If an indel is not properly imputed then the base counts would not line up properly with the base calls.
Also NB: the actual cluster number for this locus inside the clust.hdf5 file is 37200, and not 18106
as reported in the .loci file, this took me a while to figure out. 32700 is a conspicuously round number, perhaps an optim
edge case? Though my instinct was something about indels as well.
That's great news. It means I can carry on with my analyses! The read counts are not vital, I was just worried that the genotypes could be incorrectly called.
Just for the record, I have provided via private gitter.im message the required sample files to reproduce the issue.
omg how is this still an issue? I had hoped the update to v0.9 had fixed this, but apparently not.
This was reported on gitter and I got the user to share data necessary to reproduce:
RAD_0 6 loc0_pos5 C T 13 PASS NS=114;DP=3165 GT:DP:CATG 0/0:20:20,0,0,0 0/0:44:0,0,44,0 0/0:16:16,0,0,0 0/0:25:0,0,25,0 0/0:21:0,0,21,0 0/0:15:15,0,0,0 0/0:8:0,0,8,0
There appears to be no rhyme or reason to the depth issue. It's not a sample issue because I looked at 2 different samples and in both of them sometimes the catg
s line up with the sequence and sometimes they don't. It's not locus specific because within any given locus I observe samples with both good and bad catg data. It appears unrelated to indels, because I can see sometimes a sample at a locus with tons of indels still has correct catg info, and sometimes a sample at a locus with complete sequence (except 2 indels at the end) has bad catg info.
Simulated data appears unaffected (naturally).
Pedicularis data also seems unaffected. I checked about 10 random sequences for three different samples and they were all fine.
So, in consens_se.py the catgs are calculated from self.arrayed
on line 709. The sequence that gets written out comes from self.consens
. self.arrayed
is created inside build_consens_and_array
and then self.consens
is generated from arrayed by base_caller
, then a bunch of other stuff happens, and somewhere in there arrayed
and consens
diverge. Here are all the places arrayed gets manipulated:
Line 611:
self.consens = self.consens[ltrim:rtrim + 1]
self.arrayed = self.arrayed[:, ltrim:rtrim + 1]
Line 637 (inside mask_repeats
):
# apply filter
self.consens = self.consens[keep]
self.arrayed = self.arrayed[:, keep]
This all looks legit.
storealleles
manipulates consens
, but I can see that this problem crops up both for sequences with 1 and with 2 alleles, so it can't be this.
FUCK! I cracked it. It's GBS.
Here is a bad base:
RAD_9 8 loc9_pos7 G T 13 PASS NS=2;DP=26 GT:DP:CATG 1/1:9:0,0,9,0 0/0:17:0,0,17,0 ./.:0:0,0,0,0 ./.:0:0,0,0,0
Here is the locus in the clust_database.fa:
>S197batch2_10079
TAATATTTCTTATAGTTGTGTCAGATTAGAACAATCTGCTTGTTGTTTGGTGTAGTTATGACTTCAGAAACATAGAAGAAGACAGGCAGTTGCATTTGTTTTGTCAGCAATTGTCTGGCAGTTTCTAGTTGGACAGAGACTGATnnnnGCAGTTTCTAGTTGGACAGAGACTGATTAGAAGAATCTGCTTGTTGTTTGGTATAGTTATAACTTCAGAAACTAAGAAGAAGATAGGCAGTTGCATTTGTTTTGTCAGCCATTTTCTGGCAGTTTGTAGCTGGACGGAGACTTA
>S226zAbatch2_6418
TAATATTTCTGATAGTTGTGTCAGATTAGAACAATCTGCTTGTTGTTTGGCGTAGTTATGACTTCAGAAACATAGAAGAAGACAGGCAGTTGCATTTGTTTTGTCAGCAATTGTCTGGCAGTTTCTAGTTGGACAGAGACAGATnnnnGCAGTTTCTAGTTGGACAGAGACAGATTAGAAGAATCTGCTTGTTGTTTGGTATAGTTATAACTTCAGAAACTAAGAAGAAGATAGGCAGTTGCATTTGTTTTGTCAGCCATTTTCTGGCAGTTTGTAGCTGGACGGAGACTTA
Here are the first 17 bases from the catg for S197batch2 (matches the proximal end of the sequence in the clust db):
[[0 0 9 0]
[0 9 0 0]
[0 9 0 0]
[0 0 9 0]
[0 9 0 0]
[0 0 9 0]
[0 0 9 0]
[0 0 9 0]
[9 0 0 0]
[0 0 8 1]
[0 0 9 0]
[0 9 0 0]
[0 0 9 0]
[0 9 0 0]
[0 0 0 9]
[0 0 9 0]
[0 0 9 0]]
Here are the first 17 bases from the catg for S226zAbatch2 (revcomp matches the last 17 bases of the distal end of the sequence in the clust db):
[[ 0 0 17 0]
[ 0 17 0 0]
[ 0 17 0 0]
[ 0 0 0 17]
[ 0 0 17 0]
[17 0 0 0]
[ 0 0 17 0]
[17 0 0 0]
[17 0 0 0]
[ 0 0 0 17]
[ 0 0 17 0]
[17 0 0 0]
[17 0 0 0]
[ 0 17 0 0]
[ 0 0 0 17]
[17 0 0 0]
[ 0 0 17 0]]
Still don't know how to fix it, but damn, i'm super happy to have figured it out and it seems like it might be an easy fix.....
SHIIIIIIIIITTTTTTTTTTTTTTTTTTTTTTTTT! It's not easy to fix. In clustmap_across.py GBS data gets clustered on both strands, so sequences can 'flip' to match the strand of the seed, which obviously flips the orientation of the sequence and breaks the connection with the catg info constructed during step 5. Damn.
Essentially the depth info is getting fsck for GBS data because the catg depths are recorded during step 5 and then during clustering in step 6 the consensus sequences can get randomly re-oriented to match the strand of the seed, which breaks the connection with the orientation of the depths in the catg file.
The fix for this will be to parse the utemp file during step 6 and step through each catg file for each sample and reverse the depth info for hits that match on the negative strand. This only needs to be done for GBS data, and probably should happen at the end of step 6. This will be super annoying, and a lot of work, so I will file this under "will fix when someone pays me to do it." ;p
Reported on gitter.im
denovo
I also get malformed CATG fields (missing count): locus_17982 155 . A C 13 PASS NS=12;DP=12794