Closed robertwhbaldwin closed 2 years ago
I should add here's the header to the DiscoSnp VCF. It does go REF and then ALT
I've been comparing the VCF files further. There's definitely some concordant snps (same chrom, coordinate, ref, and alt) so when records do overlap between the two files the REF is generally being reported the same way (the alt as well). So I'm still confused why some records have different REFs in the two files at the same chrom and position (The alt is different as well in this case though it doesn't necessarily need to be the same).
Dear Robert,
Thanks for your question and sorry for this late answer, yet simple. DiscoSnps finds variants with no reference. In the example you provide, it found from reads only variant A vs G at position 799357. To respect the vcf format we had to re-use the ALT & REF fields but this is meaningless with no reference genome.
In the INFO field, you also find Genome=C
which indicates that at this position the genome contains a C, as revealed by the TRUTH vcf.
Note that if this nucleotide corresponds to one from the SNP, then it is chosen as REF.
Note also that you may use the genome as one of the input sets (then not used only as a reference for generating the VCF file). This is the option -R
I hope this is clear. Pierre
Nope, still confused. You mapped the contigs to the reference with bwa and then located the variants on the reference and reported the coordinates. But some variants could not be reported this way because they do no map to the ref. So you put them in the VCF anyways ? How can a variant have no reference but then be reported in the VCF with a chrom and position? It makes no sense.
DiscoSnp finds variants with no reference genome, directly from reads only. This creates a pair of sequences for each variant (see publications). Then it uses the genome to localize their positions. This explains how we can be in the situation you describe.
Here is a fake representation of what happens in your case:
So you're saying that the ref being reported will ONLY match the reference when the reference allele was present in the reads?
yes, unless using the -R option
ok thanks I think that I understand now
Hello, Thanks for this great tool I just have a concern.
I ran DiscoSNP using the following command:
Running discoSnp++ 2.3.X, in directory /home/robert/tools/DiscoSnp with following parameters: read_sets=mahi_file.list prefix=discoRes_k_31_c_3 c=3 C=2147483647 k=31 b=0 d=1 D=40 s= P=3 p=discoRes G=/home/robert/assembly/MAHI_genome.fasta e= starting date=Thu 28 Oct 2021 09:15:24 AM ADT
I also generated a second VCF by mapping the fasta to the same reference as follows:
ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -M -t 10 -v 3 -R @RG\tID:F003\tLB:L001\tSM:F003\tPL:ILLUMINA /home/robert/assembly/MAHI_genome.fasta.gz /data/MAHI/F_003_i/F-003-i_S82_L002_R1_001.fastq.gz /data/MAHI/F_003/F-003-_S82_L002_R2_001.fastq.gz
The only difference was that for the VCF generated by read mapping the assembly was bgzipped.
I then wanted to do some concordance between the VCFs using the VCF generated by read mapping as TRUTH and the DiscoSnp VCF as the CALL set.
AT this point I noticed that for some records the ref alleles do not match. For example, here's a TRUTH VCF record:
ptg000002l 799357 . C A 4761.33 PASS AB=0.548571;ABP=6.59633;AC=28;AF=0.33;AN=86;AO=176;CIGAR=1X;DP=511;DPB=511;DPRA=0.994422;EPP=5.42853;EPPR=44.2023;GTI=2;LEN=1;MEANALT=1.08;MQM=60;MQMR=60;NS=49;NUMALT=1;ODDS=0.545245;PAIRED=0.988636;PAIREDR=0.99696;PAO=0;PQA=0;PQR=0;PRO=0;QA=6338;QR=11961;RO=329;RPL=81;RPP=5.42853;RPPR=3.54492;RPR=95;RUN=1;SAF=106;SAP=19.0002;SAR=70;SRF=192;SRP=22.976;SRR=137;TYPE=snp;technology.ILLUMINA=1 GT:AD:AO:DP:FT:GQ:PL:QA:QR:RO ./.:1,0:0:1:DP4:11:0,3,37:0:37:1 1/1:0,6:6:6:PASS:19:203,18,0:222:0:0 0/1:7,6:6:13:PASS:99:164,0,198:222:259:7 0/1:2,7:7:9:PASS:54:210,0,43:259:74:2 0/0:15,0:0:15:PASS:53:0,45,481:0:531:15 0/0:12,0:0:12:PASS:44:0,36,403:0:444:12 0/1:3,4:4:7:PASS:90:116,0,82:148:111:3 0/0:8,0:0:8:PASS:32:0,24,270:0:296:8 0/0:10,0:0:10:PASS:38:0,30,313:0:344:10 1/1:0,9:9:9:PASS:28:292,27,0:321:0:0 0/1:4,8:8:13:PASS:99:232,0,99:296:148:4 0/0:11,0:0:11:PASS:41:0,33,359:0:395:11 1/1:0,11:11:11:PASS:34:370,33,0:407:0:0 0/0:7,0:0:7:PASS:29:0,21,226:0:247:7 0/1:6,9:9:15:PASS:99:258,0,147:333:210:6 0/0:9,0:0:9:PASS:35:0,27,303:0:333:9 0/1:4,3:3:7:PASS:82:82,0,116:111:148:4 0/0:18,0:0:18:PASS:62:0,54,603:0:666:18 0/1:5,12:12:17:PASS:99:341,0,119:432:185:5 1/1:0,21:21:21:PASS:64:657,63,0:727:0:0 0/0:9,0:0:9:PASS:35:0,27,292:0:321:9 0/0:4,0:0:4:PASS:20:0,12,137:0:148:4 0/0:11,0:0:11:PASS:41:0,33,370:0:407:11 0/0:11,0:0:11:PASS:41:0,33,370:0:407:11 0/0:15,0:0:15:PASS:53:0,45,492:0:543:15 0/0:20,0:0:20:PASS:68:0,60,669:0:740:20 0/1:5,7:7:12:PASS:99:190,0,110:247:159:5 ./.:0,3:3:3:DP4:10:104,9,0:111:0:0 0/1:10,5:5:15:PASS:99:114,0,291:173:370:10 0/1:6,5:5:11:PASS:99:137,0,170:185:222:6 0/0:16,0:0:16:PASS:56:0,48,536:0:592:16 1/1:0,7:7:7:PASS:22:213,21,0:233:0:0 0/1:10,5:5:15:PASS:99:125,0,268:185:344:10 1/1:0,11:11:11:PASS:34:346,33,0:381:0:0 0/1:8,6:6:14:PASS:99:161,0,204:222:270:8 0/0:14,0:0:14:PASS:50:0,42,470:0:518:140/1:4,9:9:13:PASS:99:264,0,98:333:148:4 1/1:0,11:11:16:PASS:0:366,33,0:407:0:0 0/0:11,0:0:11:PASS:41:0,33,370:0:407:11 0/1:2,5:5:7:PASS:57:149,0,49:185:74:2 0/0:9,0:0:9:PASS:35:0,27,292:0:321:9 0/0:10,0:0:10:PASS:38:0,30,337:0:370:10 0/0:9,0:0:9:PASS:35:0,27,303:0:333:9 0/0:15,0:0:15:PASS:53:0,45,492:0:543:15 0/0:5,0:0:5:PASS:23:0,15,170:0:185:5
And here's the DiscoSnp record:
ptg000002l 799357 1589465 A G . PASS Ty=SNP;Rk=0.44587;UL=58;UR=50;CL=396;CR=395;Genome=C;Sd=-1 GT:DP:PL:AD:HQ 0/0:7:5,25,144:7,0:68,0 0/0:6:5,22,124:6,0:70,0 0/0:9:5,31,184:9,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:5:4,19,104:5,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:9:5,31,184:9,0:68,00/0:10:15,24,174:9,1:70,44 ./.:0:.,.,.:0,0:0,0 0/0:15:5,49,304:15,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:10:5,34,204:10,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:3:.,.,.:3,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:15:5,49,304:15,0:67,0 0/0:22:5,70,444:22,0:68,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:7:5,25,144:7,0:68,0 ./.:5:.,.,.:5,0:67,0 0/0:5:4,19,104:5,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:6:5,22,124:6,0:70,0 ./.:5:.,.,.:5,0:70,0 0/0:12:5,40,244:12,0:67,0 0/0:7:5,25,144:7,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:10:5,34,204:10,0:70,0 0/1:16:52,20,212:12,4:70,70 ./.:0:.,.,.:0,0:0,0 0/0:5:4,19,104:5,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0
And when I look at the assembly in this region I get this:
bedtools getfasta -fi MAHI_genome.fasta -bed var.bed
There's no A here. The start in the bed is 0-based so the second C is the REF being reported. I can't figure out what the problem is or why DiscoSnp is reporting a REF allele that does not appear to exist.
Any ideas? Thank You. - RObert