GATB / DiscoSnp

DiscoSnp is designed for discovering all kinds of SNPs (not only isolated ones), as well as insertions and deletions, from raw set(s) of reads.
https://gatb.inria.fr/software/discosnp/
GNU Affero General Public License v3.0
38 stars 20 forks source link

DiscoSnp VCF record; Ref allele does not match Assembly Used #26

Closed robertwhbaldwin closed 2 years ago

robertwhbaldwin commented 2 years ago

Hello, Thanks for this great tool I just have a concern.

I ran DiscoSNP using the following command:

Running discoSnp++ 2.3.X, in directory /home/robert/tools/DiscoSnp with following parameters: read_sets=mahi_file.list prefix=discoRes_k_31_c_3 c=3 C=2147483647 k=31 b=0 d=1 D=40 s= P=3 p=discoRes G=/home/robert/assembly/MAHI_genome.fasta e= starting date=Thu 28 Oct 2021 09:15:24 AM ADT

I also generated a second VCF by mapping the fasta to the same reference as follows:

ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -M -t 10 -v 3 -R @RG\tID:F003\tLB:L001\tSM:F003\tPL:ILLUMINA /home/robert/assembly/MAHI_genome.fasta.gz /data/MAHI/F_003_i/F-003-i_S82_L002_R1_001.fastq.gz /data/MAHI/F_003/F-003-_S82_L002_R2_001.fastq.gz

The only difference was that for the VCF generated by read mapping the assembly was bgzipped.

I then wanted to do some concordance between the VCFs using the VCF generated by read mapping as TRUTH and the DiscoSnp VCF as the CALL set.

AT this point I noticed that for some records the ref alleles do not match. For example, here's a TRUTH VCF record:

ptg000002l 799357 . C A 4761.33 PASS AB=0.548571;ABP=6.59633;AC=28;AF=0.33;AN=86;AO=176;CIGAR=1X;DP=511;DPB=511;DPRA=0.994422;EPP=5.42853;EPPR=44.2023;GTI=2;LEN=1;MEANALT=1.08;MQM=60;MQMR=60;NS=49;NUMALT=1;ODDS=0.545245;PAIRED=0.988636;PAIREDR=0.99696;PAO=0;PQA=0;PQR=0;PRO=0;QA=6338;QR=11961;RO=329;RPL=81;RPP=5.42853;RPPR=3.54492;RPR=95;RUN=1;SAF=106;SAP=19.0002;SAR=70;SRF=192;SRP=22.976;SRR=137;TYPE=snp;technology.ILLUMINA=1 GT:AD:AO:DP:FT:GQ:PL:QA:QR:RO ./.:1,0:0:1:DP4:11:0,3,37:0:37:1 1/1:0,6:6:6:PASS:19:203,18,0:222:0:0 0/1:7,6:6:13:PASS:99:164,0,198:222:259:7 0/1:2,7:7:9:PASS:54:210,0,43:259:74:2 0/0:15,0:0:15:PASS:53:0,45,481:0:531:15 0/0:12,0:0:12:PASS:44:0,36,403:0:444:12 0/1:3,4:4:7:PASS:90:116,0,82:148:111:3 0/0:8,0:0:8:PASS:32:0,24,270:0:296:8 0/0:10,0:0:10:PASS:38:0,30,313:0:344:10 1/1:0,9:9:9:PASS:28:292,27,0:321:0:0 0/1:4,8:8:13:PASS:99:232,0,99:296:148:4 0/0:11,0:0:11:PASS:41:0,33,359:0:395:11 1/1:0,11:11:11:PASS:34:370,33,0:407:0:0 0/0:7,0:0:7:PASS:29:0,21,226:0:247:7 0/1:6,9:9:15:PASS:99:258,0,147:333:210:6 0/0:9,0:0:9:PASS:35:0,27,303:0:333:9 0/1:4,3:3:7:PASS:82:82,0,116:111:148:4 0/0:18,0:0:18:PASS:62:0,54,603:0:666:18 0/1:5,12:12:17:PASS:99:341,0,119:432:185:5 1/1:0,21:21:21:PASS:64:657,63,0:727:0:0 0/0:9,0:0:9:PASS:35:0,27,292:0:321:9 0/0:4,0:0:4:PASS:20:0,12,137:0:148:4 0/0:11,0:0:11:PASS:41:0,33,370:0:407:11 0/0:11,0:0:11:PASS:41:0,33,370:0:407:11 0/0:15,0:0:15:PASS:53:0,45,492:0:543:15 0/0:20,0:0:20:PASS:68:0,60,669:0:740:20 0/1:5,7:7:12:PASS:99:190,0,110:247:159:5 ./.:0,3:3:3:DP4:10:104,9,0:111:0:0 0/1:10,5:5:15:PASS:99:114,0,291:173:370:10 0/1:6,5:5:11:PASS:99:137,0,170:185:222:6 0/0:16,0:0:16:PASS:56:0,48,536:0:592:16 1/1:0,7:7:7:PASS:22:213,21,0:233:0:0 0/1:10,5:5:15:PASS:99:125,0,268:185:344:10 1/1:0,11:11:11:PASS:34:346,33,0:381:0:0 0/1:8,6:6:14:PASS:99:161,0,204:222:270:8 0/0:14,0:0:14:PASS:50:0,42,470:0:518:140/1:4,9:9:13:PASS:99:264,0,98:333:148:4 1/1:0,11:11:16:PASS:0:366,33,0:407:0:0 0/0:11,0:0:11:PASS:41:0,33,370:0:407:11 0/1:2,5:5:7:PASS:57:149,0,49:185:74:2 0/0:9,0:0:9:PASS:35:0,27,292:0:321:9 0/0:10,0:0:10:PASS:38:0,30,337:0:370:10 0/0:9,0:0:9:PASS:35:0,27,303:0:333:9 0/0:15,0:0:15:PASS:53:0,45,492:0:543:15 0/0:5,0:0:5:PASS:23:0,15,170:0:185:5

And here's the DiscoSnp record:

ptg000002l 799357 1589465 A G . PASS Ty=SNP;Rk=0.44587;UL=58;UR=50;CL=396;CR=395;Genome=C;Sd=-1 GT:DP:PL:AD:HQ 0/0:7:5,25,144:7,0:68,0 0/0:6:5,22,124:6,0:70,0 0/0:9:5,31,184:9,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:5:4,19,104:5,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:9:5,31,184:9,0:68,00/0:10:15,24,174:9,1:70,44 ./.:0:.,.,.:0,0:0,0 0/0:15:5,49,304:15,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:10:5,34,204:10,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:3:.,.,.:3,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:15:5,49,304:15,0:67,0 0/0:22:5,70,444:22,0:68,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 0/0:7:5,25,144:7,0:68,0 ./.:5:.,.,.:5,0:67,0 0/0:5:4,19,104:5,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:6:5,22,124:6,0:70,0 ./.:5:.,.,.:5,0:70,0 0/0:12:5,40,244:12,0:67,0 0/0:7:5,25,144:7,0:70,0 ./.:0:.,.,.:0,0:0,0 0/0:10:5,34,204:10,0:70,0 0/1:16:52,20,212:12,4:70,70 ./.:0:.,.,.:0,0:0,0 0/0:5:4,19,104:5,0:70,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0

And when I look at the assembly in this region I get this:

bedtools getfasta -fi MAHI_genome.fasta -bed var.bed

ptg000002l:799355-799359 CCTC

There's no A here. The start in the bed is 0-based so the second C is the REF being reported. I can't figure out what the problem is or why DiscoSnp is reporting a REF allele that does not appear to exist.

Any ideas? Thank You. - RObert

robertwhbaldwin commented 2 years ago

I should add here's the header to the DiscoSnp VCF. It does go REF and then ALT

bcftools_viewVersion=1.12+htslib-1.12

bcftools_viewCommand=view -v snps copy.vcf; Date=Mon Jan 17 15:44:23 2022

bcftools_viewCommand=view -h snps.vcf; Date=Tue Jan 18 11:43:56 2022

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT G1 G2 G3 G4 G5 G6 G7 G8 G9 G10 G11 G12 G13 G14 G15 G16 G17 G18 G19 G20 G21 G22 G23 G24 G25 G26 G27 G28 G29 G30 G31 G32 G33 G34 G35 G36 G37 G38 G39

robertwhbaldwin commented 2 years ago

I've been comparing the VCF files further. There's definitely some concordant snps (same chrom, coordinate, ref, and alt) so when records do overlap between the two files the REF is generally being reported the same way (the alt as well). So I'm still confused why some records have different REFs in the two files at the same chrom and position (The alt is different as well in this case though it doesn't necessarily need to be the same).

pierrepeterlongo commented 2 years ago

Dear Robert,

Thanks for your question and sorry for this late answer, yet simple. DiscoSnps finds variants with no reference. In the example you provide, it found from reads only variant A vs G at position 799357. To respect the vcf format we had to re-use the ALT & REF fields but this is meaningless with no reference genome.

In the INFO field, you also find Genome=C which indicates that at this position the genome contains a C, as revealed by the TRUTH vcf.

Note that if this nucleotide corresponds to one from the SNP, then it is chosen as REF.

Note also that you may use the genome as one of the input sets (then not used only as a reference for generating the VCF file). This is the option -R

I hope this is clear. Pierre

robertwhbaldwin commented 2 years ago

Nope, still confused. You mapped the contigs to the reference with bwa and then located the variants on the reference and reported the coordinates. But some variants could not be reported this way because they do no map to the ref. So you put them in the VCF anyways ? How can a variant have no reference but then be reported in the VCF with a chrom and position? It makes no sense.

pierrepeterlongo commented 2 years ago

DiscoSnp finds variants with no reference genome, directly from reads only. This creates a pair of sequences for each variant (see publications). Then it uses the genome to localize their positions. This explains how we can be in the situation you describe.

Here is a fake representation of what happens in your case:

screenshot 2022-01-24 à 15 21 06
robertwhbaldwin commented 2 years ago

So you're saying that the ref being reported will ONLY match the reference when the reference allele was present in the reads?

pierrepeterlongo commented 2 years ago

yes, unless using the -R option

robertwhbaldwin commented 2 years ago

ok thanks I think that I understand now