WGLab / NanoCaller

Variant calling tool for long-read sequencing data
MIT License

Not calling variant present in bam #7

Open Mailinnia opened 3 years ago

Mailinnia commented 3 years ago

I have used your tool on one of my bam files. However, I am wondering why it isn't calling a variant I can clearly see present in the file in IGV. There is plenty of coverage, and few indels in the reads at this position.

image

I find it in the snp_stats, but I do not understand why it is not output in the snps.vcf:

pos,ref,prob_GT,prob_A,prob_G,prob_T,prob_C,DP,freq
42131531,G,0.9530,0.1847,0.9624,0.0013,0.0008,111,0.3063

I'm running the following command:

python ../NanoCaller/scripts/NanoCaller.py -bam gene.sort2.bam -ref hg38_genome.fasta -prefix output -chrom Chr_22 -start 42076077 -end 42176157 --disable_whatshap -sup

umahsn commented 3 years ago

It is present in the snp_stats file because it was picked up as a candidate site, but it was not included in the VCF file because it was determined to be a false positive. NanoCaller calculated the probability of an A base being present as 0.1847, which is too low for a variant call. Are you using Nanopore or PacBio reads? It might help to zoom out in IGV to see the surrounding 1-2000 bp for a better understanding of why this was regarded as a false positive.
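To make the point about the A probability concrete, here is a minimal Python sketch that reads one snp_stats row and reports the most probable non-reference base. It assumes the columns mean what their names suggest (`prob_A` as the probability that an A is present, and so on). The helper name `best_alt_base` is made up for illustration and is not part of NanoCaller. No calling threshold is applied, because the cutoff NanoCaller uses is not stated in this thread.

```python
import csv
import io

BASES = ("A", "G", "T", "C")


def best_alt_base(header, row):
    """Return (base, probability) for the most probable non-reference base.

    Hypothetical helper: assumes prob_<base> columns as shown in this thread.
    """
    record = next(csv.DictReader(io.StringIO(header + "\n" + row)))
    ref = record["ref"]
    # Keep only the bases that differ from the reference base.
    probs = {b: float(record["prob_" + b]) for b in BASES if b != ref}
    alt = max(probs, key=probs.get)
    return alt, probs[alt]


header = "pos,ref,prob_GT,prob_A,prob_G,prob_T,prob_C,DP,freq"
row = "42131531,G,0.9530,0.1847,0.9624,0.0013,0.0008,111,0.3063"

alt, prob = best_alt_base(header, row)
print(alt, prob)  # A 0.1847
```

For the row reported above, A is the only plausible alternative base, and its probability is the value the site was judged on.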

We are planning to release an update that lets you get a different snapshot of the BAM file than IGV provides, similar to the one in Fig. 1 of our bioRxiv paper. It will show only the sites with high alternative allele frequency and skip other bases, which would let you see whether an allele might be false.

Mailinnia commented 3 years ago

If I filter for only primary read mappings, then it calls the variant fine.

snps.vcf:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE
Chr_22 42131531 . G A 38.810 PASS . GT:DP:FQ 0/1:45:0.4444

snp_stats:

pos,ref,prob_GT,prob_A,prob_G,prob_T,prob_C,DP,freq
42131531,G,0.9283,0.5908,0.9420,0.0000,0.0000,45,0.4444
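The DP and freq values from the two runs can be compared with a little arithmetic. The sketch below rests on two assumptions that the thread does not confirm: that `freq` is the fraction of the `DP` reads carrying the alternative allele, and that the primary-only reads are a subset of the reads used in the first run. The function name `alt_read_count` is illustrative only.

```python
def alt_read_count(depth, freq):
    """Estimate alt-supporting reads, assuming freq = alt reads / depth."""
    return round(depth * freq)


# Values copied from the two snp_stats rows in this thread.
all_dp, all_freq = 111, 0.3063   # all alignments
pri_dp, pri_freq = 45, 0.4444    # primary alignments only

all_alt = alt_read_count(all_dp, all_freq)
pri_alt = alt_read_count(pri_dp, pri_freq)

# Reads present only in the unfiltered run, and how many of them carry A.
extra_dp = all_dp - pri_dp
extra_alt = all_alt - pri_alt
extra_freq = extra_alt / extra_dp

print(all_alt, pri_alt)          # 34 20
print(extra_dp, extra_alt)       # 66 14
print(round(extra_freq, 4))      # 0.2121
```

Under those assumptions, the 66 additional alignments support the A allele at roughly 21%, compared with about 44% among the primary alignments, which is why the overall frequency drops to about 31% when they are included.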

igv_snapshot

I'm trying to understand why it calculates the probability of presence of A base to be so low when the supplementary reads are included. I'm guessing including the supplementary reads introduces too much 'noise' in the surrounding area?
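For reference, whether an alignment is primary is determined by two bits of the SAM FLAG field: 0x100 (secondary) and 0x800 (supplementary). The thread does not say how the filtering was done, so the following is only a sketch of the flag test itself, with a hypothetical helper name `is_primary`.

```python
SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment


def is_primary(flag):
    """True if neither the secondary nor the supplementary bit is set."""
    return (flag & (SECONDARY | SUPPLEMENTARY)) == 0


# Example FLAG values: forward primary, reverse primary,
# secondary, supplementary, reverse supplementary.
flags = [0, 16, 256, 2048, 2064]
print([is_primary(f) for f in flags])  # [True, True, False, False, False]
```

The combined mask is 0x900 (2304), the value commonly passed to an exclude-flags filter to keep only primary alignments.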

I'm using ONT data that has been corrected with Netcat. I just wanted to compare the variant calls with and without error correction.