WGLab / NanoCaller

Variant calling tool for long-read sequencing data
MIT License
90 stars 8 forks source link

Unnormal results ! #37

Open AzizHN opened 1 year ago

AzizHN commented 1 year ago

Hello,

I have some ONT reads of Aspergillus fumigatus that I mapped against a reference sequence.

I am using Nanocaller for variants calling, My input files are a BAM file (555,1 Ko; 85 sequences mapped) and my ref is a fasta file (3,6 Ko, length = 1200pb)

The command that I am using is :

NanoCaller --mode snps --sequencing ont --haploid_genome --bam sorted_mapped.bam --ref gene.fna

I have few questions:

1/ Is it normal that my VCF output file contains only 8 snps ! Since, as I know the VCF file should mention all variations in all sequences compared to the ref seq, and when I have a look through IGV, I have more than 8 snps. (I think the tool mentioned only the variations that are present in almost all sequences! )? I tried to use other flags as --min_allele_freq to precise that the minimum allele frequency should be 0,01 and --mincovto precise that coverage of 1 is good to call a variant, but nothing changed, always 8 variants !

2/ Is there any way to perform some performance analysis on the tool? I was thinking to use a true dataset containing some known variants, but It seems that I didn't find the one until now.

I will be very grateful if you can help.

Thanks in advance!

have a good day!

AzizHN commented 1 year ago

@umahsn can you please help me in that if you have some time, I will be very grateful

umahsn commented 1 year ago

Hi, yes I can help look into it. Is the dataset you are analyzing publicly available? If so, I can try to replicate the variant calling pipeline and see what's going on.

umahsn commented 1 year ago

If you could let me know what reference you are using, nanopore dataset as well as how you are aligning the reads that would be helpful.

umahsn commented 1 year ago

One thing I'd recommend is that if you are using a microbiome sample, it might be better to not use --haploid_genome mode. Because even though the genome itself is haploid, if the sample is taken from a microbial community with a mixture of potentially different strains, then it becomes a slightly different question and falls into the realm of metagenomics. The haploid model of NanoCaller was trained on CHM13, a haploid cell line, and is most suitable for samples that come from cell culture or from haploid chromosomes of a single organism, basically under the assumption that there should only be a single haplotype and the model is forced to choose between homozygous reference and homozygous alternative. If you have a mixture of strains, then the sample is somewhat "polyploid" in the sense that there are multiple haplotypes present, hence metagenomics. When you run NanoCaller under haploid model, then it will tend to report those variants that are confidently supporting a single haplotype assumption. In this case, I'd recommend using NanoCaller without --haploid_genome parameter, but I would warn you that we have not evaluated accuracy of NanoCaller in metagenomic variant calling.

It might also be suitable to use a metagenomics pipeline, and there are some datasets outlined here for potentially benchmakring nanopore tools for microbial datasets: https://academic.oup.com/gigascience/article/8/5/giz043/5486468

You can also check https://academic.oup.com/gigascience/article/9/2/giaa007/5728470 for SNP pipelines for microbial datasets.

AzizHN commented 1 year ago

Hello @umahsn , thanks for your reply. Indeed, I am working on metagenomics ONT data, but when it comes to variant calling I only extracted sequences assigned to Aspergillus fumigatus (so we can say it is very similar to reads coming from a single organism) ... I mapped those sequences with minimap2 using as a reference the sequence of the CYP51A gene (1200 bp). From the SAM file, I created a BAM, then I extracted only mapped reads to the ref (83 sequences), the BAM then is sorted and indexed. Can you please send me your email so I can send you the final BAM file and the ref seq, or you can contact me directly at "aziznasr1920@gmail.com" and I will send you the files.

Thank you in advance.

umahsn commented 1 year ago

Yes I can take a look at the datasets. I have reached out via email.

umahsn commented 12 months ago

Hi,

I ran NanoCaller on this sample under the haploid model, and there was an example of a clear variant at (137 NC_007197.1:c1781822-1780204) with >97% allele frequency that was missed by NanoCaller. When I looked at the intermediate probability output, the reference base had low confidence, but was still the highest probability prediction. I believe this was an error on part of

However, when I used the diploid model, this variant was correctly identified. There were some other potential variant sites with <25% allele frequency but NanoCaller diploid model did not call those. The reason for that is that NanoCaller did not see any evidence for any other variants co-occurring on the same reads as these variants which can be tested by looking at an IGV plot as well. This indicates that such mismatches are likely to be errors, however this is still with the assumption of a diploid organism. If you are interested in just getting a list of sites with variant allele frequency in a certain range, such as 1-25% or something like that, it is possible to use samtools pileup. NanoCaller previously (v<3.0) did produce file containing a list of all sites with allele frequency above min_allele_freq threshold, but I removed it in the current version. I will look into reintroducing that feature.

umahsn commented 11 months ago

Hi,

Just wanted to update that I have added more details into the VCF files, including for candidate sites that were determined to not be true variants by NanoCaller. There is a "unfiltered" VCF file for SNPs that contains these extra sites. In the INFO field, you can see the probability that each base is a true allele. These have only been implemented for the diploid model, which I think is more suitable for your scenario since it estimates probability of occurence of each base independently. There seems to be few problems with the haploid model which I will fix, but I am also working on releasing a new model for metagenomes, which might take a couple months to get released.