NGSEP / NGSEPcore

NGSEP is an integrated framework for the analysis of high-throughput sequencing (HTS) reads. Its main functionality is the variants detector, which performs integrated discovery and genotyping of single nucleotide variants (SNVs), insertions, deletions, and genomic regions with copy number variation (CNVs).
GNU General Public License v3.0

Filtering variants from SingleSampleVariantsDetector #39

Closed Liukvr closed 3 years ago

Liukvr commented 3 years ago

Dear NGSEP developers,

Thanks for this great tool. I'm using the SingleSampleVariantsDetector module to identify SNVs in pool-seq data as well as in single-sample datasets. I was wondering if there is a suggested threshold for filtering the output variants using the QUAL field in the VCF file.

Thanks in advance, Luca

jduitama commented 3 years ago

Hi Luca

Thanks for your interest in NGSEP. Since you are running the single-sample command, the minimum quality filter (-minQuality) applies almost equally to the QUAL column and the GQ genotype field (the difference between the two is usually small). Further filtering options are available through the VCFFilter command. Based on our previous benchmark experiments, 40 is a good threshold to balance sensitivity and specificity. Further information specific to pooled samples can be found in our TILLING paper (https://doi.org/10.3389/fgene.2021.624513).
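For readers who want to see what a QUAL threshold of 40 does to a VCF file, here is a minimal Python sketch. It is not the VCFFilter command and does not use any NGSEP code. The function name `filter_vcf_by_qual` and the example records are made up for illustration, and dropping records with a missing QUAL is a choice of this sketch only.

```python
def filter_vcf_by_qual(lines, min_qual=40.0):
    """Yield header lines and the records whose QUAL is at least min_qual."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qual = fields[5]  # QUAL is the sixth column of a VCF record
        if qual == ".":
            continue  # missing QUAL: dropped here (assumption of this sketch)
        if float(qual) >= min_qual:
            yield line


# Hypothetical records, for illustration only
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t55\t.\t.\n",
    "chr1\t200\t.\tC\tT\t12\t.\t.\n",
]
kept = list(filter_vcf_by_qual(example))
```

Here `kept` holds the two header lines and the record at position 100. The record with QUAL 12 is removed.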

Best regards

Jorge

Liukvr commented 3 years ago

Dear Jorge,

Thanks for the reply and for the suggestion. Regarding the filtering of low-quality variants, I noticed that NGSEP does not filter by base quality when looking for reads supporting the alternative allele, is that right? In the paper you mentioned, the command line used for NGSEP included the -maxBaseQS parameter. It is not clear to me what the effect of this parameter is. Is there a way to adjust the minimum base quality for variant calling?

Thanks again, Luca

jduitama commented 3 years ago

Dear Luca

No problem. About base quality, you are right. We do not explicitly filter base calls with low raw quality, but these get less weight in the Bayesian model. This allows us to still use a large number of good base calls that receive relatively low confidence from the primary analysis. Conversely, we have the parameter -maxBaseQS, which serves as an upper bound on how much we trust the base quality scores. We added this parameter mainly because the main source of error for the Bayesian model was a bad base call with a high base quality score. This can happen if the error occurs before the sequencing process. Although the default value is 100 (in practice, no filter), our benchmark experiments indicate that values around 30 make a good compromise between sensitivity and specificity.
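The idea of an upper bound on base quality can be sketched in a few lines of Python. This is only an illustration, assuming the standard Phred relation between a quality score and an error probability. The function name `capped_error_prob` is hypothetical and this is not the NGSEP source code.

```python
def capped_error_prob(base_qs, max_base_qs=30):
    """Error probability of a base call after capping its Phred quality.

    Assumes the standard Phred relation: error = 10 ** (-Q / 10).
    """
    q = min(base_qs, max_base_qs)  # never trust a score above the cap
    return 10 ** (-q / 10)


# A Q40 call is treated as Q30 (error 0.001); a Q20 call is unchanged (error 0.01)
high = capped_error_prob(40)
low = capped_error_prob(20)
```

With a cap of 100 the `min` would almost never change a score, which matches the description of the default as effectively no filter.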

The problem with setting a plain minimum base quality score, or setting -maxBaseQS to a very low value, is that the model becomes very simple and relies solely on base counts. Our experiments indicate that integrating the information provided by base quality scores (even the low ones) through models such as the Bayesian model implemented in NGSEP actually increases the accuracy of SNP discovery and genotyping.
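To illustrate why a very low cap reduces the model to base counts, here is a toy diploid genotype likelihood in Python. It is a simplified textbook-style model written for this thread, not the Bayesian model implemented in NGSEP. The function name, the pileup format, and the assumption that an erroneous call is equally likely to be any of the other three bases are all choices of this sketch.

```python
import math


def genotype_log_likelihoods(pileup, ref, alt, max_base_qs=30):
    """Log-likelihood of each diploid genotype for a list of (base, phred) calls."""
    result = {}
    for name, alt_fraction in (("ref/ref", 0.0), ("ref/alt", 0.5), ("alt/alt", 1.0)):
        total = 0.0
        for base, qs in pileup:
            e = 10 ** (-min(qs, max_base_qs) / 10)  # capped Phred error
            # Probability of the observed base if the true allele is ref or alt
            p_if_ref = (1 - e) if base == ref else e / 3
            p_if_alt = (1 - e) if base == alt else e / 3
            total += math.log((1 - alt_fraction) * p_if_ref + alt_fraction * p_if_alt)
        result[name] = total
    return result


# Same base counts (8 A, 2 G), but the G calls differ in quality
pileup_a = [("A", 40)] * 8 + [("G", 5)] * 2   # alternative calls are low quality
pileup_b = [("A", 40)] * 8 + [("G", 40)] * 2  # alternative calls are high quality

with_qualities_a = genotype_log_likelihoods(pileup_a, "A", "G", max_base_qs=30)
with_qualities_b = genotype_log_likelihoods(pileup_b, "A", "G", max_base_qs=30)
counts_only_a = genotype_log_likelihoods(pileup_a, "A", "G", max_base_qs=2)
counts_only_b = genotype_log_likelihoods(pileup_b, "A", "G", max_base_qs=2)
```

With a cap of 30 the two pileups get different likelihoods, because the low-quality G calls in `pileup_a` are weak evidence against the ref/ref genotype. With a cap of 2 every call ends up with the same error probability, so `counts_only_a` and `counts_only_b` are identical and only the 8-to-2 count matters.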

Best regards

Jorge