AstraZeneca-NGS / VarDictJava

VarDict Java port
MIT License
127 stars 56 forks source link

VarDict should clearly document that reads without `NM` tags are skipped #51

Closed tfenne closed 7 years ago

tfenne commented 7 years ago

VarDictJava (and presumably VarDict) completely ignore reads that are missing the optional NM tag, but this isn't clearly stated in any documentation, nor are any warnings issued while running. It would be nice if the usage documentation mentioned this, and also if it either logged a warning the first time this happens, or logged the number of missing NM tags at the end of a run.

It would be even nicer if it just recomputed the NM tag on the fly since it has the read and the reference available. There are a number of methods in htsjdk for calculating NM or just the number of mismatches (in SequenceUtil), and since VarDictJava already relies on htsjdk that would be a relatively simple route.

zhongwulai commented 7 years ago

@tfenne Thanks for the suggestion. Can I ask which aligner did you use that didn't have NM tag? When you use option "-y", it should give a warning message indicating the tag is missing, but doesn't stop.

The aligners we tested always produce "NM" tag and I think it should be required for all aligners and it should be aligners' job to produce it. Even though VarDict can recompute it, it's a recompute and the performance will suffer. If an aligner does produce it, you probably should request it.

Meanwhile, you can comment out line 1462 in VarDict.java and recompile and VarDict will not filter out reads without NM tag. In Perl version, that line is 642.

tfenne commented 7 years ago

@zhongwulai - it wasn't an aligner, I was running a tool that performs some custom trimming of the reads post-alignment, and since I didn't think I needed the NM tag (or MD etc.), I dropped those tags since they'd be inaccurate after trimming instead of re-computing them. I've fixed my process now to re-compute NM so it's not an issue for me any longer.

I totally understand not wanting to re-compute those values. That's good to know about the -y option, thanks.

I'd be happy to submit a trivial PR just to add to the standard usage that reads must have the NM tag. I think telling users that up front would be helpful.

zhongwulai commented 7 years ago

@tfenne Great. I don't know what your intention on trimming, but just want to let you know that VarDict has an option "-T INT", which, when specified, will only use the first "INT" bases in the reads. So it should be treated as hard trimming. For example, you have a sequencing run with 250bp but notice there's quality issue after 201, you can simply add option "-T 200" in VarDict. VarDict will then keep only first 200bp, or the whole read if the read length is less than 200bp.

zhongwulai commented 7 years ago

I just pushed in a fix. I'll call variants even if there's no NM tag with no warning message unless option -y is on.

tfenne commented 7 years ago

Thanks @zhongwulai that's great. Just to answer your earlier question, I have an experiment where the targets are amplified by many overlapping PCR amplicons, so each base in the target may be covered by multiple amplicons, and may be under a priming site for some of them. I'm matching inserts to amplicons post-aligment then trimming off the primers at the starts of the reads. I'm doing it all post-alignment because the primers used vary significantly in length (15-30bp) and I'd lose too much data if I just trimmed 30bp off the start of all reads.

zhongwulai commented 7 years ago

@tfenne Actually VarDict has amplicon mode to take care of trimming on the fly and detect/flag any amplicon bias variants. Check out this Wiki post: https://github.com/AstraZeneca-NGS/VarDict/wiki/Amplicon-Mode-in-VarDict