artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License
110 stars 68 forks source link

Heterozygotic variants more precise filtering #59

Open macieksk opened 3 years ago

macieksk commented 3 years ago

Hi, as described in this pull request https://github.com/artic-network/fieldbioinformatics/pull/58 , I found out that a more precise filtering for heterozygotic variants for artic minion --medaka pipeline is desirable.

Here is one of many examples I found in my samples of significant heterozygotic variants, which were filtered out by artic_vcf_filter --longshot. With the proposed patch such variants are left in *.pass.vcf.gz files, while at the same time homopolymer false positives are still put in *.fail.vcf (all of them in my over a dozen samples).

Good heterozygotic variant filtered out with current pipeline, retained with the patch: MN908947.3 24872 . G T 500.0 PASS DP=400;AC=120,227;AM=53;MC=0;MF=0.0;MB=0.0;AQ=11.48;GM=1;PH=6.02,6.02,6.02,6.02;SC =None; GT:GQ:PS:UG:UQ 0/1:147.24:.:0/1:147.24 image

False positive homopolymer filtered out with current and patched pipelines: MN908947.3 10527 . C CT 96.06 PASS DP=398;AC=130,59;AM=209;MC=0;MF=0.0;MB=0.0;AQ=7.4;GM=1;PH=6.02,6.02,6.02,6.02;SC=None; GT:GQ:PS:UG:UQ 0/1:96.06:.:0/1:96.06 image

Do you think this more precise filtering could be the default? Maybe with different default filtering parameters, although I found these in the patch (0.5 minimal fraction of alternate reads, 12 reads minimum to retain) to be ok.

kitajimajoao commented 3 years ago

Hello, we are sequencing SarsCov-2 genome and we notice this situation too. We also need "Good heterozygotic variants" be published as pass and not as fail. The PR seems ok but I noted that it was not incorporated in the current release of Artic bfx pipeline. Any news about this feature in the future?

ItokawaK commented 3 years ago

I think we just recently observed a related issue using artic v1.2.1.

We noticed that the 28262 G>GAACA insertion in P.1 lineage was filtered-out and the pipeline assigned a single "N" letter on this site when "--medaka" option was used. This seems because longshot annotated this variation as heterozygotic genotype and the pipeline throwed this variation into xx.fail.vcf file.

As I looked GISAID database up to 16 March, there are 36 entries hit by 'grep CTAAACNAACAAA'. Approx. half of them were P.1 and the remaining entries were B.1.1.7 and B.1.526. I assume those "N" site were derived from the same issue with "--medaka", and if I am right, this option should be used carefully in current version of pipeline.

Default nanopolish mode precisely incorporated this variantion to consensus.