Open macieksk opened 3 years ago
Hello, we are sequencing SarsCov-2 genome and we notice this situation too. We also need "Good heterozygotic variants" be published as pass and not as fail. The PR seems ok but I noted that it was not incorporated in the current release of Artic bfx pipeline. Any news about this feature in the future?
I think we just recently observed a related issue using artic v1.2.1.
We noticed that the 28262 G>GAACA insertion in P.1 lineage was filtered-out and the pipeline assigned a single "N" letter on this site when "--medaka" option was used. This seems because longshot annotated this variation as heterozygotic genotype and the pipeline throwed this variation into xx.fail.vcf file.
As I looked GISAID database up to 16 March, there are 36 entries hit by 'grep CTAAACNAACAAA'. Approx. half of them were P.1 and the remaining entries were B.1.1.7 and B.1.526. I assume those "N" site were derived from the same issue with "--medaka", and if I am right, this option should be used carefully in current version of pipeline.
Default nanopolish mode precisely incorporated this variantion to consensus.
Hi, as described in this pull request https://github.com/artic-network/fieldbioinformatics/pull/58 , I found out that a more precise filtering for heterozygotic variants for
artic minion --medaka
pipeline is desirable.Here is one of many examples I found in my samples of significant heterozygotic variants, which were filtered out by
artic_vcf_filter --longshot
. With the proposed patch such variants are left in*.pass.vcf.gz
files, while at the same time homopolymer false positives are still put in*.fail.vcf
(all of them in my over a dozen samples).Good heterozygotic variant filtered out with current pipeline, retained with the patch:![image](https://user-images.githubusercontent.com/439861/89559222-a9a07700-d815-11ea-915b-a1c4a6eb4ccc.png)
MN908947.3 24872 . G T 500.0 PASS DP=400;AC=120,227;AM=53;MC=0;MF=0.0;MB=0.0;AQ=11.48;GM=1;PH=6.02,6.02,6.02,6.02;SC =None; GT:GQ:PS:UG:UQ 0/1:147.24:.:0/1:147.24
False positive homopolymer filtered out with current and patched pipelines:![image](https://user-images.githubusercontent.com/439861/89559373-d9e81580-d815-11ea-9a37-ef3d877ae410.png)
MN908947.3 10527 . C CT 96.06 PASS DP=398;AC=130,59;AM=209;MC=0;MF=0.0;MB=0.0;AQ=7.4;GM=1;PH=6.02,6.02,6.02,6.02;SC=None; GT:GQ:PS:UG:UQ 0/1:96.06:.:0/1:96.06
Do you think this more precise filtering could be the default? Maybe with different default filtering parameters, although I found these in the patch (0.5 minimal fraction of alternate reads, 12 reads minimum to retain) to be ok.