artic-network / fieldbioinformatics

The ARTIC field bioinformatics pipeline
MIT License

Heterozygotic variants more precise filtering #58

Closed macieksk closed 3 years ago

macieksk commented 3 years ago

The original artic_vcf_filter --longshot (used in the ARTIC Nanopore Medaka pipeline) filters out heterozygotic variants completely. This drops otherwise good mosaic variants that are present in sequenced virus samples. For example, a genuine variant present in only 70% of reads was filtered out. This patch adds options for more precise control of heterozygotic variant filtering, with moderately permissive defaults that should still filter out nanopore homopolymer false positives. The old behavior can be restored with `--hetmf Inf`.

usage: artic_vcf_filter [-h] [--nanopolish] [--medaka] [--longshot]
                        [--no-frameshifts]
                        [--heterozygotic-min-fraction HETMF]
                        [--heterozygotic-min-reads HETMR]
                        inputvcf output_pass_vcf output_fail_vcf

positional arguments:
  inputvcf
  output_pass_vcf
  output_fail_vcf

optional arguments:
  -h, --help            show this help message and exit
  --nanopolish
  --medaka
  --longshot
  --no-frameshifts
  --heterozygotic-min-fraction HETMF, --hetmf HETMF
                        minimal fraction of alternate allele reads for a
                        heterozygotic variant to be accepted (for medaka, and
                        longshot filters) (default: 0.5)
  --heterozygotic-min-reads HETMR, --hetmr HETMR
                        minimal number of alternate allele reads for a
                        heterozygotic variant to be accepted (for medaka, and
                        longshot filters) (default: 12)
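
For readers who want to see how such options could be wired up, here is a minimal sketch of an argparse parser that accepts the options shown in the usage text above. It is a reconstruction from the help output only, not the code in the patch: the function name `build_parser` and the `dest` names `hetmf` and `hetmr` are assumptions. It also shows why `--hetmf Inf` works as a value: Python's `float` accepts the string "Inf".

```python
import argparse


def build_parser():
    # Hypothetical reconstruction from the usage text; not the patch's code.
    parser = argparse.ArgumentParser(
        prog="artic_vcf_filter",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("--nanopolish", action="store_true")
    parser.add_argument("--medaka", action="store_true")
    parser.add_argument("--longshot", action="store_true")
    parser.add_argument("--no-frameshifts", action="store_true")
    parser.add_argument(
        "--heterozygotic-min-fraction", "--hetmf",
        dest="hetmf", type=float, default=0.5,
        help="minimal fraction of alternate allele reads for a "
             "heterozygotic variant to be accepted",
    )
    parser.add_argument(
        "--heterozygotic-min-reads", "--hetmr",
        dest="hetmr", type=int, default=12,
        help="minimal number of alternate allele reads for a "
             "heterozygotic variant to be accepted",
    )
    parser.add_argument("inputvcf")
    parser.add_argument("output_pass_vcf")
    parser.add_argument("output_fail_vcf")
    return parser


# "Inf" is parsed by float(), so the threshold becomes positive infinity.
strict_args = build_parser().parse_args(
    ["--medaka", "--hetmf", "Inf", "in.vcf", "pass.vcf", "fail.vcf"]
)
default_args = build_parser().parse_args(
    ["--medaka", "in.vcf", "pass.vcf", "fail.vcf"]
)
```

With an infinite minimum fraction no finite allele fraction can reach the threshold, which is consistent with the statement that `--hetmf Inf` restores the old behavior of rejecting every heterozygotic variant.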

An example of a heterozygotic variant accepted with the default parameters:
> MN908947.3 24872 . G T 500.0 PASS DP=400;AC=120,227;AM=53;MC=0;MF=0.0;MB=0.0;AQ=11.48;GM=1;PH=6.02,6.02,6.02,6.02;SC=None; GT:GQ:PS:UG:UQ 0/1:147.24:.:0/1:147.24

An example of a filtered-out homopolymer false positive:
> MN908947.3 10527 . C CT 96.06 PASS DP=398;AC=130,59;AM=209;MC=0;MF=0.0;MB=0.0;AQ=7.4;GM=1;PH=6.02,6.02,6.02,6.02;SC=None; GT:GQ:PS:UG:UQ 0/1:96.06:.:0/1:96.06
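
The two records above can be checked against the default thresholds with a small sketch. This is an illustration under stated assumptions, not the patch's implementation: it assumes the `AC` INFO field holds the reference and alternate read counts in that order, that the alternate fraction is computed as alt / (ref + alt) rather than alt / DP, and that the caller has already established that the genotype is heterozygous (0/1). The helper names `parse_info` and `passes_het_filter` are hypothetical.

```python
import math


def parse_info(info):
    """Split a VCF INFO string into a dict of key -> raw string value."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        if sep:  # skip empty items (trailing ';') and valueless flags
            fields[key.strip()] = value.strip()
    return fields


def passes_het_filter(info, hetmf=0.5, hetmr=12):
    """Return True if a heterozygous call meets both thresholds.

    Assumption: AC is "ref_reads,alt_reads" and the fraction is taken
    over ref + alt reads. The patch may use a different denominator.
    """
    ref_reads, alt_reads = (int(x) for x in parse_info(info)["AC"].split(","))
    total = ref_reads + alt_reads
    if total == 0:
        return False
    return alt_reads >= hetmr and alt_reads / total >= hetmf


ACCEPTED = ("DP=400;AC=120,227;AM=53;MC=0;MF=0.0;MB=0.0;AQ=11.48;GM=1;"
            "PH=6.02,6.02,6.02,6.02;SC=None;")
REJECTED = ("DP=398;AC=130,59;AM=209;MC=0;MF=0.0;MB=0.0;AQ=7.4;GM=1;"
            "PH=6.02,6.02,6.02,6.02;SC=None;")

accepted_result = passes_het_filter(ACCEPTED)   # 227 / 347 is about 0.65
rejected_result = passes_het_filter(REJECTED)   # 59 / 189 is about 0.31
old_behavior = passes_het_filter(ACCEPTED, hetmf=math.inf)
```

Under these assumptions the first record passes (227 alternate reads, about 65% of the counted reads) and the second fails (59 alternate reads, about 31%). If the fraction were taken over DP instead, the two outcomes would be the same for these records, although the exact fractions would differ.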

nickloman commented 3 years ago

Thanks for this PR. It looks interesting but we will need to give it a thorough review and test, so please bear with us.

victormaricato commented 3 years ago

@macieksk can you please simplify this Pull Request? It is full of unnecessary changes, such as code formatting the whole repository.

I am interested in this change and would be glad to contribute to the review.

macieksk commented 3 years ago

@maricatovictor Hi, I'm not sure what happened to this pull request; it used to be relatively simple, at least as far as I remember. Is it because of recent pipeline updates? Anyway, I'll look into this in a few hours and possibly create a new, simpler pull request. Thanks.

macieksk commented 3 years ago

@maricatovictor The most recent pull request with this modification is #77. I was only able to test that it runs without errors. The modified filter is implemented for the --medaka option only.