CFIA-NCFAD / nf-flu

Influenza genome analysis Nextflow workflow
MIT License
17 stars 10 forks source link

Clair3 not variant calling ends of each segment #61

Open peterk87 opened 11 months ago

peterk87 commented 11 months ago

Clair3 does not call 16 bp at each end of a reference sequence (https://github.com/HKU-BAL/Clair3/issues/257)

Supplement Clair3 calls with Bcftools mpileup/call?

Codes1985 commented 5 months ago

Hello,

I figured I might as well reply to this issue, since it is semi-related.

I have been reading through the Clair3 documentation, and noticed a discussion regarding amplicon sequencing and the modification of a few parameters including changing --var_pct_full, -ref_pct_full and --var_pct_phasing, to 1 as well as enabling the --no_phasing_for_fa option. I was wondering if we should consider implementing as default for the pipeline.

I've also encountered a situation where a SNV, position 2169, is classified as RefCall in full alignment mode and PASS in pileup mode, but omitted from the merge_output.vcf.

Full Alignment Mode:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
OQ993067.1  1470    .   T   C   43.68   PASS    F   GT:GQ:DP:AD:AF  1/1:43:1243:25,1177:0.9469
OQ993067.1  2169    .   G   .   3.31    RefCall F   GT:GQ:DP:AD:AF  0/0:3:1380:75:0.0543

Pileup Mode:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
OQ993067.1  18  .   T   C   16.85   PASS    P   GT:GQ:DP:AD:AF  1/1:16:1182:28,1110:0.9391
OQ993067.1  344 .   G   .   24.42   RefCall P   GT:GQ:DP:AD:AF  0/0:24:1167:989:0.8475
OQ993067.1  345 .   G   A   25.54   PASS    P   GT:GQ:DP:AD:AF  1/1:25:1167:38,1107:0.9486
OQ993067.1  407 .   C   .   17.09   RefCall P   GT:GQ:DP:AD:AF  0/0:17:1168:886:0.7586
OQ993067.1  880 .   G   .   19.16   RefCall P   GT:GQ:DP:AD:AF  0/0:19:1187:1027:0.8652
OQ993067.1  986 .   G   .   26.00   RefCall P   GT:GQ:DP:AD:AF  0/0:26:1189:1049:0.8823
OQ993067.1  1149    .   G   A   16.00   PASS    P   GT:GQ:DP:AD:AF  1/1:16:1214:19,1159:0.9547
OQ993067.1  1273    .   C   .   19.26   RefCall P   GT:GQ:DP:AD:AF  0/0:19:1231:913:0.7417
OQ993067.1  1461    .   C   .   18.22   RefCall P   GT:GQ:DP:AD:AF  0/0:18:1243:1085:0.8729
OQ993067.1  1470    .   T   C   14.68   PASS    P   GT:GQ:DP:AD:AF  1/1:14:1243:25,1177:0.9469
OQ993067.1  2169    .   G   A   11.04   PASS    P   GT:GQ:DP:AD:AF  1/1:11:1380:75,1269:0.9196

merge_output.vcf

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
OQ993067.1  18  .   T   C   16.85   PASS    P   GT:GQ:DP:AD:AF  1:16:1182:28,1110:0.9391
OQ993067.1  345 .   G   A   25.54   PASS    P   GT:GQ:DP:AD:AF  1:25:1167:38,1107:0.9486
OQ993067.1  1149    .   G   A   16.00   PASS    P   GT:GQ:DP:AD:AF  1:16:1214:19,1159:0.9547
OQ993067.1  1470    .   T   C   43.68   PASS    F   GT:GQ:DP:AD:AF  1:43:1243:25,1177:0.9469

I take this to mean that the Clair3 model has decided, by some metric, that there is insufficient evidence to suggest it is a variant. However, based on manual interrogation of the BAM file, it looks pretty convincing.

As you might recall, amongst the polymerase genes, you often see super high coverage of the 5' and 3' ends of the segment relative the middle. I believe this is linked the presence of defective viral genomes, which have been known to affect the sequencing of these segments. This particular SNV is right on the interface where we move from "normal" coverage to "super high" coverage, so this might be why Clair3 is tossing it.

I tested the --pileup_only flag, and it did recover the SNV and incorporate it into the consensus sequence. However, I prefer your idea of keeping the more stringent Clair3 settings (i.e., using both full alignment and pileup modes) and supplementing Clair3 with BCFTools mpileup to perhaps catch situations like this, as well as those SNVs located within the first and and last 16bp of the reference.

Thanks for your time, Cody

peterk87 commented 5 months ago

Hi @Codes1985 that's really interesting the discrepancy between full alignment and pileup results from Clair3. Honestly, I'm not sure what's going on and maybe the Clair3 devs would be able to enlighten us both. I don't see how the numbers in the AD field make any sense. It looks like a number is missing... off by one error? Supplementing Clair3 variant calling results with Bcftools and maybe even Medaka might be necessary just to ensure that all possible variants are captured without certain regions being ignored or obvious variants being dropped.

peterk87 commented 5 months ago

@Codes1985 maybe updating the versions of Clair3 and other tools in the pipeline might help (#69)

Codes1985 commented 5 months ago

@peterk87 I was thinking it might be worthwhile to bump up to the latest versions of the tools.