artic-network / artic-ncov2019

ARTIC nanopore protocol for nCoV2019 novel coronavirus
Creative Commons Attribution 4.0 International
168 stars 166 forks source link

bcftools overlapping variants issue #38

Open Psy-Fer opened 4 years ago

Psy-Fer commented 4 years ago

Hey there.

Getting an error on the last step in the medaka workflow

bcftools consensus -f PQPR059970.preconsensus.fasta PQPR059970.pass.vcf.gz -m PQPR059970.coverage_mask.txt -o PQPR059970.consensus.fasta

It gives me the following

Note: the --sample option not given, applying all records regardless of the genotype
The fasta sequence does not match the REF allele at MN908947.3:6312:
   .vcf: [C]
   .vcf: [A] <- (ALT)
   .fa:  [N]AAAACCAGTTGAAACATCAAAT.......rest of the reference........

This seems to relate to these 2 issues from bcftools https://github.com/samtools/bcftools/issues/961 https://github.com/samtools/bcftools/issues/600

That they fixed using the latest github builds, rather than the 1.9 release.

I'm in the middle of compiling bcftools from source to give this a test to see if it fixes it.

I'll let you know how that goes.

Cheers James

Psy-Fer commented 4 years ago

Yea, using

bcftools 1.10.2-86-ge313e0f
Using htslib 1.10.2-85-g16f62c5

didn't fix it. sigh. Any thoughts?

mi-koch commented 3 years ago

I can second that. Had multiple instances where this happened. I was not able to generate a consensus sequence here. Is there any possibility you can check out https://github.com/samtools/bcftools/commit/21fd8dad371f191c2fbdcc1e18ea1ed1268962bf and update bcftools in the pipeline?

edit, 20210119 I have tried with bcftools 1.10 and 1.11, both giving the same error message. Is there anything to fix this issue? I can not see a pattern here...

Shaokang123 commented 3 years ago

Hi,

We saw this issue too. The root cause is mainly because XXX.pass.vcf and XXX.fail.vcf gave different mutation description for same loci of genome. To be more specific, for example, loci 7065 in XXX.fail.vcf showed “GA -> G”, which means loci 7066 is deletion, but XXX.pass.vcf showed loci 7066 is a “A-G” SNP. Then what ARTIC will do is to follow the fail.vcf to mask loci 7066 as “N” in your preconsensus.fasta via “artic_mask” first. Then try to generate consensus using the SNP described in pass.vcf (it will give the error you described since the pass.vcf showed the 7066 is “A”, but in preconsensus.fasta, it was changed to “N” already).

The in-house fix in our team is to scan each described mutation in both vcf files (for the failed assemblies only), in cases of discordance at the same loci, use the mutation found in XXX.pass.vcf (higher quality ones) when creating consensus assembly, and ignore the XXX.fail.vcf call (low-quality ones). Then you can continue the rest of artic steps using the newly generated vcf files (use artic_mask to generate the preconsensus.fasta, then use the above mentioned command to generate consenus via bcf_tools).

Also there might be source code based fixes already in the live code of artic (https://github.com/artic-network/fieldbioinformatics), the version here is for Artic network SOP which is not that frequently updated.

Hope it can help you.

Regards, Shaokang

will-rowe commented 3 years ago

Hi @Psy-Fer, @mi-koch and @Shaokang123

Firstly - sorry for not replying sooner (in particular to @Psy-Fer...). This sounds like something that needs fixing - I will open an issue over in the artic pipeline repo. Before I do, can @mi-koch and @Psy-Fer confirm if @Shaokang123's suggestion would explain their issue?

The explanation makes sense to me - the pipeline would fall over at the consensus step if we had overlapping vars in the PASS/FAIL. That being said, I'm not sure if we want to rescue these sites - the mask might be the preferred behaviour in the case of uncertainty with respect to the reference.

Either way, this needs a fix. The question is, do we mask in this scenario or do we rescue the pass variant. I think we default on one and then offer a flag to do the other. In the next pipeline release we have a --strict flag which performs some more stringent vcf checks prior to masking, maybe the strict mode could also tackle this. Thoughts @nickloman ?

nickloman commented 3 years ago

I generally agree we'd mask a site that was in conflict pass and fail in the overlap, as the pipeline is intended to be conservative. You could maybe add an option to allow such sites to be rescued.

mi-koch commented 3 years ago

Dear all,

thank you all for the responses. I was also digging around in the vcf files but was not able to tackle the issue. Thanks for the input @Shaokang123!

Since running everything with artic 1.1.3 now, this problem did not occur anymore. I also re-ran the failed samples with 1.1.3, which ended up in consensus sequences being produced. So all good from my side!

Thanks again for all the effort!

Shaokang123 commented 3 years ago

Hi @will-rowe and @nickloman,

Thanks for your comment.

In our particular in-house test, this error only occurred when there was a conflict between “indel of fail.vcf” and “SNP of pass.vcf”. In one of our tests, we have a few datasets using same single sars-cov-2 sample with high repeat number (n=32), and only 1 or 2 out of the 32 repeats have the conflicted indel description in the fail.vcf, which indicates the one in fail.vcf is a false positive report. In addition, here I also uploaded one pair of the vcf files, in which the loci 7066 in pass.vcf (SNP) and loci 7065 in fail.vcf (indel) have conflicts. The quality descriptions support pass.vcf much more (high DP and high score), indicates the false positive indel in fail.vcf is likely caused by some ONT noises.

So our in-house solutions in our particular cases to support pass.vcf are based on above mentioned test and data. But I also agree with @nickloman to make it conservative. So I would think to add a flag like “--strict” and let the users to tweak and decide by themselves to mask the conflict loci or use the high quality ones may be the best in next artic release.

vcfs.zip

Regards, Shaokang

will-rowe commented 3 years ago

So, just to follow up on 2 options for how we could do this.

  1. prior to generating the consensus, use bcftools norm --check-ref s ... to update the reference sites in *.pass.vcf to reflect any Ns which were introduced as a result of low coverage/fail vars. This will include pass variants from the final consensus at sites where low quality variants are also found / low coverage regions.

  2. run the bcftools consensus command with an exclude statement to ignore any sites which are marked as N in the preconsensus. This will exclude any variants from the final consensus at sites where low quality variants are also found / low coverage regions.

I think in light of the earlier conversation, we want option 2. So I'll add this to the next release PR.