PacificBiosciences / pbbioconda

PacBio Secondary Analysis Tools on Bioconda. Contains list of PacBio packages available via conda.
BSD 3-Clause Clear License

Default PBSV output incompatible with Hiphase due to IUPAC #640

Closed mrvollger closed 7 months ago

mrvollger commented 7 months ago

Operating system: redhat

Package name: hiphase=v1.1.0, pbsv=2.9.0

Describe the bug: This isn't a bug per se, but I think this default behavior will be a common pitfall that may lead to errors.

PBSV converts ambiguous IUPAC characters to Ns by default (see https://github.com/PacificBiosciences/pbsv?tab=readme-ov-file#why-does-the-vcf-contain-no-ambiguous-iupac-ref-codes). However, hiphase checks for exact sequence matches to the reference, so this occasionally results in errors.
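To make the interaction concrete, here is a minimal Python sketch of the two behaviors described above. It is a toy illustration only, not pbsv's or hiphase's actual code: the function name `mask_ambiguous` and the short example sequence are assumptions made for the example.

```python
# Toy illustration: masking IUPAC ambiguity codes to N breaks an exact
# string comparison against an unmasked reference.
# `mask_ambiguous` is a hypothetical helper, not part of pbsv or hiphase.

IUPAC_AMBIGUOUS = set("RYSWKMBDHV")  # IUPAC nucleotide codes other than A/C/G/T/N


def mask_ambiguous(seq: str) -> str:
    """Replace IUPAC ambiguity codes with N, leaving A/C/G/T/N untouched."""
    return "".join("N" if base.upper() in IUPAC_AMBIGUOUS else base for base in seq)


# Short reference slice containing the ambiguity code R.
reference_slice = "GGTGRATGTT"

# What a caller that masks ambiguity codes would write as the REF allele.
vcf_ref_allele = mask_ambiguous(reference_slice)

print(vcf_ref_allele)                     # GGTGNATGTT
print(vcf_ref_allele == reference_slice)  # False: an exact-match check fails
```

Any consumer that compares the REF allele to the reference byte for byte will reject such a record, even though the two strings differ only at a masked position.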

Error message, e.g.:

[2024-01-25T19:18:04.791Z ERROR hiphase] Error while processing PhaseBlock { block_index: 1, coordinates: "chr10:131584412-131592429", num_variants: 2, sample_name: "PS00389" }:
[2024-01-25T19:18:04.791Z ERROR hiphase]   Reference mismatch error: variant at chr10:131592430 has REF allele = "CAATCTCGGCTCACTGCAACCTCCGTCTCCCAGGTTCAAGTGATTCTCTTGCTTAACCCTCCCGAGTAGCTGGGATTACAGGCACCCACAAGAACACCCAGCTGATTTTTGTATTTTTAGCAGAGACAGGGTTTCACTGTGTTGGCCAGGCTGGTCTCGAACTCCTGACCTTGTGATCTGCCTGCCTTGGCCTCCCAAAGTACTGGGATTAATTATTTTTCCTTTTTAAGGTTAAATAATATTCCATTTTGTGGATATGCCACATTTTGTTTATCCATTCATCTGTCAACAGACACTTGGGTTGCTTCCATCTTTTGACTATTGTGAATAATGCTGTTGTGGACATGGGTGTAGAAACATCTCTTTGAGGCTCTGCTTTTAATTCTTTGAGGTATATACCCAGAGGTGTAATTGCTGGATCATGTGAAATCTGAGAAACCACCATATTGTTTCTATAGTTGTGTAGTATCTCACTGTGGTTTTGATTTGCATTTTCCTAATTATTCATGTTGTTGAGCATCTTTTCATGTACTTATTGGTCATTTGTATATCATTGGAGAAATATATATTCAAGTCCTTTGTCTATTTTTTAATTGTGTTGTTTTTTGGTTGTTGAATTGCAAGAGTTCTTTATATATGGATAGTAATCCGTTATCAGATATATAATTTACAAATATTTCCTGCCATTCAGTGTGTTGCCTTTTACTCTGTTGACAGTGTCATTTGATTCACAAAAATTTTTAATATTTACATGTTCCAATTATCTGATTTTTTTGTTGCCTATGCTTTCGGTGTCGTAGCCAAGAAATCCTTGCCAAATGCAATGCCATGAAGCTGTGCCCCTACATTTTCTTGTGAGTATTCTAACTCTCATATCTAAGTCTTTGACTATTTTTAATTTCTGCATATGGTGTAAGGTAAGGGTACAACTTCATTCTTTTGCATGTGGCTATCCAGTTTTCCCAGTAACATTTGTTGAAAAGACTGTCCTTTTCCCTATTGGATAGTCCTAGCAACTTTTTAAAAAATCACAAGGCCATATATACAAGAGTTTATTTCTGGGCTCTCTATTCTATCTCACTGATCTATGTGTCTGTCTATACGTCAATACCACTCTGTTTTTAATACTGTAGATTTTTAGAAATTTTGAAACTAAGAAGTGTGAGACCTCCAACTGTGTTCTTTTTCAAGATTGTTTTTGCTATTTAGGGTCCCTTGAGATTCTATATGAATGTTAGGATAGATTTTTCTAGTTTTGTAAAAAAAAATTGATGTTGGAATTTTAAGATAAATTGCATTTAATCTAGAGACCACATCTTTCAATTTTAGGTCTTCTCATCTATGAACAAAGGATGTCTATTTTTGTAGTGTCTTTAATTTCTTTGAGCAATATTTCATAGTTTTCAGTGTACACATCTTTCACCTCCTTGGTTCAGTTTGTTTCTATTTTTTATTTTGTTTGGTCCCACTTTAAATGAAATTGCTTTCTTAATTTCTTTTTCAGGTTGTTCATTGTTATTGTATAGAAACACAGCTAATTTCTGTATGCTGAGTATTCTGTAAGTTTGCTAATTTTGTTATTAGTTCTATCATGTTTCTTATGGAATCTTTGGGGTTTTCTACATATGAAATTACATCATCTATGAAAGGGATCGTTTTACTTTTTATTTCCCAATTTTAATGCTTTTTATTTCCTAATTTATCTGGTCAAGATTTCCATTACTATGCTGAATTTAAAAGTAGGCATTCTTCCCTTGTGTCTTAGCTTAGAAGAAAAGTTTTCAATCTTTCATCATTAAGTATGATGTTAGCAATGGGCTTTCCATATATGGCCTTAATTATGTTGAGGTAGTTTCCTTCTGTTCCTAGTTTGGTGNATGTTTTTT
ATCATGGAAAGGTGTTGGATTTTGTCAAATATTTTTCTCCATCAATTGAGATGATCACATGGGAACTGTTTCTTCATTCTGTTAATGTAGTTATTACATTAATTCATTTTCATATGTTGAACTATCCTTGAATTTCAGAAATAAATCCCACGAGGTCATGTGTATAATTTTTTTGATGTGTCACTTAATTCTGTTCACTAATATTTGGTTGAGGATTTTTACATCAGTATTTATCAGAGATATTGATCTGTAGCTTAATTTTATTGTAGTACCTTTGTCTTGCTTTGGTGAAAGAGTAATCTTGGCCTTGAAGAATAAGTTTGAAAGTGTCCCCTTACCTTAAACTTTTTTGGAAACTTTTGAGAAGGATTAGTGTTAACTCTTCTTTAAATGTTTGGTAGAATTCACGAATGAAGCCATCAGCTCCTGGGATTTTCTTTGTTGGCAGATTTTGGATCATTGATTCAATCTCTTTGCTAGTTATATGTCTGTTCGTATTTTCTATTTCTTTGTGGNTTAGTCTTGGTAGGTGGTATATGTCTAGGAATTTATCCATTTTGTCTAGGTTGTCCAATTTTTTGGCATACAAATATTCATACTATTGTCTTATTAATATAATCATTTTATTTCTGTTAAATCAGTGGTAATGTCTGCACTTACATTTCTGATTTTAGTTATTGAGACTTCCCTCTTTTATCTTACTCAGTCGAACTAATTGTTCATTAATTTTGGTGATTTTTTCAAAGAACTGAACTTGGTTTTGCTAACTTACTCTACCATGTTCCTATTCTTTATTTCAGTTGTCTGTACTCTAGTCTTTATTATTTCTTTCCTTCTACTGGATTTGGGTTTAGTGTGTTCTCCCTTTTTCTACTTCTTTAAGGTATAATGTTAGATTGTTAATTTAAGATCTTTCTTCTTGTTTATCATAAGCATTTACACTATAAACTACCCTCCTAGCACAGATTTTGATGCATCTGGTAAGTTTTGGTATGTTTACTGTAGCCCTGCAATATAGTTTGAAGTCAGGTAATGTGATGCCTCCAGCTGTGTTCTTTTTGCTTAGGGTTGCCTTGGCCATTCGGGCTCTTTTTTGGTTCCATATGAATTTTAAAATAGTTTTTTCTAGTTCTGTGAAGAATGTCATTGGTAGCTTAATAAAAATAGCATTGAATCTGTACACTGCTTTGGGCAGTATGGTCATTTTAATAAGATTGATTCTTCCTATCTGTGAGCATGAGATTTTTAAAAATTTGTTTTTGTCTTACCTGATTTCTTTCAGCAGTGCTTTGTAATTCTCACTGCAGAGATCTTTCACCTCCCTGGTTAGCTGTATTCCTAGATATTTTNTCATTTTTGCAGCAATTGTGAATGAGATTGCCTTCCTGATTTGTTTCTCGGCTTGGTTTCTTCTTGTTGTTTGTGTACAGGAATGCTGGTGATTTTTCTACATTGATTTTGTATCCTGAAACTTTGCTGAAGTTGTTTATCAGCTGAAGGAGCTTTTGGGTCNAGACTATGGGTTTTTCTAGATATAGAATCATGTCATCTGCAAATAGGGATAGTCTGATATCCTCTCTTCCTATTTGGATATGCTTTATTTCTTTATTTTGCCTGATTGCTCTGGCTAAGACTTCCAATAATACTTGAATAGGATTGGTGAAAGAAGGCATTCTTGTCACGTGTTGGTTTTCAAAAGGAATTCTTCCAGCTTTTGCCCATTTAGTATGATGTTGCCTGTTAGTTTGTCACATATGGCTCTTATTATTTTGAGTTGTGTTCCAAAACATCATGGTGCTGGTACAAAAACAGGCACATAGACCAATGNAACAGATAGAGAGCCTAGAAATAAGACTGCACACTTACAACCATCTGATCTTCAACAAAGCTGACAAAAACAAGCAATGGGGAAAAGACTCCCTATTCAATAAATGGTACTTGGATAAGTGGCTAGCCATATGCAGAAGATTGAAGGTAGACCCCTTCCTTGCACCATATAC
CAAAATCAACTCAAGATGGATTAAAGACTTACATATAAAACCCAAAACTATAAAAAACCCTGGGAGACAACCTAGGCAATATTATCCTGTACATAGGAATGGGCAAAGATTTCATGACAAAGCAATCACAAAAGCAATCACAACAAAAACAAAAATTGACAAATAAGATCTAATTAAACTTAAGAGCTTCTGCACAGCAAAAGAAACTATCAGCAGAGTAAACAGACAACCTACAGGATGGCAGAAAATATTTGCATATTATGCATCTGACAAAGGTCTAATATCCAGCATCTATAAGAAACTTAAACAAGTTTATAAGCAAAAAACAAACAACCCCATTAAAAAGGGGGCAAAGGACATGAACACTTCTCAAAAGAAGACATACGTGCAACCAACAAGCATATGAAGAAAAGCTCAATATCACTGATCATTAGAGAAATGCAAATAAAAACCACAACGAGATACTGTCTCACAACAATCAGAATAGCATTATTAAAAATTCAAAAAAATAACAGATACTGGTGAGGTTGTGGAGAAAAGGGACCACTTATACACTGTTGATGAAAGTGTAAGTTAGTTCAACCATTGTGGAAAGCAGTATGGCGATTCTTCAAAGAAAGAGCTAAAAACAGAATTACCATTCAACTCAGGAATCCCATTACTGGGTATATGCCCAGAGGAATATAAATCATTCCACCATAAAGACACATGCACANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN", but reference genome has "CAATCTCGGCTCACTGCAACCTCCGTCTCCCAGGTTCAAGTGATTCTCTTGCTTAACCCTCCCGAGTAGCTGGGATTACAGGCACCCACAAGAACACCCAGCTGATTTTTGTATTTTTAGCAGAGACAGGGTTTCACTGTGTTGGCCAGGCTGGTCTCGAACTCCTGACCTTGTGATCTGCCTGCCTTGGCCTCCCAAAGTACTGGGATTAATTATTTTTCCTTTTTAAGGTTAAATAATATTCCATTTTGTGGATATGCCACATTTTGTTTATCCATTCATCTGTCAACAGACACTTGGGTTGCTTCCATCTTTTGACTATTGTGAATAATGCTGTTGTGGACATGGGTGTAGAAACATCTCTTTGAGGCTCTGCTTTTAATTCTTTGAGGTATATACCCAGAGGTGTAATTGCTGGATCATGTGAAATCTGAGAAACCACCATATTGTTTCTATAGTTGTGTAGTATCTCACTGTGGTTTTGATTTGCATTTTCCTAATTATTCATGTTGTTGAGCATCTTTTCATGTACTTATTGGTCATTTGTATATCATTGGAGAAATATATATTCAAGTCCTTTGTCTATTTTTTAATTGTGTTGTTTTTTGGTTGTTGAATTGCAAGAGTTCTTTATATATGGATAGTAATCCGTTATCAGATATATAATTTACAAATATTTCCTGCCATTCAGTGTGTTGCCTTTTACTCTGTTGACAGTGTCATTTGATTCACAAAAATTTTTAATATTTACATGTTCCAATTATCTGATTTTTTTGTTGCCTATGCTTTCGGTGTCGTAGCCAAGAAATCCTTGCCAAATGCAATGCCATGAAGCTGTGCCCCTACATTTTCTTGTGAGTATTCTAACTCTCATATCTAAGTCTTTGACTATTTTTAATTTCTGCATATGGTGTAAGGTAAGGGTACAACTTCATTCTTTTGCATGTGGCTATCCAGTTTTCCCAGTAACATTTGTTGAAAAGACTGTCCTTTTCCCTATTGGATAGTCCTAGCAACTTTTTAAAAAATCACAAGGCCATATATACAAGAGTTTATTTCTGGGCTCTCTATTCTATCTCACTGATCTATGTGTCTGTCTATACGTCAATACCACTCTGTTTTTAATACTGTAGATTTTTAGAAATTTTGAAACTAA
GAAGTGTGAGACCTCCAACTGTGTTCTTTTTCAAGATTGTTTTTGCTATTTAGGGTCCCTTGAGATTCTATATGAATGTTAGGATAGATTTTTCTAGTTTTGTAAAAAAAAATTGATGTTGGAATTTTAAGATAAATTGCATTTAATCTAGAGACCACATCTTTCAATTTTAGGTCTTCTCATCTATGAACAAAGGATGTCTATTTTTGTAGTGTCTTTAATTTCTTTGAGCAATATTTCATAGTTTTCAGTGTACACATCTTTCACCTCCTTGGTTCAGTTTGTTTCTATTTTTTATTTTGTTTGGTCCCACTTTAAATGAAATTGCTTTCTTAATTTCTTTTTCAGGTTGTTCATTGTTATTGTATAGAAACACAGCTAATTTCTGTATGCTGAGTATTCTGTAAGTTTGCTAATTTTGTTATTAGTTCTATCATGTTTCTTATGGAATCTTTGGGGTTTTCTACATATGAAATTACATCATCTATGAAAGGGATCGTTTTACTTTTTATTTCCCAATTTTAATGCTTTTTATTTCCTAATTTATCTGGTCAAGATTTCCATTACTATGCTGAATTTAAAAGTAGGCATTCTTCCCTTGTGTCTTAGCTTAGAAGAAAAGTTTTCAATCTTTCATCATTAAGTATGATGTTAGCAATGGGCTTTCCATATATGGCCTTAATTATGTTGAGGTAGTTTCCTTCTGTTCCTAGTTTGGTGRATGTTTTTTATCATGGAAAGGTGTTGGATTTTGTCAAATATTTTTCTCCATCAATTGAGATGATCACATGGGAACTGTTTCTTCATTCTGTTAATGTAGTTATTACATTAATTCATTTTCATATGTTGAACTATCCTTGAATTTCAGAAATAAATCCCACGAGGTCATGTGTATAATTTTTTTGATGTGTCACTTAATTCTGTTCACTAATATTTGGTTGAGGATTTTTACATCAGTATTTATCAGAGATATTGATCTGTAGCTTAATTTTATTGTAGTACCTTTGTCTTGCTTTGGTGAAAGAGTAATCTTGGCCTTGAAGAATAAGTTTGAAAGTGTCCCCTTACCTTAAACTTTTTTGGAAACTTTTGAGAAGGATTAGTGTTAACTCTTCTTTAAATGTTTGGTAGAATTCACGAATGAAGCCATCAGCTCCTGGGATTTTCTTTGTTGGCAGATTTTGGATCATTGATTCAATCTCTTTGCTAGTTATATGTCTGTTCGTATTTTCTATTTCTTTGTGGKTTAGTCTTGGTAGGTGGTATATGTCTAGGAATTTATCCATTTTGTCTAGGTTGTCCAATTTTTTGGCATACAAATATTCATACTATTGTCTTATTAATATAATCATTTTATTTCTGTTAAATCAGTGGTAATGTCTGCACTTACATTTCTGATTTTAGTTATTGAGACTTCCCTCTTTTATCTTACTCAGTCGAACTAATTGTTCATTAATTTTGGTGATTTTTTCAAAGAACTGAACTTGGTTTTGCTAACTTACTCTACCATGTTCCTATTCTTTATTTCAGTTGTCTGTACTCTAGTCTTTATTATTTCTTTCCTTCTACTGGATTTGGGTTTAGTGTGTTCTCCCTTTTTCTACTTCTTTAAGGTATAATGTTAGATTGTTAATTTAAGATCTTTCTTCTTGTTTATCATAAGCATTTACACTATAAACTACCCTCCTAGCACAGATTTTGATGCATCTGGTAAGTTTTGGTATGTTTACTGTAGCCCTGCAATATAGTTTGAAGTCAGGTAATGTGATGCCTCCAGCTGTGTTCTTTTTGCTTAGGGTTGCCTTGGCCATTCGGGCTCTTTTTTGGTTCCATATGAATTTTAAAATAGTTTTTTCTAGTTCTGTGAAGAATGTCATTGGTAGCTTAATAAAAATAGCATTGAATCTGTACACTGCTTTGGGCAGTATGGTCATTTTAATAAGATTGATTCTTCCTATCTGTGAGCATGAGATTTTTAAAAATTTGTTTTTGTCTTACCT
GATTTCTTTCAGCAGTGCTTTGTAATTCTCACTGCAGAGATCTTTCACCTCCCTGGTTAGCTGTATTCCTAGATATTTTWTCATTTTTGCAGCAATTGTGAATGAGATTGCCTTCCTGATTTGTTTCTCGGCTTGGTTTCTTCTTGTTGTTTGTGTACAGGAATGCTGGTGATTTTTCTACATTGATTTTGTATCCTGAAACTTTGCTGAAGTTGTTTATCAGCTGAAGGAGCTTTTGGGTCRAGACTATGGGTTTTTCTAGATATAGAATCATGTCATCTGCAAATAGGGATAGTCTGATATCCTCTCTTCCTATTTGGATATGCTTTATTTCTTTATTTTGCCTGATTGCTCTGGCTAAGACTTCCAATAATACTTGAATAGGATTGGTGAAAGAAGGCATTCTTGTCACGTGTTGGTTTTCAAAAGGAATTCTTCCAGCTTTTGCCCATTTAGTATGATGTTGCCTGTTAGTTTGTCACATATGGCTCTTATTATTTTGAGTTGTGTTCCAAAACATCATGGTGCTGGTACAAAAACAGGCACATAGACCAATGSAACAGATAGAGAGCCTAGAAATAAGACTGCACACTTACAACCATCTGATCTTCAACAAAGCTGACAAAAACAAGCAATGGGGAAAAGACTCCCTATTCAATAAATGGTACTTGGATAAGTGGCTAGCCATATGCAGAAGATTGAAGGTAGACCCCTTCCTTGCACCATATACCAAAATCAACTCAAGATGGATTAAAGACTTACATATAAAACCCAAAACTATAAAAAACCCTGGGAGACAACCTAGGCAATATTATCCTGTACATAGGAATGGGCAAAGATTTCATGACAAAGCAATCACAAAAGCAATCACAACAAAAACAAAAATTGACAAATAAGATCTAATTAAACTTAAGAGCTTCTGCACAGCAAAAGAAACTATCAGCAGAGTAAACAGACAACCTACAGGATGGCAGAAAATATTTGCATATTATGCATCTGACAAAGGTCTAATATCCAGCATCTATAAGAAACTTAAACAAGTTTATAAGCAAAAAACAAACAACCCCATTAAAAAGGGGGCAAAGGACATGAACACTTCTCAAAAGAAGACATACGTGCAACCAACAAGCATATGAAGAAAAGCTCAATATCACTGATCATTAGAGAAATGCAAATAAAAACCACAACGAGATACTGTCTCACAACAATCAGAATAGCATTATTAAAAATTCAAAAAAATAACAGATACTGGTGAGGTTGTGGAGAAAAGGGACCACTTATACACTGTTGATGAAAGTGTAAGTTAGTTCAACCATTGTGGAAAGCAGTATGGCGATTCTTCAAAGAAAGAGCTAAAAACAGAATTACCATTCAACTCAGGAATCCCATTACTGGGTATATGCCCAGAGGAATATAAATCATTCCACCATAAAGACACATGCACANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN".

This causes a mismatch between an IUPAC code in the reference and an N in the VCF (e.g. W != N), which triggers the error in hiphase.
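For anyone hitting a similar message, the following Python sketch shows one way to locate the differing positions between the two quoted alleles and to check whether they are all explained by N masking. The function names are hypothetical and this is not how hiphase performs its check; it is only a diagnostic aid under those assumptions.

```python
# Diagnostic sketch: list positions where the VCF REF allele and the
# reference genome disagree, and test whether every disagreement is an
# N in the VCF over an IUPAC ambiguity code in the genome.
# Function names are hypothetical, not taken from hiphase or pbsv.

IUPAC_AMBIGUOUS = set("RYSWKMBDHV")


def ref_mismatches(vcf_ref: str, genome_ref: str):
    """Return (offset, vcf_base, genome_base) for each differing position."""
    if len(vcf_ref) != len(genome_ref):
        raise ValueError("alleles differ in length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(vcf_ref, genome_ref))
        if a != b
    ]


def explained_by_masking(vcf_ref: str, genome_ref: str) -> bool:
    """True if every mismatch is N (VCF) versus an ambiguity code (genome).

    Identical sequences have no mismatches, so they also return True.
    """
    return all(
        a.upper() == "N" and b.upper() in IUPAC_AMBIGUOUS
        for _, a, b in ref_mismatches(vcf_ref, genome_ref)
    )


# Short excerpt around one of the differing bases in the message above.
vcf_excerpt = "TTTTNTCATT"
genome_excerpt = "TTTTWTCATT"

print(ref_mismatches(vcf_excerpt, genome_excerpt))        # [(4, 'N', 'W')]
print(explained_by_masking(vcf_excerpt, genome_excerpt))  # True
```

Pasting the two full quoted strings from the error message into `ref_mismatches` (after rejoining any wrapped lines) gives the offsets to inspect in the reference FASTA.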

To Reproduce: I have uploaded some small files; the reference is: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz

hiphase --bam error.hifi.bam  --vcf error.pbsv.vcf.gz --output-vcf out.vcf  --reference ~/assemblies/simple-names/hg38.fa --ignore-read-groups

Expected behavior: Ideally, the default PBSV output should work with hiphase. If not, I would be genuinely curious to hear the reasoning for this design.

Archive.zip

mrvollger commented 7 months ago

Adding @holtjma in case he doesn't see these by default.

And thanks to all in advance!

mrvollger commented 7 months ago

including @sjneph and @adrisede

holtjma commented 7 months ago

This sounds like an edge case that wasn't encountered in our initial testing. I'll take a look and verify I can reproduce the issue locally.

mrvollger commented 7 months ago

Thanks so much! It should only take a couple of seconds; I tried to make the files as small as possible while still triggering the error. Let me know if it's not working for you.

holtjma commented 7 months ago

I managed to reproduce it locally and will hopefully have a fix in the next few business days.

holtjma commented 7 months ago

@mrvollger Turns out the patch was relatively simple, so give v1.2.1 a try on your full file. I verified that the snippet you shared will run to completion now.

mrvollger commented 7 months ago

Awesome, thanks so much. I will share if I have any issues. Thanks again.