Closed ymcki closed 7 months ago
Hi and thank you for reporting this.
Would it be possible for you to share the reads (as a BAM file ideally) that overlap with those incorrectly called STRs to help us debug the issue?
Thank you, Philipp
Thanks for your reply. I have the haplotagged.bam but it is very big. What command can I use to extract the info you want?
@ymcki starting from a BED file with the STR intervals (you can find it here), you can extract the reads in those regions using samtools:
samtools view -hb --target-file wf_str_repeats.bed input.bam > output.bam
I did that and extracted a bam that has a size of 34,669,130. Where do I upload it?
Hi, just wanted to confirm that I was able to reproduce the error you reported. It seems to affect regions where two tandem repeats are in close proximity. In that case, straglr sometimes picks up the repeat immediately upstream of the target repeat and therefore reports the wrong repeat unit. We are working on a fix and will get back to you as soon as we have an update.
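As a rough way to see which targets might fall into this category, the sketch below lists neighbouring intervals in a BED file that lie within a chosen distance of each other on the same chromosome. It is only an illustration: the function names, the `max_gap` threshold of 50 bp, and the demo coordinates are assumptions made for this example, not values used by straglr or the workflow. It also assumes the repeat unit is in column four, as in wf_str_repeats.bed.

```python
from collections import defaultdict


def parse_bed(lines):
    """Yield (chrom, start, end, motif) tuples from BED-formatted lines.

    Column four is taken to be the repeat unit; it is left empty if absent.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        motif = fields[3] if len(fields) > 3 else ""
        yield fields[0], int(fields[1]), int(fields[2]), motif


def close_pairs(records, max_gap=50):
    """Return (upstream, downstream, gap) for neighbouring intervals on the
    same chromosome whose gap is at most max_gap bases.

    max_gap is an arbitrary, hypothetical threshold. Overlapping intervals
    produce a negative gap and are reported as well.
    """
    by_chrom = defaultdict(list)
    for rec in records:
        by_chrom[rec[0]].append(rec)

    pairs = []
    for recs in by_chrom.values():
        recs.sort(key=lambda r: (r[1], r[2]))
        # Only consecutive intervals (after sorting) are compared.
        for prev, cur in zip(recs, recs[1:]):
            gap = cur[1] - prev[2]  # BED is half-open, so this is the base count between them
            if gap <= max_gap:
                pairs.append((prev, cur, gap))
    return pairs


# Made-up demo intervals, not real STR loci.
DEMO_BED = [
    "chr1\t100\t130\tCAG",
    "chr1\t140\t170\tGCC",
    "chr1\t5000\t5030\tAT",
    "chr2\t100\t130\tCAG",
]

if __name__ == "__main__":
    for prev, cur, gap in close_pairs(parse_bed(DEMO_BED)):
        print(f"{prev[0]}:{prev[1]}-{prev[2]} ({prev[3]}) is {gap} bp upstream of "
              f"{cur[0]}:{cur[1]}-{cur[2]} ({cur[3]})")
```

To run it on a real target file, pass an open file handle to `parse_bed`, for example `close_pairs(parse_bed(open("wf_str_repeats.bed")))`. The pairs it reports are only candidates for a closer look; whether a given pair actually triggers the wrong-repeat-unit behaviour is not something this check can tell.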
Thanks again for reporting this.
Philipp
Great! Looking forward to your fix.
Hi @ymcki - we are continuing to investigate this issue, but the fix is not trivial. In the meantime, we are actively looking into a replacement STR genotyping tool which should avoid this type of error, so please keep an eye on the repository for future updates on this front.
Hi @ymcki we will close this ticket for now, but have noted this as an issue with Straglr, and are looking to resolve it with the incorporation of a new tool for STR genotyping in the workflow.
Operating System
Ubuntu 22.04
Other Linux
No response
Workflow Version
1.6.1
Workflow Execution
Command line
EPI2ME Version
No response
CLI command run
No response
Workflow Execution - CLI Execution Profile
None
What happened?
Since it is not possible to submit issues on the straglr GitHub repository, I am submitting this here.
I ran the pipeline with the latest ONT HG002 data I downloaded from https://labs.epi2me.io/giab-2023.05/
I noticed that some repeat units in the VCF output from straglr are very different from the ones specified in column four of wf_str_repeats.bed. This messes up the reference count and, as a result, the calls. Is this a bug or a feature of straglr? Is there any way to force straglr to stick to the repeat unit specified in the BED file?
Relevant log output
Application activity log entry
No response