epi2me-labs / wf-human-variation

Other
95 stars 42 forks source link

straglr outputs repeat units differ from the ones specified in the wf_str_repeats.bed #58

Closed ymcki closed 7 months ago

ymcki commented 1 year ago

Operating System

Ubuntu 22.04

Other Linux

No response

Workflow Version

1.6.1

Workflow Execution

Command line

EPI2ME Version

No response

CLI command run

No response

Workflow Execution - CLI Execution Profile

None

What happened?

Since it is not possible to submit issues at the straglr github, so I submit it here.

I ran the pipeline with the latest ONT HG002 data I downloaded from https://labs.epi2me.io/giab-2023.05/

I noticed that some repeat units in the vcf output from straglr is very different from the ones specified in column four of wf_str_repeats.bed, This messes up the reference count and as a result the calls. Is this a bug or a feature of straglr? Are there any way to force straglr to stick to the repeat unit specified in the bed file?

Relevant log output

VCF output from the calls that have repeat units very different from specified

chr3    129172696       .       C       <STR8>,<STR4>   .       PASS    SVTYPE=STR;END=129172732;REF=9;RL=36;RU=CAGA;REPID=CNBP;VARID=CNBP_CA       GT:SO:CN:CI:AD_SP:AD_FL:AD_IR   1/2:SPANNING/SPANNING:8/4:7-9/3-5:22/5:0/0:0/0
chr3    63912714        .       G       <STR5>,<STR2>   .       PASS    SVTYPE=STR;END=63912726;REF=1;RL=12;RU=GCCACCGCC;REPID=ATXN7;VARID=ATXN7_GCC        GT:SO:CN:CI:AD_SP:AD_FL:AD_IR   1/2:SPANNING/SPANNING:5/2:5-5/2-2:16/21:0/0:0/0
chr9    69037261        .       A       <STR4>  .       PASS    SVTYPE=STR;END=69037286;REF=2;RL=25;RU=AAAAAGAAG;REPID=FXN;VARID=FXN_A      GT:SO:CN:CI:AD_SP:AD_FL:AD_IR   1/1:SPANNING/SPANNING:4/4:4-4/4-4:17.0/17.0:0/0:0/0
chr12   50505001        .       G       <STR8>,<STR4>   .       PASS    SVTYPE=STR;END=50505022;REF=3;RL=21;RU=GCCGGA;REPID=DIP2B;VARID=DIP2B       GT:SO:CN:CI:AD_SP:AD_FL:AD_IR   1/2:SPANNING/SPANNING:8/4:8-8/3-4:15/23:0/0:0/0
chr15   22786677        .       C       <STR6>,<STR4>   .       PASS    SVTYPE=STR;END=22786701;REF=4;RL=24;RU=CGCGGG;REPID=NIPA1;VARID=NIPA1       GT:SO:CN:CI:AD_SP:AD_FL:AD_IR   1/2:SPANNING/SPANNING:6/4:6-6/3-5:23/6:0/0:0/0
chr21   43776443        .       G       <STR4>  .       PASS    SVTYPE=STR;END=43776479;REF=5;RL=36;RU=GGGGCGC;REPID=CSTB;VARID=CSTB        GT:SO:CN:CI:AD_SP:AD_FL:AD_IR   1/1:SPANNING/SPANNING:4/4:4-4/4-4:13.5/13.5:0/0:0/0

Application activity log entry

No response

philres commented 1 year ago

Hi and thank you for reporting this.

Would it be possible for you to share the reads (as a BAM file ideally) that overlap with those incorrectly called STRs to help us debug the issue?

Thank you, Philipp

ymcki commented 1 year ago

Thanks for your reply. I have the haplotagged.bam but it is very big. What command can I use to extract the info you want?

RenzoTale88 commented 1 year ago

@ymcki starting from a bed file with the STR intervals (you can find it here) you can extract the reads in the regions using samtools:

samtools view -hb --target-file wf_str_repeats.bed input.bam > output.bam
ymcki commented 1 year ago

I did that and extracted a bam that has a size of 34,669,130. Where do I upload it?

philres commented 1 year ago

Hi, just wanted to confirm that I was able to reproduce the error you reported. It seems to affect regions where two tandem repeats are in close proximity. In that case straglr sometimes picks up the repeat right upstream of the target repeat thus reporting a wrong repeat unit. We are working on a fix for this and will get back to you as soon as I have an update.

Thanks again for reporting this.

Philipp

ymcki commented 1 year ago

Great! Looking forward to your fix.

vlshesketh commented 7 months ago

Hi @ymcki - we are continuing to investigate this issue but the fix for this is not trivial. In the meantime, we are actively looking into a replacement STR genotyping tool which should avoid this type of error, and so please keep your eye on the repository for future updates on this front.

vlshesketh commented 7 months ago

Hi @ymcki we will close this ticket for now, but have noted this as an issue with Straglr, and are looking to resolve it with the incorporation of a new tool for STR genotyping in the workflow.