bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
50 stars 9 forks source link

Understanding output #2

Closed MeiShu00 closed 2 years ago

MeiShu00 commented 2 years ago

Hi, I've recently tried using straglr in genotype-only mode to detect STR from nanopore data.

  1. My data is minimap2 aligned
  2. STR position I'm looking at is in hg38 (chr11:5555444-5555483, motif: GAAG), these coordinates were used as input to the required BED file. Note that to its left there is STR of AGAG (chr11:5555369-5555442) but was not included in the BED file as it was not what I'm looking for.
  3. For the support reads in the tsv file, i extracted them from the bam file to visualize in the IGV
    • For this particular read in the output tsv file: chr11 5555444 5555483 GAAG 27.6(10);9.0(412) ba5240c0-cf30-40a9-8811-b4e442a235ac 31.2 125 163 27.6

and this was the information when viewed in IGV: image

Hence, i have the following questions: a. As the reference (hg38) has 10 repeat units, tsv report that there are 31.2 repeat units (insertion of 125bp). May i check what might have caused it to call a 31.2 repeat units ? Could it have included the AGAG motif (reference has 18.8 copies) to its left? Does that mean that Straglr takes the whole read for calculations and not the nucleotides that fall within the coordinates in the BED file? b. Under read_start in the tsv file, does it include the soft clipped reads as well?

Do let me know if any other information is needed. Thank you !

readmanchiu commented 2 years ago

Hi @MeiShu00

This is a tricky locus as the target STR is so close (probably overlapping) the adjacent STR and their motifs are so similar. But I do believe the behaviour of Straglr should be just that only the sequence within the given coordinates is counted. From your example read, the repeat starts at read position 163 and the 125 bp following is the repeat. From the reference sizes, it does look like the neighbouring repeat is included. I can pick up a read from a public GIAB dataset to check Straglr's behaviour on this locus and get back to you.

Yes the soft-clipped part is also count for read_start, so 163 in your example means the 163 bp from the start, even though the start is soft-clipped.

readmanchiu commented 2 years ago

Hi @MeiShu00

I just checked this locus on the HG002 ONT sample This is the result of one support read:

chr11   5555444 5555483 GAAG    11.1(38)    efdaaabe-3d0d-42c6-8373-1ae00f0697aa    16.8    67  2980    11.1

Straglr gave a 67 bp size for the locus and the read_start of the repeat is 2980. I took the snippet around 2980 and web-blatted it. As you can see in the screen shot of the result below there is ~30 bp of GAAG inserted (the lowercase bases around 650). Add that to the 40 bp repeat in the reference gives ~70 bp, so 67 is not a bad sizing estimate for this read. So this says that Straglr did not include the neighbour AGAG when it reported the genotype. If your results suggest otherwise, I'm happy to check for you if you pass me some sequences to test.

AATAACAAA GACGATTGAC CTGCAGAGAT AGGAAAGGAT TCTTGGgAGG  50
AAGATATTGT AAATGAGGTT GATTTAACAA AAAaTATAAT AAAATAAATA  100
AtactTACTG AATTCTACCT GTAAGTCAAT TGAATTTAGC tCCTTCCTCC  150
CTCCCAGAGA GGTAGGTCCC CTTATTGTTT CCATAaTACA GACAAGAAAA  200
AGATGATACA CAGAAAGTAT ATCCAAATTT GTTCAGCTGT TATGGAAAAG  250
CCAAATTTCA ACAGGTAGTT TGACCTCAcA TCTTTTCGAC TCCACTTTAC  300
ACcttgcaag cTATTTTCAG TAGGCTAAAA CAtGAAAGGA AAGCATTTtt  350
atatAGAAAG AAAAGTGTGT AAAGTCCTTC AGGTAAGAAG GAGCTAGATG  400
TGGAAATAAG CTCCTGAAAG CAGCAGGTAC TTGGTGGTGG TGGTGGTGGT  450
tacagcagGT GTGcTTTAAT ATATGTGGCT CATATGCTAA AAAaTGAAAA  500
GAAAAAGAGA AAGAGAAAGA AgGAAAGAGA GAGAGAGAAA GAAAGAAAGA  550
GAAAGGGAGA GAGAGAGAAG GAGAGAGAAG AAGaaAAGGA AGGAAGGAAG  600
GAAGGAAGGA AGGAAGGAAG gaaggaagga aggaaggaag gaaggaaagt  650
TCTAATtaGA AAACATCaAA AATTATGAAA ATAGgAAACA TTTAGCAGGC  700
ACCTAGCCTC TTCAAGAGAC ATatTCATCT GGAAAGGTGG CATTCATCTG  750
GATGCATTGG TTATGCCATA AACCCTAAGT ATAACCAGCA TAAAAGGATG  800
CACCCTGTCC tAGTTTGCTA TTCTTATATT TAagTGCATT TGTCCAACCA  850
CCATTTGTTG AGGCAGctcc gctaGACAGT ATAGACGGgT TACGAGtGAA  900
AGAGCTgGGA GGCCAGACAC TCCTGGCTGC CATACACCAA CTCCATCAGT  950
TCTCTTTAGA TGAAATTGGG AAATTTATTT AACCACTCCA AGATTCTGTT  1000
TCCTCATGTG TATGAAGATC TAAagccacC ACAGGGTTTG AGGATTGATT  1050
MeiShu00 commented 2 years ago

Hi @readmanchiu

Thank you for your prompt reply ! Could i have your email for ease of sharing the data?

readmanchiu commented 2 years ago

No worries, my email: rchiu@bcgsc.ca