Closed MeiShu00 closed 2 years ago
Hi @MeiShu00
This is a tricky locus as the target STR is so close (probably overlapping) the adjacent STR and their motifs are so similar. But I do believe the behaviour of Straglr should be just that only the sequence within the given coordinates is counted. From your example read, the repeat starts at read position 163 and the 125 bp following is the repeat. From the reference sizes, it does look like the neighbouring repeat is included. I can pick up a read from a public GIAB dataset to check Straglr's behaviour on this locus and get back to you.
Yes the soft-clipped part is also count for read_start, so 163 in your example means the 163 bp from the start, even though the start is soft-clipped.
Hi @MeiShu00
I just checked this locus on the HG002 ONT sample This is the result of one support read:
chr11 5555444 5555483 GAAG 11.1(38) efdaaabe-3d0d-42c6-8373-1ae00f0697aa 16.8 67 2980 11.1
Straglr gave a 67 bp size for the locus and the read_start of the repeat is 2980. I took the snippet around 2980 and web-blatted it. As you can see in the screen shot of the result below there is ~30 bp of GAAG inserted (the lowercase bases around 650). Add that to the 40 bp repeat in the reference gives ~70 bp, so 67 is not a bad sizing estimate for this read. So this says that Straglr did not include the neighbour AGAG when it reported the genotype. If your results suggest otherwise, I'm happy to check for you if you pass me some sequences to test.
AATAACAAA GACGATTGAC CTGCAGAGAT AGGAAAGGAT TCTTGGgAGG 50
AAGATATTGT AAATGAGGTT GATTTAACAA AAAaTATAAT AAAATAAATA 100
AtactTACTG AATTCTACCT GTAAGTCAAT TGAATTTAGC tCCTTCCTCC 150
CTCCCAGAGA GGTAGGTCCC CTTATTGTTT CCATAaTACA GACAAGAAAA 200
AGATGATACA CAGAAAGTAT ATCCAAATTT GTTCAGCTGT TATGGAAAAG 250
CCAAATTTCA ACAGGTAGTT TGACCTCAcA TCTTTTCGAC TCCACTTTAC 300
ACcttgcaag cTATTTTCAG TAGGCTAAAA CAtGAAAGGA AAGCATTTtt 350
atatAGAAAG AAAAGTGTGT AAAGTCCTTC AGGTAAGAAG GAGCTAGATG 400
TGGAAATAAG CTCCTGAAAG CAGCAGGTAC TTGGTGGTGG TGGTGGTGGT 450
tacagcagGT GTGcTTTAAT ATATGTGGCT CATATGCTAA AAAaTGAAAA 500
GAAAAAGAGA AAGAGAAAGA AgGAAAGAGA GAGAGAGAAA GAAAGAAAGA 550
GAAAGGGAGA GAGAGAGAAG GAGAGAGAAG AAGaaAAGGA AGGAAGGAAG 600
GAAGGAAGGA AGGAAGGAAG gaaggaagga aggaaggaag gaaggaaagt 650
TCTAATtaGA AAACATCaAA AATTATGAAA ATAGgAAACA TTTAGCAGGC 700
ACCTAGCCTC TTCAAGAGAC ATatTCATCT GGAAAGGTGG CATTCATCTG 750
GATGCATTGG TTATGCCATA AACCCTAAGT ATAACCAGCA TAAAAGGATG 800
CACCCTGTCC tAGTTTGCTA TTCTTATATT TAagTGCATT TGTCCAACCA 850
CCATTTGTTG AGGCAGctcc gctaGACAGT ATAGACGGgT TACGAGtGAA 900
AGAGCTgGGA GGCCAGACAC TCCTGGCTGC CATACACCAA CTCCATCAGT 950
TCTCTTTAGA TGAAATTGGG AAATTTATTT AACCACTCCA AGATTCTGTT 1000
TCCTCATGTG TATGAAGATC TAAagccacC ACAGGGTTTG AGGATTGATT 1050
Hi @readmanchiu
Thank you for your prompt reply ! Could i have your email for ease of sharing the data?
No worries, my email: rchiu@bcgsc.ca
Hi, I've recently tried using straglr in genotype-only mode to detect STR from nanopore data.
and this was the information when viewed in IGV:![image](https://user-images.githubusercontent.com/69883139/131780430-e307bd9d-ff29-4b88-b9b6-8e2b2dd3f6cb.png)
Hence, i have the following questions: a. As the reference (hg38) has 10 repeat units, tsv report that there are 31.2 repeat units (insertion of 125bp). May i check what might have caused it to call a 31.2 repeat units ? Could it have included the AGAG motif (reference has 18.8 copies) to its left? Does that mean that Straglr takes the whole read for calculations and not the nucleotides that fall within the coordinates in the BED file? b. Under read_start in the tsv file, does it include the soft clipped reads as well?
Do let me know if any other information is needed. Thank you !