Ahhgust / STRaitRazor


a strange issue for calling locus D6S477 by STRaitRazor #4

Open yangjw1996 opened 1 year ago

yangjw1996 commented 1 year ago

Hi, I am using the flanking sequences of D6S477 as the 5' and 3' anchors to call this locus. The sample is 9948 (a standard reference sample in forensics). To improve tolerance of sequencing errors and SNP variants in the 5' and 3' anchors, I tried shortening the anchors, and the calling results became strange. With shorter anchors, I can't call allele 11, which I can call with longer anchors. This only happens with allele 11. The genotype of 9948 at locus D6S477 is 11/16, and I can call allele 16 with both the longer and the shorter anchors. I examined the original fastq file. I am pretty sure it contains adequate reads for both allele 11 and allele 16, and I can grep both the shorter and the longer anchors in these reads. So in theory, allele 11 should be callable with the shorter anchors, and I don't understand why it is missing.

Here is an example fastq (containing only allele 11 reads), followed by the anchors and parameters I used.

@V350093041L1C001R02000465510
CGCAGGGCTGATGAGGTGAAATATTTGCAAAACAATCTATCTATATCTATCTATATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTTATATGTGTTGTTGTTGAGGTTGTTTGAGATATCCCCCAGGAGAAACAGAAATATTTATGTTTCTTATGTTTCCCCTCTTTTGTCTTTGAAGTCCCAAACACCAATAAGGA
+
FFFFEFFF>FF@FBFFEF<EFFEFFEFFFF<FFFFFFFEFFFFFFFFFFFFFFFFFFFFFEFFFFFFFFFFFFGFFFFFFFFFFFGFFFGFFFFFFFFFFFFGFFFFFFFFEFFFFFF)FFFFFFFFFFFFFFFFFFBFFFFAF8FFFFFFF>FGFFGFFGFGFFGFGGFGFGFFFFGFGFFFGFGGGGFGFFGGFGFGFGGGGGFGGFGDG
@V350093041L1C002R04800214402
CGCAGGGCTGATGAGGTGAAATATTTGCAAAACAATCTATCTATATCTATCTATATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTTATATGTGTTGTTGTTGAGGTTGTTTGAGATATCCCCCAGGAGAAACAGAAATATTTATGTTTCTTATGTTTCCCCTCTTTTGTCTTTGAAGTCCCAAACACCAATAAGGA
+
FFFFFFFFFFDFF@FFEFFFFFFFFFFGFFFFFFFFGFFFFFFFFFFFFFFFFFGFFGFFFGFFFFFFFFDFFGFGGGGFFGFFFFGFFGFGGGFFFGFGGGFGGFGFFFGFFFGFFFCFFFFFFFGGFFFFGGGFFGGFFGCGFFFGFGFGFGFGFGFGFGFGGGFGGFFGGGGFGGGGGFGGFGGGGFBFFGGFGFFFGGGFGGFGGGFG
@V350093041L1C001R04800702540
CTCAGGGCTGATGAGGTGAAATATTTGCAAAACAATCTATCTATATCTATCTATATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCATCTTATATGTGTTGTTGTTGAGGTTGTTTGAGATATCCCCCAGGAGAAACAGAAATATTTATGTTTCTTATGTTTCCCCTCTTTTGTCTTTGAAGTCCCAAACACCAATAAGGA
+
FFFDD@?CFC>DBBD?DDDFFFFFFFEFFEEFFF@FDFEFFFEFDFFFFFFFFFFFEFDFFFFFFFFEFFFFFFEFFFFFEFEFFGFFFFFFFGFFFFGFFFFFFFFEEGGDFFEFFFCFEFFFFFFFBEFFFGFFFDFCF<FFFFEFFFFFF;FFFGGGFFGFFGFFFFFFFFFFFGFGGGGFGGGGFGA@FFGFEFEFGGGFGFGGGGFF

Marker Type 5'Anchor 3'Anchor Motif Period Offset
D6S477-1 AUTOSOMAL ATCTATATCTATCTA ATCTTATATGTG TATCTATC 4 0
D6S477-2 AUTOSOMAL AAAACAATCTATCTATATCTATCTA ATCTTATATGTGTTGTTGTTGAGGT TATCTATC 4 0
D6S477-3 AUTOSOMAL AATCTATCTATATCTATCTA ATCTTATATGTGTTGTTGTTGAGGT TATCTATC 4 0

STRaitRazor parameters: str8rzr -p 8 -a 1 -m 1

yangjw1996 commented 1 year ago

By the way, the core sequence and motif of D6S477 may be debatable. I am not sure if it was the wrong core or motif that caused the strange issue when calling D6S477. But I believe the core or motif is not the key part of this issue.

Here are the core sequence and motif of D6S477 as I assigned them. The core sequence is in [] and the motif is TATC.

CACAGGGCTGATGAGGTGAAATATTTGCAAAACAATCTATCTATATCTATCTA [TATCTATCTGTCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATCTATC] ATCTTATATGTGTTGTTGTTGAGGTTGTTTGAGATATCCCCCAGGAGAAACAGAAATATTTATGTTTCTTATGTTTCCCCTCTTTTGTCTTTGAAGTCCCAAACACCAATAAGGA

ExpectationsManaged commented 1 year ago

Apologies for the delayed response. For this locus, you should consider looking at the Sequence Structure Guide maintained by the ISFG working group on nomenclature.

https://strider.online/nomenclature
https://strider.online/bundles/strbaseclient/downloads/Forensic_STR_Sequence_Structure_Guide_v5.xlsx

According to the guide above, your 5' anchor includes part of the repeat region, which is not ideal. I used the following configuration to isolate the repeat region in the fastq data you sent, and both anchor substrings were found.

D6S477-4 AUTOSOMAL TTGCAAAACAA TCATCTTATATGTG TCTATCTATCTA 4 0

As to WHY the anchors were not matching as you had them, the first and third have a low edit distance to the repetitive element. When this happens, STRait Razor may find matches within the repeat itself or, worse still, find multiple matches. I find it best to try and keep the anchors dissimilar to the repeat region (or any other anticipated string within the read).

I hope this helps and let me know if you need anything else.
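The point about anchors matching inside the repeat can be checked directly against the posted reads. The following is a minimal Python sketch, not STRait Razor's actual matching code: the helper `anchor_hits` is hypothetical, and modeling mismatch tolerance as a plain Hamming distance is an assumption made for illustration. `READ_EXCERPT` is the 5' end of the allele 11 reads above (from TTGCAAAACA through the first few repeat units).

```python
# Illustrative sketch only: this is NOT STRait Razor's matching code.
# Assumption: anchor mismatch tolerance is modeled as Hamming distance.

def anchor_hits(read, anchor, max_mismatches=0):
    """Return 0-based start positions where `anchor` aligns to `read`
    with at most `max_mismatches` substitutions (overlaps allowed)."""
    hits = []
    for start in range(len(read) - len(anchor) + 1):
        window = read[start:start + len(anchor)]
        mismatches = sum(a != b for a, b in zip(anchor, window))
        if mismatches <= max_mismatches:
            hits.append(start)
    return hits

# 5' end of the posted allele 11 reads: flank, two ATCTATCTAT blocks,
# then the start of the repeat run (truncated here on purpose).
READ_EXCERPT = "TTGCAAAACA" + "ATCTATCTAT" * 2 + "ATCTATCTATCTATC"

# 5' anchors taken from the configuration lines in this thread.
ANCHORS_5P = {
    "D6S477-1": "ATCTATATCTATCTA",
    "D6S477-2": "AAAACAATCTATCTATATCTATCTA",
    "D6S477-3": "AATCTATCTATATCTATCTA",
    "D6S477-4": "TTGCAAAACAA",
}

for name, anchor in ANCHORS_5P.items():
    exact = anchor_hits(READ_EXCERPT, anchor, 0)
    fuzzy = anchor_hits(READ_EXCERPT, anchor, 1)
    print(name, "exact:", exact, "<=1 mismatch:", fuzzy)
```

In this excerpt, the D6S477-1 anchor matches exactly at two positions, and the D6S477-3 anchor matches once exactly but at a second position when one mismatch is allowed. The D6S477-2 and D6S477-4 anchors each match at a single position even with one mismatch. If the `-a 1` setting in the original command permits one anchor mismatch (an assumption; check the STRait Razor documentation), this would be consistent with the first and third configurations failing while the second works.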

ExpectationsManaged commented 1 year ago

@yangjw1996

image

yangjw1996 commented 1 year ago

Thank you very much! It seems I got the core sequence of this locus wrong and placed the 5' anchor where it overlaps the motif repeat region.
Thank you for your advice! It helps me a lot!