ShawHahnLab / chiimp

Computational, High-throughput Individual Identification through Microsatellite Profiling
GNU Affero General Public License v3.0

Parsing ambiguity in primers & complex STR motifs? #103

Open jmmccar opened 5 months ago

jmmccar commented 5 months ago

Hello,

I'm wondering if anyone can assist with some issues I'm having re: locus assignment.

Some of my forward primers include ambiguous nucleotides (K, W, R, etc.) and some of the expected STR motifs are combinations of nucleotide strings (e.g. AC + TG). I have not been able to get CHIIMP to successfully assign sequences to any locus that features either or both of these characteristics - are these not features that CHIIMP can parse, or am I maybe just passing the information in the incorrect format in my locus attribute file?

I appreciate any assistance!

ressy commented 5 months ago

Hi Jenn,

I'll answer the second part first since it'll be simpler: compound STR motifs aren't currently supported by CHIIMP, but you can probably work around this pretty easily. All the code currently does with the motif information is filter candidate sequences by requiring several perfect tandem repeats, so if your target amplicons should always have a perfect region of one or the other repeating motif, just use that one for the motif column. (For example, if D12S372 amplicons should always have at least AGATAGATAGAT somewhere in the sequence, you could just keep AGAT for the motif and keep the default setting of 3 for the minimum number of repeats.) You can also adjust that nrepeats number in the configuration, or, if the sequences are just too complex to reliably capture with that kind of match, you can even leave the motif column blank and it will filter only by primer match and length range. That should be fine so long as you don't expect your primers to amplify off-target sequences.
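To make the motif filter described above concrete, here is a minimal Python sketch of that kind of check. It is not CHIIMP's actual code (CHIIMP is an R package); the function name `has_tandem_repeats` and the example reads are hypothetical, and the only behavior assumed is what the comment describes: require a run of perfect tandem repeats, with a blank motif disabling the check.

```python
def has_tandem_repeats(seq: str, motif: str, nrepeats: int = 3) -> bool:
    """Return True if seq contains nrepeats perfect back-to-back copies of motif.

    Hypothetical helper for illustration only; not part of CHIIMP.
    """
    if not motif:
        # A blank motif means "do not filter on motif at all".
        return True
    return motif.upper() * nrepeats in seq.upper()


# Made-up reads: the first has AGAT x3, the second only AGAT x2.
reads = ["TTAGATAGATAGATCC", "TTAGATAGATCC"]
kept = [r for r in reads if has_tandem_repeats(r, "AGAT", nrepeats=3)]
print(kept)
```

For a compound repeat such as AC + TG, the same check works with whichever single motif is expected to form a perfect run in every true allele.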

Support for ambiguity codes in primer sequences has been an intended feature for a while now, and while I already have a new feature for it in the latest development version of the code, I think there are still some bugs that are preventing people from using it. For one thing, there's a filter that excludes sequences with any non-ACTG characters (originally intended to exclude those with low-quality N bases), but that directly conflicts with any ambiguity codes in the primer sequences. Until I fix that, you could try using the latest code with the new option to remove primer-matched regions from the reported allele sequences, though that means the reported allele lengths will likewise exclude those parts. Since so much of the code has changed since the last official version, I worry there may be other lingering bugs, and it really does need more testing before there's a new version.
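As a rough illustration of what matching a primer with ambiguity codes and then removing the primer-matched region involves, here is a short Python sketch. It is an assumption-laden example, not CHIIMP's implementation: the function names are hypothetical, only an exact match at the start of the read is considered, and no mismatches are tolerated.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer: str) -> re.Pattern:
    """Turn a primer with IUPAC codes into a regex, one character class per base."""
    return re.compile("".join(f"[{IUPAC[base]}]" for base in primer.upper()))


def starts_with_primer(read: str, primer: str) -> bool:
    """True if the read begins with a sequence compatible with the primer."""
    return primer_to_regex(primer).match(read.upper()) is not None


def trim_primer(read: str, primer: str):
    """Return the read with the leading primer match removed, or None if no match."""
    m = primer_to_regex(primer).match(read.upper())
    return read[m.end():] if m else None


print(starts_with_primer("CATTGGAC", "CAKT"))
print(trim_primer("CATTGGAC", "CAKT"))
```

Note that the trimmed read is shorter than the original by the primer length, which is why allele lengths reported after primer removal exclude the primer regions.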

I've been meaning for a long while now to do a full test with a simulated dataset to iron out any remaining bugs, but we're very busy with other projects so it's been hard to find the time. If you'd be comfortable sharing a dataset you're trying to analyze along with your config+metadata (you can email me directly at ancon@upenn.edu) I could try getting it working myself and report back, though unfortunately I don't know what the turnaround time on that would be like. I can also try to give you tips for configuring the latest dev CHIIMP code so you can try it yourself. You'd just have to install from the dev branch of the repository here rather than from the 0.4.1 version.

Hope that helps!

Jesse

jmmccar commented 5 months ago

Hi Jesse,

Thank you for such a quick response & so much helpful information! The workaround you provided for the compound STR motifs seems to be working well, so I should be taken care of on that front.

I'm trying to get these sequences processed asap, so I may first try the option to remove primer-matched regions and see how that goes. I may also try running the loci whose primers contain ambiguous nucleotides as split 'separate' loci (e.g. CAKT as both CATT and CAGT) & then combining the outputs. If neither of these things seems feasible, I will definitely give the dev version a go & reach out if I need any assistance with configuring the code!
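The split-locus idea can be sketched in a few lines of Python: expand each ambiguous primer into every unambiguous primer it stands for, then list each one as its own locus. This is only an illustration; `expand_primer` is a hypothetical helper and nothing here is taken from CHIIMP.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_primer(primer: str) -> list:
    """List every unambiguous sequence that an IUPAC-coded primer can represent."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in product(*choices)]


print(expand_primer("CAKT"))  # ['CAGT', 'CATT']
```

The number of variants is the product of the options at each ambiguous position, so a primer with several ambiguity codes expands into many split loci whose outputs would need to be combined afterwards.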

Again, thank you so much for the response - I really appreciate your time and assistance!

Jenn