JLSteenwyk / ClipKIT

a multiple sequence alignment-trimming algorithm for accurate phylogenomic inference
https://jlsteenwyk.com/ClipKIT/
MIT License

Trim CDS sequence #19

Closed jsharbrough closed 3 years ago

jsharbrough commented 3 years ago

I'd like to be able to keep track of CDS and protein sequences throughout the trimming process, and I don't want to trim CDS sequences in a way that would introduce frameshifts; in other words, trimming should happen at the codon level. Gblocks can do this, but I'd like to see if this method can do a better job than Gblocks. Essentially, this involves translating the CDS sequences to protein sequences, trimming the protein alignment (keeping track of which positions get trimmed), and then reverse-translating back to CDS sequences.
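A minimal sketch of that workflow, assuming the protein alignment has already been trimmed and the kept column indices are known. The function names and the dictionary-based input are hypothetical choices for illustration, not part of ClipKIT. Each aligned residue is mapped back to its codon (gaps become `---`), and only the codons for kept columns are joined, so the reading frame is preserved.

```python
def thread_cds(aligned_protein: str, cds: str) -> list[str]:
    """Return one codon (or '---' for a gap) per column of the aligned protein.

    Assumes `cds` is the unaligned, in-frame coding sequence that translates
    to the ungapped protein, with no trailing stop codon.
    """
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return codons


def trim_codon_alignment(protein_aln: dict, cds_seqs: dict, kept_columns: list) -> dict:
    """Keep only the codons whose amino acid columns (0-based) were kept."""
    trimmed = {}
    for name, aligned_protein in protein_aln.items():
        codons = thread_cds(aligned_protein, cds_seqs[name])
        trimmed[name] = "".join(codons[i] for i in kept_columns)
    return trimmed


# Toy example: columns 1 and 2 (0-based) are gappy and get trimmed.
protein_aln = {"taxonA": "MK-L", "taxonB": "M-TL"}
cds_seqs = {"taxonA": "ATGAAACTG", "taxonB": "ATGACCCTG"}
result = trim_codon_alignment(protein_aln, cds_seqs, kept_columns=[0, 3])
```

Because whole codons are kept or dropped together, the trimmed CDS length is always a multiple of three.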

Alternatively, is there a way to obtain the trimmed positions from an amino acid alignment? The -c option is close, but becomes ambiguous in gappy alignments. Better would be a list of aa positions that were trimmed. I can then take that list and reconcile the untrimmed CDS with the trimmed protein sequence alignment. Perhaps this function is already possible and I am just not reading the documentation thoroughly?

Thanks!

jsharbrough commented 3 years ago

Ah, looks like the -l flag provides the necessary information for me to reconcile the trimmed amino acid sequence with the CDS sequence. Nice addition! I wrote a little script to do it, in case anyone else needs this function:
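For anyone curious what the reconciliation step involves, here is a rough sketch of reading kept positions out of a log. It assumes each log line is whitespace-separated, with a 1-based alignment position in the first column and `keep` or `trim` in the second; that layout is an assumption here, so check it against a real log file before relying on it. The function name is hypothetical, and this is not the linked script.

```python
def kept_columns_from_log(log_text: str) -> list[int]:
    """Return 0-based indices of columns marked 'keep'.

    Assumed line layout (verify against an actual log):
    <1-based position> <keep|trim> [further columns ignored]
    """
    kept = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        position, status = fields[0], fields[1]
        if status == "keep":
            kept.append(int(position) - 1)
    return kept


# Illustrative log content, made up for this example.
example_log = "1 keep PI 0.0\n2 trim nPI 0.75\n3 keep Const 0.0\n"
kept = kept_columns_from_log(example_log)
```

Each kept amino acid index `i` corresponds to nucleotide columns `3*i` through `3*i + 2` in the matching codon alignment.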

https://github.com/jsharbrough/protTrim2CDS

JLSteenwyk commented 3 years ago

Hi @jsharbrough!

Thanks for bringing up this issue, resolving it, and providing a nifty helper script!

I thought about implementing codon-based alignment trimming. However, such a method becomes tricky to implement when alternative genetic codes are used by some, but not all, taxa in the alignment. For the time being, I will not implement a codon-based trimming approach, but I will keep this in mind for future versions of ClipKIT.

Thank you for your interest in ClipKIT. You may find some of the other software I have developed useful for your research needs, such as PhyKIT (a toolkit for processing and analyzing multiple-sequence alignments and phylogenetic trees), BioKIT (a broadly applicable toolkit for processing and analyzing sequence data), and orthofisher (a toolkit for extracting putative orthologs from proteomes).

All the best,

Jacob

JLSteenwyk commented 2 years ago

Hi @jsharbrough,

I can't express enough how cool it is to see you build a tool to complement ClipKIT, and it makes me feel like my efforts to build tools have been worthwhile.

To address your previous question, I have updated the thread_dna function in PhyKIT, which can now accept a ClipKIT-generated log file as an additional argument. PhyKIT will then generate the corresponding trimmed codon alignment. This function is available as of PhyKIT v1.11.5.

All the best,

Jacob

jsharbrough commented 2 years ago

Wonderful @JLSteenwyk, thank you for implementing this, I'm excited to check it out! And yes, this is a super useful set of tools, so well done!

Best,

Joel