FePhyFoFum / phyx

phylogenetics tools for linux (and other mostly posix compliant) computers
blackrim.org
GNU General Public License v3.0

enforce multiple-of-3 for aa2cdn & pxtlate, but also add option to "ignore extra junk" #146

Open josephwb opened 3 years ago

josephwb commented 3 years ago

Found while investigating #143. pxaa2cdn won't work if nucleotide sequence lengths are not a multiple of 3 (codons). However, pxtlate will translate as much as it can and just disregard any "extra" nucleotides. Basically, I was unable to construct an example dataset for pxaa2cdn by using pxtlate, which is worrying.

Seems like both programs should share a policy: either enforce multiples of 3 in a draconian manner, or be chill and let whatever gets passed in go through.

I like the "force the user to use good data" route, but I do not do this enough to know how often nuc seqs will have "extra" nucleotides. And from working on the pxaa2cdn example I know how much of a pain it can be to wrangle seq lengths.

I guess @jfwalker would be the best to decide which policy above is most appropriate.

jfwalker commented 3 years ago

@josephwb I feel like this is something we talked about like 5 years ago and decided that it didn't need to be in multiples of three, but now I pretty strongly feel like it should. With all the automated pipelines that exist, the more checks the better.

josephwb commented 3 years ago

Thank you @jfwalker. I will put in an enforced check. If necessary, I can put in a -f --force do-it-anyway option for both.
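
For illustration only, the proposed policy (strict by default, with an opt-in "do it anyway" switch) could look roughly like the sketch below. This is not phyx code: the function name `check_codon_length` and the `force` parameter are hypothetical stand-ins for whatever a `-f --force` option would end up controlling.

```python
def check_codon_length(name, seq, force=False):
    """Return seq if its length is a multiple of 3 (hypothetical helper).

    force=False: raise ValueError on a bad length (the strict policy).
    force=True:  drop the 1 or 2 trailing "extra" nucleotides instead.
    """
    extra = len(seq) % 3
    if extra == 0:
        return seq
    if not force:
        raise ValueError(
            f"{name}: length {len(seq)} is not a multiple of 3"
        )
    # Lenient path: keep only the complete codons.
    return seq[:len(seq) - extra]
```

Sharing one helper like this between both programs would give them the same behaviour by construction, which is the point of the issue.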

josephwb commented 3 years ago

Should also consider removing terminal (stop) codons here, as they can be a pain.
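
A minimal sketch of what removing a terminal stop codon might involve, assuming the standard genetic code (stops TAA, TAG, TGA) and an ungapped, in-frame sequence. The name `strip_terminal_stop` is hypothetical and not part of phyx; aligned sequences with trailing gaps would need extra handling.

```python
# Stop codons under the standard genetic code (an assumption; other
# translation tables use different stops).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def strip_terminal_stop(seq):
    """Remove one trailing stop codon from an in-frame sequence, if present."""
    last = seq[-3:].upper().replace("U", "T")  # tolerate RNA input
    if len(seq) >= 3 and len(seq) % 3 == 0 and last in STOP_CODONS:
        return seq[:-3]
    # Out-of-frame sequences and internal stops are left untouched.
    return seq
```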

jbernot commented 2 years ago

+1 for the "remove extra junk" option. I would love that feature. I am running into cases where I get "nucleotide alignment involves 191 codons, but protein alignment involves 190 amino acids. Skipping". In my cases, it is a leading codon that is absent from the AA alignment. I would love it if it were possible to ignore the missing leading codon and output only the 190 codons that correspond to the 190 AA.
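
One way the requested behaviour could work, sketched under loud assumptions: sequences are ungapped, the standard genetic code applies, and the protein is an exact substring of the translated nucleotides. The names `translate` and `trim_to_protein` are hypothetical and do not describe how pxaa2cdn is implemented.

```python
from itertools import product

# Standard genetic code, built in the conventional TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(nuc):
    """Translate complete codons only; unknown codons become 'X'."""
    usable = len(nuc) - len(nuc) % 3
    return "".join(
        CODON_TABLE.get(nuc[i:i + 3].upper(), "X") for i in range(0, usable, 3)
    )


def trim_to_protein(nuc, aa):
    """Return the codons of nuc that translate to aa, or None if no match."""
    start = translate(nuc).find(aa)
    if start == -1:
        return None
    return nuc[3 * start:3 * (start + len(aa))]


# A leading ATG (M) that is absent from the protein gets dropped.
print(trim_to_protein("ATGGCTAAA", "AK"))  # GCTAAA
```

Returning `None` on a mismatch keeps the "skip this sequence" outcome available, so the lenient path only applies when the correspondence is unambiguous.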