FePhyFoFum / phyx

phylogenetics tools for linux (and other mostly posix compliant) computers
blackrim.org
GNU General Public License v3.0
111 stars 17 forks source link

pxaa2cdn -r (again) #177

Closed NatJWalker-Hale closed 1 year ago

NatJWalker-Hale commented 1 year ago

Hi Joseph,

Not actually an issue this time, but maybe a feature request. Because of transcriptome data, I not uncommonly deal with mixtures of complete and incomplete transcripts, i.e. some with stop codons and some without. In this case, -r works for sequences with stop codons, but also trims the last (legitimate) codon for those without, which then produces a length mismatch.

Of course, this is kind of an edge case and I can always just use a separate script to drop the stop codons from the complete transcripts, but I wonder if you might consider having -r (or perhaps a separate option), actually recognise [TAG, TGA, TAA] and remove them if present?

Thanks,

Nat

josephwb commented 1 year ago

I will look at this today. Could you provide an example input file?

NatJWalker-Hale commented 1 year ago

Sure - here is a simple two-sequence example case (taken from a larger alignment, hence the gaps).

tmp.pep.aln.txt tmp.cds.txt tmp.cds.aa2cdn.aln.txt

pxaa2cdn -a tmp.pep.aln -n tmp.cds -o tmp.cds.aa2cdn.aln gives

Error: for taxon 'Bvulgaris_548_EL10_1_0@14052' nucleotide alignment involves 544 codons, but protein alignment involves 543 amino acids. Skipping.

while pxaa2cdn -r -a tmp.pep.aln -n tmp.cds -o tmp.cds.aa2cdn.aln gives

Error: for taxon 'LDEL@61442' nucleotide alignment involves 564 codons (after removing final codon), but protein alignment involves 565 amino acids. Skipping.

Ideally, the latter would recognise the absence of a stop codon in LDEL@61442 and not remove the last codon, then align it instead of skipping it.

josephwb commented 1 year ago

I hope to get to this today. Sorry for the delay.

NatJWalker-Hale commented 1 year ago

No worries! Not a rush at all.

josephwb commented 1 year ago

Sorry for the delay. Implemented as of 2824ec0. Seems to work for your test files. Can you try on a larger data set to confirm things? The new option is -s.

I've retained the original -r option (you may only use one at a time), but -s seems to work as a super-option i.e. it will work as -r if all seqs have a stop codon. Do you think it is worth keeping -r? Might be confusing?

NatJWalker-Hale commented 1 year ago

Hey, sorry for slow response. Thanks so much for implementing! I guess it makes sense to keep -r just in case there are non-stop codon reasons people want to remove the last codon? I can't think what those would possibly be, but it's always good to keep flexibility.

josephwb commented 1 year ago

Kewl, thanQ.

NatJWalker-Hale commented 1 year ago

One last thing from a different set of testing - would it be possible to have lowercase sensitivity (even thought it is not standard)? I have some datasets with some sequences in lowercase mixed in (e.g. where the tail end comes from repeat-masked section). Would be great to have the stop codons be [TAA, taa, TGA, tga, TAG, tag], but appreciate it is a weird request and reasonable to expect all upper.

josephwb commented 1 year ago

Yup, that is easily doable. Will take 2 minutes when I get a chance today. In the meantime programs like pxs2fa etc. have a -u uppercase export option. But like I said, I'll get to this in a bit.

josephwb commented 1 year ago

Should be working with 9c46bce. Probably good to have the case supported in case ones file contains all lowercase.