FePhyFoFum / phyx

phylogenetics tools for linux (and other mostly posix compliant) computers
blackrim.org
GNU General Public License v3.0
111 stars 17 forks source link

pxaa2cdn does not work anymore on the example files #143

Closed nileshpatra closed 3 years ago

nileshpatra commented 3 years ago

Hi,

On doing analysis on the example data provided, pxaa2cdn does not output any files, and finishes with error:

$ cd example_files/pxaa2cdn_example/
$ pxaa2cdn -a AA_Alignment.fa -n Unaligned_Nucleotide.fa -o out
Error: nucleotide sequence length for 'Drosophila_setosifrons' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_adunca' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_oahuensis' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_picticornis' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_nigribasis' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_silvestris' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_heteroneura' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_differens' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_ingens' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_neopicta' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_substenoptera' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_neoperkinsi' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_melanocephala' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_hanaulae' is not a multiple of 3.
Error: nucleotide sequence length for 'Drosophila_cyrtoloma' is not a multiple of 3.
Error: nucleotide alignment does not appear to be in frame. Exiting

Version 1.01 however outputs the required output. Please look into it. And please consider changing the example data for this one if it's a problem with input itself.

josephwb commented 3 years ago

I will look into this. The sequences are definitely not multiples of 3. The question is whether this necessarily needs to be the case...

josephwb commented 3 years ago

@jfwalker should we require the nucleotide alignment to be a multiple of 3 in length, or do something with "extra" nucleotides (i.e., ignore)?

jfwalker commented 3 years ago

Hmmm...That's a good question. I think in general for peace of mind it's good to make sure they are in length of three as sort of a double check that the AA sequence corresponds to the nucleotide. We could have some type of override option where it trims the excess but I'm not sure how often or for what reasons that issue would arise.

josephwb commented 3 years ago

I can confirm that the check ("not a multiple of 3") did not exist in v1.01.

I'm actually not sure why it did not crash, because it would try to add nucleotides that did not exist i.e., would refer to index positions outside of the string length. Weird. In at least one example I tried with only 1 nucleotide in the last codon it found nothing for the second nucleotide but somehow found a third nucleotide?!?

Input:

>Sequence1
ATGAAAA
>Sequence2
ATG
>Sequence3
ATGATGATGATGAT
>Sequence5
ATGATGATGATGAT

Output:

>Sequence1
ATG---AAA---AA
>Sequence2
ATG------------
>Sequence3
ATGATGATGATGAT
>Sequence5
ATGATGATGATGAT

How did Sequence1 get 5 terminal As?!?

Anyway, the old code was bad, the new code is correct. I think forcing the user to pass in correct data is reasonable. The only issue here is a bad example, which I will now fix.

josephwb commented 3 years ago

@jfwalker you came up with the (Drosophila) example data. Do you know how 15 (out of 19) nucleotide sequences are not a multiple of 3? Is this likely to happen?

josephwb commented 3 years ago

@nileshpatra I replaced the example with one that works. Thanks for pointing that out.

nileshpatra commented 3 years ago

Many thanks! \o/