FePhyFoFum / phyx

phylogenetics tools for linux (and other mostly posix compliant) computers
blackrim.org
GNU General Public License v3.0
111 stars 17 forks source link

pxcomp outputs an enormous file, or rejects unaligned sequences. #175

Open Gullumluvl opened 1 year ago

Gullumluvl commented 1 year ago

Hi all,

I compiled phyx (commit b359a6d, 2022/08/19) on Ubuntu 20.04. The tests from make check all pass.

I am trying to compute the Chi square test of compositional homogeneity with pxcomp from fasta input, but it fails in different manners:

  1. nucleotides, aligned, ungapped:

    echo -e '>a\nACGTACAAGTCCATT\n>b\nACCTAAGGCTTCAAG' | pxcomp -o test_pxcomp.out

    This succeeds, but the output file is 4.1G ! It contains some gigantic lines and we have to be careful not to crash the computer while trying to open it:

    sed -n 'l80' test_pxcomp.out | head -n 20
            Observed character counts:$
                 A            C            G            T        Nchar$
    a            5            4            2            4           15$
    b            5            4            3            3           15$
    Total                                                                          \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \
                                                                                   \

    So this file is filled with 4G of space characters I suppose.

  2. nucleotides, aligned, with gap characters

    echo -e '>a\n-ACGTACAAGTCCATT\n>b\nA-CCTAAGGCTTCAAG' | pxcomp -o test_pxcomp.out

    output file is also 4.1G.

    1. nucleotides, unaligned, no gap

      echo -e '>a\nACGTA\n>b\nACCTAAGGCTTCAAG' | pxcomp -o test_pxcomp.out

    Fails with Error: sequences must be aligned. Exiting.. However for this kind of test, there is no need to have aligned sequences. The average alignment frequencies can be scaled to the given sequence size, is this implemented this way?

  3. amino-acids, aligned, no gap:

    echo -e '>a\nABCDEFGHIKLMNPQRSTVWXYZ\n>b\nAAADEFGHIKLMNPQRSTVWZZZ' | pxcomp -o test_pxcomp.out

    Fails with Error: missing/ambiguous characters not currently supported. Exiting.

Any ideas for fixing these problems, especially the file size problem?

josephwb commented 1 year ago

Yikes! I will look at it. Thanks for the issue.

josephwb commented 1 year ago

Regarding 3, that was a choice to go with aligned sequences. I'll think about relaxing this constraint. But an easy way around this at the moment is to just do a (poor) alignment first.

josephwb commented 1 year ago

@Gullumluvl The bug is very stupid and regards formatting the output. It assumes that the taxon labels are a certain length; since yours are only 1 character, it doesn't know what to do and dies. I will fix this shortly. Sorry for the hassle.

josephwb commented 1 year ago

@Gullumluvl should be fixed as of 0d0e7b9.

Please note this is a rough-and-dirty test of compositional heterogeneity. Better (tree-based) tests are available in p4 (and PAUP, I think).

Gullumluvl commented 1 year ago

Great, thank you! Ah yes maybe my dummy input was too dummy... The real dataset I tried first was failing because it's amino-acid, I guess? I wasn't aware of p4, thanks!

josephwb commented 1 year ago

@Gullumluvl Hmm not sure why it is not working on amino acids. It is working on binary data. I will take another look.