dlce-eva / python-nexus

python-nexus - Generic nexus (.nex, .trees) reader/writer for python
BSD 2-Clause "Simplified" License
11 stars 3 forks source link

Make parsing of translate tables in tree blocks more robust #31

Closed SimonGreenhill closed 2 years ago

SimonGreenhill commented 2 years ago

This PR fixes a bug where trees with a translation block ignored taxa with asterisks in their name. These taxa labels would be silently ignored, i.e.

#nexus
begin trees;
    translate
        1 Simon,
        2 *Russell,
        3 Robert,
        tree tree = (1,2,3);
end;

..would mean that the taxon "*Russell" would be lost, and the tree would look like (Simon,2,Robert).

I believe that asterisks are invalid in taxa names anyway, but they are found in the wild (e.g. Grollemund's Bantu dataset).

It might be better to fix the bigger issue -- the translate table should not be able to have a different number of taxa in it to those in the trees. But I could not see an easy way to enforce this.

codecov-commenter commented 2 years ago

Codecov Report

Merging #31 (2623c33) into master (296c521) will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##            master       #31   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           59        59           
  Lines         2758      2774   +16     
=========================================
+ Hits          2758      2774   +16     
Impacted Files Coverage Δ
src/nexus/exceptions.py 100.00% <100.00%> (ø)
src/nexus/handlers/tree.py 100.00% <100.00%> (ø)
tests/test_cli.py 100.00% <100.00%> (ø)
tests/test_handler_TreeHandler.py 100.00% <100.00%> (ø)
tests/test_regressions.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 296c521...2623c33. Read the comment docs.

SimonGreenhill commented 2 years ago

ahh, figured out a way to check for this. We now raise an exception in detranslate if the Translate table doesn't match the tree.

SimonGreenhill commented 2 years ago

...and this popped up another bug. If there's no translate section in the trees block, we parse the first tree to identify taxa names. However this can be messed up with comments in the tree like those found here, e.g.:

...(hax[&length_mean=0.1146902001157983,height_median=2.775557561562891E-17,prob(percent)="100"...

where "percent" inside the comment is identified as an extra taxa. The PR now strips comments from the tree string before doing this, which is a little brute force but the comment info isn't needed here.

xrotwang commented 2 years ago

...(hax[&length_mean=0.1146902001157983,height_median=2.775557561562891E-17,prob(percent)="100"...



where "percent" inside the comment is identified as an extra taxa. The PR now strips comments from the tree string before doing this, which is a little brute force but the comment info isn't needed here.

Wow. I would have expected that braces in comments would crash the newick parsing hard. But apparently it's only square brackets that can't appear in comments ... I guess, comments in newick in Nexus is just another PHP ... and we are writing the parser ...

xrotwang commented 2 years ago

Ah, it does https://github.com/phlorest/sicoli_and_holton2014/blob/cd41d90973b8838879b3bcbba79c07051081745c/cldfbench_sicoli_and_holton2014.py#L7-L9 Makes me sort-of happy.