dlce-eva / python-nexus

python-nexus - Generic nexus (.nex, .trees) reader/writer for python
BSD 2-Clause "Simplified" License
11 stars 3 forks source link

first stab at v3 #43

Closed xrotwang closed 1 year ago

xrotwang commented 1 year ago

On token/block/command level, this parser implementation should already be quite complete. E.g. we can do

nex = Nexus.from_file(Path('tests/regression/detranslate-beast-2.trees'))
trees = nex.first_block(name='TREES')
trees.commands['TRANSLATE'].text_lines
['1 A,', '2 B,', '3 C,', '4 D,', '5 E,', '6 F,', '7 G,', '8 H,', '9 I,', '10 J']
xrotwang commented 1 year ago

The nexus.parser.Nexus object stores tokens (i.e. (text, type) pairs) of a Nexus file. All subsequent reading is then based on this token list (which I'd expect to be quick enough).

I'd imagine writing could happen through methods add_block or add_taxon on Nexus, which manipulate this token list.

xrotwang commented 1 year ago

In terms of performance, the effect of this parser is mixed: With the old NexusReader, "reading" the 1000 trees from the ~60MB posterior of the Koile-tree is less than 1sec, with the new parser it's ~14secs. This looks dramatic, but the picture changes a lot when the newick of the trees is parsed. This hikes up the cost to ~105secs for the old method, and to 125secs for the new one .

Overall, I'd think that's acceptable. And with the new parser, we could still implement some correctness/speed trade-offs like "chunk on BEGIN and END before parsing the block content". Having read through the Nexus paper, it seems that the current naive parsing will lead to more problems in the future. (E.g. I've been somewhat surprised that comments in block names are explicitly allowed :) )

It's a just unfortunate that the "comments" and "quoted strings" feature which is built into Nexus basically means that you cannot stream Nexus, but have to parse to the end before knowing what's what.

xrotwang commented 1 year ago

I also like this API:

>>> nexus.blocks['TREES'][0].commands['TREE'][0].newick.node.walk()
xrotwang commented 1 year ago

It may really be good to turn python-nexus into a full-fledged Nexus reader (and somewhat full-fledged writer). The only somewhat comprehensive python solution I found - Bio.Nexus.Nexus - crashes on the Koile posterior - after 7 secs:

$ time python test.py 
Traceback (most recent call last):
  File "/home/robert/projects/biopython/test.py", line 4, in <module>
    nex = Nexus.Nexus('../dplace/trees/koile_et_al_subm/cldf/posterior.trees')
  File "/home/robert/venvs/biopython/lib/python3.10/site-packages/Bio/Nexus/Nexus.py", line 667, in __init__
    self.read(input)
  File "/home/robert/venvs/biopython/lib/python3.10/site-packages/Bio/Nexus/Nexus.py", line 731, in read
    self._parse_nexus_block(title, contents)
  File "/home/robert/venvs/biopython/lib/python3.10/site-packages/Bio/Nexus/Nexus.py", line 772, in _parse_nexus_block
    getattr(self, "_" + line.command)(line.options)
  File "/home/robert/venvs/biopython/lib/python3.10/site-packages/Bio/Nexus/Nexus.py", line 1171, in _tree
    raise NexusError(f"Syntax error in tree description: {options[:50]}")
Bio.Nexus.Nexus.NexusError: Syntax error in tree description: posterior-1 = ((((Aghem_Grassfields[&location={6.3

real    0m7,986s
user    0m7,893s
sys 0m0,092s
xrotwang commented 1 year ago

I'm down to parsing the 60MB Nexus in about 10secs. And I really like the API, which - with just a tiny bit of syntactic sugar - now looks like

Nexus.TREES.TREE.newick.node.walk()

(uppercase attributes resolve to the first instance of a block with name matching the attribute)

Since Nexus is just a list of commands, it also allows editing - on command level - quite easily:

>>> nex = Nexus('#NEXUS')
>>> print(nex)
#NEXUS
>>> nex.append_block('TREES', [('TREE', 'post = (a,b)c')])
>>> print(nex)
#NEXUS
BEGIN TREES;
TREE post = (a,b)c;
END ;

Overall, though, I think the best way forward would be an entirely new package. Maybe called commonnexus - in analogy to commonmark?

xrotwang commented 1 year ago

So I started a new package and the tests show a bit how I imagine the editing story to go: https://github.com/dlce-eva/commonnexus/blob/12d8921fb336f5fd6c28f1217c828c6d9aa6fa5a/tests/test_nexus.py#L73-L85

I think, implementing the API of current python-nexus on top of this would be possible, but also possibly quite ugly. So I'd prefer to have two separate packages with one possibly about to be phased out.

SimonGreenhill commented 1 year ago

...this looks cool, thanks for looking into it. I'm very keen to see how this works on the full test set, but I'm worried about creating a new package as there's lots of dependencies on python-nexus (at least in my workflow & phlorest!). Is there a way to patch this into python-nexus and do some radical deprecations to clean up the API?

xrotwang commented 1 year ago

After a bit more work, I think a new package is the way to go. The trade off between speed and correctness is very real, so I think both packages have a place. I could see python-nexus using commonnexus.tokenizer, though, for tricky stuff like translate command content.

Simon J Greenhill @.***> schrieb am Mi., 8. Feb. 2023, 22:20:

...this looks cool, thanks for looking into it. I'm very keen to see how this works on the full test set, but I'm worried about creating a new package as there's lots of dependencies on python-nexus (at least in my workflow & phlorest!). Is there a way to patch this into python-nexus and do some radical deprecations to clean up the API?

— Reply to this email directly, view it on GitHub https://github.com/dlce-eva/python-nexus/pull/43#issuecomment-1423253333, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAGUOKCF5YC4AS32XN5264TWWQE2ZANCNFSM6AAAAAAUT3HAFI . You are receiving this because you authored the thread.Message ID: @.***>

xrotwang commented 1 year ago

@SimonGreenhill I did some optimization of the python-newick parser, and integrated this into commonnexus (i.e. commonnexus re-uses the Nexus parse-tree to instantiate newick Nodes from parsed tokens directly). This brings down reading 1000 trees in a 60MB nexus to ~40secs - which isn't just about a third of what it was with our old toolset, but it's also less than two thirds of the time dendropy.TreeList.get takes.

So after being about to abandon the commonnexus attempt, because dendropy seemed pretty feature-complete when it comes to reading Nexus, I'm now almost convinced that commonnexus has something to offer.

xrotwang commented 1 year ago

@SimonGreenhill I should note that python-nexus with the to-be-released newick 1.7 is at 30secs for the 1000 trees :) So passing around pre-parsed tokens doesn't seem to be that big of a deal.

SimonGreenhill commented 1 year ago

ok, looks promising, but am still concerned with the amount of work to switch everything over. How does this work on the various weird nexus files here and here

xrotwang commented 1 year ago

ok, looks promising, but am still concerned with the amount of work to switch everything over. How does this work on the various weird nexus files here and here

I'm porting the regression tests right now. In the process, I noticed that some of the test NEXUSes, e.g. bad_chars_in_taxanames.trees, are malformed in an unintended? way: There's no END; command but just a ;. I'd assume these files are hand-written, and python-nexus is not supposed to accept blocks without an END;?

SimonGreenhill commented 1 year ago

These are generally nexuses I've encountered in the wild and needed to parse but, in this case, given the name of the test, I think it's probably ok to add the END;

xrotwang commented 1 year ago

Another weirdness is this newick tree supposedly created by mesquite:

(1:0.1,2:0.2[%color = 5 ]):0.4,3:0.3[% ] [% ] [%  setBetweenLong = color ];

python-newick handles the comments, but chokes on the siblings :0.4,3:0.3 which are not enclosed in (). AFAIKT this is invalid newick. Is it possible that just parts of a newick string were pasted into mesquite_formatted_branches.trees to test the comment handling?

xrotwang commented 1 year ago

Looking at other files created by Mesquite that problem doesn't seem to pop up, e.g. https://github.com/AgneseLan/ontogeny-asymmetry-dolphin/blob/07c7f7d5337129fa8c281a457bd7ee2fde20ce0a/Data/tree_odontonly.txt

xrotwang commented 1 year ago

glottolog.trees lists taxa right in the TAXA block, and not as payload of a TAXLABELS command. I think straying away from the NEXUS spec that far shouldn't be supported.

xrotwang commented 1 year ago

The test using the glottolog.trees seems to be there to show that duplicate taxon labels are supported to some extent. But I'm not sure this should be supported at all - considering that order of taxon labels is significant and taxa can be referred to by index from all sorts of places.

xrotwang commented 1 year ago

To some extent, the above issues seem to highlight the need for a spec-compliant NEXUS reader - possibly with well-defined, configurable exceptions, such as allowing - or * within unquoted taxon labels.

SimonGreenhill commented 1 year ago

I think you're right about that mesquite test, and yes, let's fail on duplicate taxa names. That should be a broken tree anyway.

xrotwang commented 1 year ago

Re-implementing the tools from python-nexus in commonnexus highlights the potential of commonnexus (and limitations of python-nexus, respectrively) quite clearly, I think. E.g. combining data matrices becomes a lot more robust, e.g. allowing taxa referenced by number in one matrix and by label in another (see https://github.com/dlce-eva/commonnexus/blob/e5be69af7daf931ae7a76ac3282f96b56c0d9ab9/tests/test_tools_combine.py#L61-L90); or combining trees with or without TRANSLATE, see https://github.com/dlce-eva/commonnexus/blob/e5be69af7daf931ae7a76ac3282f96b56c0d9ab9/tests/test_tools_combine.py#L99-L108