bovee / entab

* -> TSV
MIT License

Support XML #8

Open bovee opened 4 years ago

bovee commented 4 years ago

This is mostly necessary to support XML-based file formats like some of the Agilent MassHunter formats, mzML, etc.

There are a couple of existing streaming XML parsers for Rust that we could possibly wrap (https://github.com/netvl/xml-rs and https://github.com/tafia/quick-xml), but it may be "easy" enough to just write one on top of the existing ReadBuffer interface.

Passing a raw XML file into entab should probably result in a stream with fields like:

We may not actually want to do that, though, because it would probably require buffering all the data and emitting the nodes post-traversal (which isn't the most natural format to view and requires more memory).

bovee commented 2 years ago

Rather than storing the path as a Vec<String>, we could store it as a Vec<u8> plus a Vec<usize> of "delimiter" positions (e.g. a > for XML and maybe a , for JSON). Each new tag gets memcpy'd onto the end of the Vec<u8>, overwriting existing tags as we move up and down the stack. This would allow direct byte comparisons between the current state and a search path, as long as we don't need to track tag attributes (so it would work better for JSON, but I think we could shoehorn it in here too).
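
A rough sketch of that idea, to make the layout concrete. Everything here is hypothetical and not part of entab: the TagPath name, its methods, and the choice to append the delimiter after every tag are assumptions made for illustration. It also ignores attributes entirely, as noted above.

```rust
/// Hypothetical path tracker (not an entab type): tag names are stored
/// back-to-back in one byte buffer instead of as a Vec<String>.
#[derive(Debug, Default)]
struct TagPath {
    /// Concatenated tag names, each followed by DELIM.
    buf: Vec<u8>,
    /// Offset in `buf` just past each tag's delimiter; one entry per depth.
    ends: Vec<usize>,
}

/// Assumed delimiter for XML; a JSON reader might use b',' instead.
const DELIM: u8 = b'>';

impl TagPath {
    fn new() -> Self {
        Self::default()
    }

    /// Descend into a child element: copy its name onto the end of the buffer.
    fn push(&mut self, tag: &[u8]) {
        self.buf.extend_from_slice(tag);
        self.buf.push(DELIM);
        self.ends.push(self.buf.len());
    }

    /// Leave the current element. Truncating keeps the allocation, so the
    /// next push overwrites the bytes of the tag we just left.
    /// Returns false if the path was already empty.
    fn pop(&mut self) -> bool {
        if self.ends.pop().is_none() {
            return false;
        }
        let new_len = self.ends.last().copied().unwrap_or(0);
        self.buf.truncate(new_len);
        true
    }

    /// Number of open elements.
    fn depth(&self) -> usize {
        self.ends.len()
    }

    /// Compare the current path against a pre-encoded search path,
    /// e.g. b"mzML>run>spectrum>", with a single slice comparison.
    fn matches(&self, query: &[u8]) -> bool {
        self.buf.as_slice() == query
    }
}
```

With this layout, pushing mzML, run, and spectrum leaves the buffer holding mzML>run>spectrum>, and checking whether the parser is at a searched-for node is one byte-slice comparison rather than a string-by-string walk. Whether the delimiter should trail every tag or only separate them is a design choice this sketch does not settle.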