Open Blacksmoke16 opened 4 years ago
I think we should have an XML::PullParser. The way to implement it would be to bind SAX and use those bindings only as the foundation for the pull parser. In my experience, programming against a class that receives events ends up as a big pile of messy code.
:+1: for PullParser. The concept is already used for JSON and YAML.
Another option could be binding https://github.com/libexpat/libexpat and providing it as a more stream-based XML parser in a shard.
I spent some time yesterday trying to improve the memory usage of https://github.com/Blacksmoke16/oq when parsing large XML documents into JSON. However, as I learned more about libXML, I realized that what I want to accomplish is not possible with the current bindings.
XML::Reader uses http://xmlsoft.org/html/libxml-xmlreader.html, which, from what I can gather, internally creates a DOM tree representation of the data. Because of this, memory consumption when going from XML to JSON is often quite bad: a 300 MB document uses ~1.5 GB of memory. I'm sure my application code adds some additional overhead.
The reverse process is much more efficient. Converting the JSON representation of the 300 MB XML file back to XML barely uses 10 MB, thanks to the streaming nature of JSON::PullParser and XML::Builder.
After doing some research, it turns out libXML has a callback-based API intended for parsing large documents, called SAX. I propose that bindings be added for this API, and that we discuss how we want the Crystal side of it to function, probably as a new type named XML::Parser.
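For reference, the streaming JSON-to-XML direction mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code oq actually uses: the `emit` helper and the mapping of JSON arrays to repeated sibling elements are assumptions made for the example. Only one JSON token is held at a time and nothing is buffered as a tree, which is why memory stays low.

```crystal
require "json"
require "xml"

# Hypothetical helper (not part of the standard library or oq): writes the JSON
# value at the parser's current position as an XML element called `name`.
def emit(pull : JSON::PullParser, xml : XML::Builder, name : String) : Nil
  case pull.kind
  when .begin_object?
    xml.element(name) do
      pull.read_object { |key| emit(pull, xml, key) }
    end
  when .begin_array?
    # Assumed mapping: each array item becomes a sibling element with the same name.
    pull.read_array { emit(pull, xml, name) }
  when .string?
    xml.element(name) { xml.text pull.read_string }
  when .int?
    xml.element(name) { xml.text pull.read_int.to_s }
  when .float?
    xml.element(name) { xml.text pull.read_float.to_s }
  when .bool?
    xml.element(name) { xml.text pull.read_bool.to_s }
  else
    # Remaining case is null: consume it and write an empty element.
    pull.read_null
    xml.element(name) { }
  end
end

# Usage: a JSON document with a single top-level key, so the XML has one root.
json = %({"root": {"name": "demo", "item": [1, 2]}})
pull = JSON::PullParser.new(json)

xml = XML.build(indent: "  ") do |builder|
  pull.read_object { |key| emit(pull, builder, key) }
end
puts xml
```

An XML::PullParser, or the proposed XML::Parser, would ideally allow the opposite direction to be written in the same style, with the XML side consumed token by token instead of through a DOM-backed reader.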