crystal-lang / crystal

Add bindings/API for LibXML SAX parsing #9048

Open Blacksmoke16 opened 4 years ago

Blacksmoke16 commented 4 years ago

I spent some time yesterday trying to improve the memory usage of https://github.com/Blacksmoke16/oq when parsing large XML documents into JSON. However, as I learned more about libxml2, I realized that what I want to accomplish is not possible with the current bindings.

XML::Reader uses http://xmlsoft.org/html/libxml-xmlreader.html which, from what I can gather, internally creates a DOM tree representation of the data. Because of this, memory consumption when going from XML to JSON is often quite bad: a 300 MB document uses ~1.5 GB of memory. I'm sure my application code adds some additional overhead.

The reverse process is much more efficient. Converting the JSON representation of the 300 MB XML file back to XML barely uses 10 MB, thanks to the streaming nature of JSON::PullParser and XML::Builder.

After doing some research, it turns out libxml2 has a callback-based API called SAX that is intended for parsing large documents.
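
To illustrate the callback style that SAX implies, here is a minimal sketch in Python using the standard library's `xml.sax` module. This is only an analogy: `xml.sax` is backed by expat rather than libxml2, it is not the proposed Crystal API, and the `ElementCounter` and `count_elements` names are hypothetical. The parser pushes events into handler methods and never builds a tree of the document.

```python
import io
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler: tallies element names and collects non-blank text."""

    def __init__(self):
        super().__init__()
        self.counts = {}
        self.text = []

    def startElement(self, name, attrs):
        # Called by the parser each time an opening tag is seen.
        self.counts[name] = self.counts.get(name, 0) + 1

    def characters(self, content):
        # Called with character data; may arrive in several chunks.
        if content.strip():
            self.text.append(content.strip())


def count_elements(xml_bytes):
    # The parser drives the loop and invokes the handler's callbacks.
    handler = ElementCounter()
    xml.sax.parse(io.BytesIO(xml_bytes), handler)
    return handler


handler = count_elements(b"<items><item>a</item><item>b</item></items>")
print(handler.counts)
```

Only the handler's own state is kept in memory, which is why this style suits very large inputs.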

I propose that bindings be added for this other API, and that we discuss how we want the Crystal side of the API to work, probably as a new type such as XML::Parser.

asterite commented 4 years ago

I think we should have an XML::PullParser. The way to implement it would be to bind SAX solely for that purpose. In my experience, programming with a class that receives events ends up being a big pile of messy code.
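
For comparison with event-receiving classes, here is a rough sketch of the pull style using Python's standard `xml.dom.pulldom`, which is itself layered on top of a SAX parser. It is an analogy only, not a proposal for what XML::PullParser should look like, and `item_texts` is a hypothetical helper. The calling code asks for the next event and keeps its state in ordinary local variables instead of handler fields.

```python
from xml.dom import pulldom


def item_texts(xml_text):
    """Collect the text inside <item> elements by pulling events one at a time."""
    texts = []
    inside_item = False
    # The caller drives the loop; the parser only produces the next event on demand.
    for event, node in pulldom.parseString(xml_text):
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            inside_item = True
        elif event == pulldom.END_ELEMENT and node.tagName == "item":
            inside_item = False
        elif event == pulldom.CHARACTERS and inside_item:
            texts.append(node.data)
    return texts


print(item_texts("<items><item>a</item><item>b</item></items>"))
```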

RX14 commented 4 years ago

:+1: for PullParser. The concept is already used for JSON and YAML.

Blacksmoke16 commented 3 years ago

Another alternative could be binding https://github.com/libexpat/libexpat and providing it as a more stream-based XML parser in a shard.