Issue when parsing large files

CoreOffice / XMLCoder

Easy XML parsing using Codable protocols in Swift

https://coreoffice.github.io/XMLCoder/

MIT License


Issue when parsing large files #251

Closed: alexsteinerde closed this issue 1 year ago

alexsteinerde commented 2 years ago

I'm having issues parsing files larger than 10 MB on Linux (Docker). I researched the topic after finding out that it only occurs with Data larger than 10 MB. It seems that libxml2 has a hardcoded limit of 10_000_000 bytes (XML_MAX_TEXT_LENGTH) to prevent buffer overflows. In my case, I would like to override this. The proposed solution is to pass an additional option (XML_PARSE_HUGE) to the libxml2 call, but since that call is encapsulated in XMLParser, I'm not sure this is possible.

Has anyone encountered this issue before? If not, do you have any recommendations for approaching the Foundation team about adding this kind of option?

Big thanks in advance.

mflint commented 2 years ago

I don't know if it's possible - or even advisable - to increase that buffer size.

For large files, personally I'd start to investigate using a SAX parser, rather than a DOM parser.
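To illustrate the event-driven (SAX-style) approach suggested above, here is a minimal sketch using Foundation's `XMLParser` delegate callbacks directly, without building a tree of the document. `ElementCounter` and `countElements(in:)` are hypothetical names made up for this example, not part of XMLCoder. Note the assumption: on Linux `XMLParser` is still backed by libxml2, so this would presumably not lift the limit described in the original post; it only avoids holding the parsed structure in memory.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives in this module on Linux
#endif

/// Hypothetical delegate that tallies element names as the parser streams through the document.
final class ElementCounter: NSObject, XMLParserDelegate {
    private(set) var counts: [String: Int] = [:]

    // Called once for every opening tag; nothing else about the element is retained.
    func parser(
        _ parser: XMLParser,
        didStartElement elementName: String,
        namespaceURI: String?,
        qualifiedName qName: String?,
        attributes attributeDict: [String: String] = [:]
    ) {
        counts[elementName, default: 0] += 1
    }
}

/// Returns the number of occurrences of each element name, or nil if parsing fails.
func countElements(in data: Data) -> [String: Int]? {
    let parser = XMLParser(data: data)
    let delegate = ElementCounter()
    parser.delegate = delegate
    return parser.parse() ? delegate.counts : nil
}
```

A real handler would pick out only the elements it needs in `didStartElement` / `foundCharacters` and discard the rest as it goes.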

Joannis commented 2 years ago

I haven't needed this yet, but XMLCoder interacts with FoundationXML rather than with libxml2 directly. I'm not aware of any options like this (yet). Regardless, I have to agree with @mflint here. It's unwise to load very large files into this type of parser. I have discussed swapping out the parser under the hood, and/or supporting other parsing methods, but so far the other maintainers weren't fond of the idea.

alexsteinerde commented 1 year ago

Thanks for the answers. It seems this cannot be fixed because of the underlying implementation of the frameworks used (libxml2). As far as I understand, FoundationXML uses libxml2 on Linux, and requesting a change there isn't worth it for me. In my use case, the memory footprint didn't matter.

My working solution for now is to remove unneeded XML tags with a regular expression before parsing. This is not perfect, but it worked in my case.
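For anyone wanting to try the same workaround, here is a rough sketch of what such a pre-processing step could look like. The helper name `strippingElements(named:from:)` is made up for this example, and the actual expression used above was not posted. It assumes the element to drop is never nested inside itself and does not appear inside CDATA or comments; a regular expression cannot handle those cases reliably.

```swift
import Foundation

/// Hypothetical helper: removes every `<tagName ...>...</tagName>` element and every
/// self-closing `<tagName ... />` from an XML string before it is handed to the parser.
func strippingElements(named tagName: String, from xml: String) -> String {
    let name = NSRegularExpression.escapedPattern(for: tagName)
    // First alternative: self-closing tag. Second: open tag, lazily matched body, close tag.
    let pattern = "<\(name)(?:\\s[^>]*)?/>|<\(name)(?:\\s[^>]*)?>.*?</\(name)\\s*>"
    guard let regex = try? NSRegularExpression(
        pattern: pattern,
        options: [.dotMatchesLineSeparators]  // let `.` span line breaks inside the element body
    ) else {
        return xml  // pattern failed to compile; leave the input untouched
    }
    let range = NSRange(xml.startIndex..., in: xml)
    return regex.stringByReplacingMatches(in: xml, options: [], range: range, withTemplate: "")
}
```

The stripped string can then be converted back with `Data(stripped.utf8)` and decoded as usual, provided the result is below the size limit.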