elecena / xml-iterator

Memory efficient and fast XML parser with iterator interface
https://packagist.org/packages/elecena/xml-iterator
MIT License

Parser suddenly quits while parsing Discogs dump #13

Open aleksblendwerk opened 1 month ago

aleksblendwerk commented 1 month ago

Hi there,

as I am currently looking to speed up my database import code for Discogs' dump files, I just tried your library with this file: https://discogs-data-dumps.s3-us-west-2.amazonaws.com/data/2024/discogs_20240701_labels.xml.gz. I might be using it wrong, but parsing seems to stop after a couple thousand nodes.

This is more or less my code:

$stream = fopen('compress.zlib://[...]/discogs/discogs_20240701_labels.xml.gz', 'rb');

foreach (new XMLParser($stream) as $node) {
    if ($node instanceof XMLNodeContent && $node->name === 'label') {
        var_dump($node->content);
    }
}

fclose($stream);

The output ends with

string(67) "https://web.archive.org/web/20160427071301/http://www.exogenic.com/"
string(17) "Breakbeat Science"
string(17) "Breakbeat Science"

Somehow parsing suddenly ends at about 1% into the file.

I haven't investigated this further yet and will look elsewhere for now, but I thought I'd report it.

macbre commented 1 month ago

@aleksblendwerk - first of all, thanks for giving my library a try and reporting the bug!

Is PHP reporting any error? Is the gzip'ed XML properly formatted? What's the exit code of that script when you run it?

aleksblendwerk commented 1 month ago

> @aleksblendwerk - first of all, thanks for giving my library a try and reporting the bug!

You're welcome!

> Is PHP reporting any error? Is the gzip'ed XML properly formatted? What's the exit code of that script when you run it?

PHP doesn't report any error and the process just exits normally, exit code 0. A timestamp I echo after the fclose is also printed.
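Since the script exits with code 0 and no error, one way to narrow this down is to check right after the loop whether the input stream was actually read to its end. The following is only a minimal sketch using built-in stream functions. `streamStatus` is a hypothetical helper name (not part of xml-iterator), and the small gzip file written to a temporary path stands in for the real dump.

```php
<?php
// Hypothetical helper (not part of xml-iterator): report how far a stream was consumed.
function streamStatus($stream): array
{
    return [
        'eof'      => feof($stream),  // true once the stream has been read to its end
        'position' => ftell($stream), // number of bytes read from the stream so far
    ];
}

// Stand-in for the real dump: a small gzip file at a temporary path.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, gzencode(str_repeat('<label>x</label>', 1000)));

$stream = fopen('compress.zlib://' . $path, 'rb');

fread($stream, 100);             // simulate a consumer that stops early
$early = streamStatus($stream);

while (!feof($stream)) {         // simulate a consumer that reads everything
    fread($stream, 8192);
}
$done = streamStatus($stream);

fclose($stream);
unlink($path);

var_dump($early, $done);
```

If the same check, placed between the `foreach` and the `fclose`, shows that `feof($stream)` is still false, the parser stopped before the input was exhausted.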

The XML should be fine, I successfully loaded it using PHP's built-in XMLReader.

One thing I noticed in the given XML file is that within the label nodes it might contain a sublabels node with child nodes called label again. Maybe that's a case you haven't encountered with your parser before.
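To make the nesting concrete, here is a small sketch using PHP's built-in XMLReader. The XML sample is hand-written and hypothetical: its layout is assumed from the description above and the values are invented, not taken from the dump. It separates top-level `label` elements from `label` elements nested under `sublabels` by their depth.

```php
<?php
// Hypothetical sample: structure assumed from the description, values invented.
$xml = <<<XML
<labels>
  <label>
    <name>Parent Label</name>
    <sublabels>
      <label>Sub Label A</label>
      <label>Sub Label B</label>
    </sublabels>
  </label>
  <label>
    <name>Standalone Label</name>
  </label>
</labels>
XML;

$reader = new XMLReader();
$reader->XML($xml);

$topLevel = 0;
$nested = 0;

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'label') {
        // Depth 0 is the <labels> root and depth 1 its direct children;
        // anything deeper is a <label> nested inside another element.
        if ($reader->depth === 1) {
            $topLevel++;
        } else {
            $nested++;
        }
    }
}

$reader->close();

var_dump($topLevel, $nested);
```

A sample of this shape could also serve as a starting point for a reduced test case, provided the nesting really is what triggers the early stop.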

macbre commented 1 month ago

> One thing I noticed in the given XML file is that within the label nodes it might contain a sublabels node with child nodes called label again. Maybe that's a case you haven't encountered with your parser before.

Might be. Can you submit the XML you're trying to parse? Or at least a small sample that can be used to reproduce the problem?

aleksblendwerk commented 1 month ago

> Might be. Can you submit the XML you're trying to parse? Or at least a small sample that can be used to reproduce the problem?

It is the file I linked in the initial post:

https://discogs-data-dumps.s3-us-west-2.amazonaws.com/data/2024/discogs_20240701_labels.xml.gz

As for providing a small sample to reproduce it, that would probably require me to dig in deeper than I can right now.