Norconex / importer

Norconex Importer is a Java library and command-line application meant to "parse" and "extract" content out of a file as plain text, whatever its format (HTML, PDF, Word, etc.). In addition, it allows you to perform any manipulation on the extracted text before using it in your own service or application.
http://www.norconex.com/collectors/importer/
Apache License 2.0

DOMTagger & DOMSplitter and XML file size #36

Closed liar666 closed 7 years ago

liar666 commented 8 years ago

Hi,

I've been using DOMTagger and DOMSplitter a lot in my crawlers, as I'm used to this way of simply extracting data from web pages (note: I come from the Heritrix world, where I used XPath expressions...).

In the docs at http://www.norconex.com/collectors/importer/latest/apidocs/ I read: "This class constructs a DOM tree from the document content. That DOM tree is loaded entirely into memory. Use this splitter with caution if you know you'll need to parse huge files. It may be preferable to use a stream-based approach if this is a concern."

I have a question about what you mean by "huge": how big can an XML file get before it becomes a problem?

Indeed, I recently started crawling files from https://pairbulkdata.uspto.gov/ . Inside the "xml" zip file provided on this page, there are XML files listing patents, split by year. For recent years (>2000), these XML files can grow to 6GB and more.

I assumed this was indeed huge, so before running any Norconex collector, I used Perl::xml_xplit to split these XML files into smaller chunks, starting with chunks of 150MB. When I started my filesystem-collector in a JVM with a max heap of 8GB, I thought this would be reasonable, but the code seemed to "freeze" very quickly (0 CPU usage, in sleep state, 10GB of virtual memory). I then reduced the chunk size to 50MB and restarted, but I'm still facing the same problem: the first split is processed (very slowly, taking at least a day), then the second split seems frozen (for 4 days now).

Is 50MB actually too big for an XML file to be processed with a combination of DOMTaggers and DOMSplitter?

NOTE1: the filesystem-collector is configured to use only one thread.

NOTE2: I attempted to understand what the processes are doing. Thanks to https://meenakshi02.wordpress.com/2011/02/02/strace-hanging-at-futex/ , I found a few commands to help: ps -efL|grep <process name> and strace -p <process id>. In my case, they returned:

[pid  2642] futex(0x66b5909459d0, FUTEX_WAIT, 2643, NULL <unfinished ...>
[pid  2636] wait4(-1, ^CProcess 2642 detached

So apparently I do indeed have only a single child process, which is wait(ing) for something, and its parent process, which is blocked on a futex.

NOTE3: I've also run "iotop" and the processes do not seem to have any IO activity...

As a consequence, I'm totally clueless about what they could be waiting on...

essiembre commented 8 years ago

DOMTagger/DOMSplitter are not good classes for you to use in this case.

"Huge" is subjective, but try to create an object-graph from 50MB is probably quite a lot. Given it needs to create many objects in memory for each nodes and attributes, the total memory consumption is probably significantly larger than your file size.

There are generally two approaches to parsing XML files. One is loading the whole file in memory as a DOM tree, an approach meant for relatively small files. That's what DOMTagger/DOMSplitter do. They were created mainly to deal with web pages (which are usually very far from being 50MB). That's why there is the disclaimer in the documentation. The moment your file starts to get big, you should use a stream-based XML parser (e.g. a "SAX" parser) for increased speed and lower memory usage. Unfortunately, there is currently no out-of-the-box tagger/splitter using a stream-based approach with XML. You would need to create one yourself.
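As a rough sketch of what the stream-based approach looks like with the standard JDK SAX API (the "invention-title" element name is only an assumed placeholder for whatever field you would want to extract from the USPTO files):

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class StreamingTitleExtractor {
    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File("patents-chunk-01.xml"), new DefaultHandler() {
            private boolean inTitle;
            private final StringBuilder title = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                    String qName, Attributes attrs) {
                // "invention-title" is a hypothetical element name.
                if ("invention-title".equals(qName)) {
                    inTitle = true;
                    title.setLength(0);
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inTitle) {
                    title.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("invention-title".equals(qName)) {
                    inTitle = false;
                    System.out.println("Title: " + title);
                }
            }
        });
    }
}
```

Only the current element's text is ever held in memory, so the file size has almost no impact on heap usage.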

So if simply adding more RAM does not work for you, your options are probably to deal with much smaller files or to create a custom SAX-based tagger/splitter.
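A custom stream-based splitter would follow a similar pattern. As a sketch only (the "us-patent-grant" record name is assumed, and the wiring into the Importer handler interfaces is left out), the JDK's StAX API can copy each record element into its own small XML snippet:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

public class StreamingRecordSplitter {
    public static void main(String[] args) throws Exception {
        // "us-patent-grant" is an assumed record element name.
        String recordTag = "us-patent-grant";
        try (InputStream in = new FileInputStream("patents-chunk-01.xml")) {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(in);
            XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
            StringWriter current = null;
            XMLEventWriter writer = null;
            int depth = 0;

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    if (writer == null
                            && recordTag.equals(start.getName().getLocalPart())) {
                        // New record: start buffering its events.
                        current = new StringWriter();
                        writer = outFactory.createXMLEventWriter(current);
                        depth = 0;
                    }
                    if (writer != null) {
                        depth++;
                    }
                }
                if (writer != null) {
                    writer.add(event);
                }
                if (event.isEndElement() && writer != null) {
                    depth--;
                    if (depth == 0) {
                        writer.close();
                        // Each record is now a small standalone XML snippet.
                        handleRecord(current.toString());
                        writer = null;
                    }
                }
            }
        }
    }

    private static void handleRecord(String recordXml) {
        // Placeholder: hand this to your own per-document processing.
        System.out.println("Record length: " + recordXml.length());
    }
}
```

Each snippet stays small no matter how large the parent file is, which is the main advantage over building the full DOM tree.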

DOMTagger/DOMSplitter rely on jsoup (https://jsoup.org/) for parsing files and creating a DOM tree. You may want to check there whether people have had better luck parsing large files, but in the end, what you really need is a SAX (or other stream-based) parser.