USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

TransformerCli slows, then blows up after processing ~1000 records (with 300MB heap). #88

Closed zanerock closed 4 years ago

zanerock commented 5 years ago

While processing the weekly dump of granted patent records, TransformerCli first slows down and then hangs. This has all the hallmarks of a memory issue, but I have not dug into debugging to confirm.

By "slow down", I mean that when watching the output logs to stdout, it processes fewer and fewer records before pausing, eventually processing ~1-4 records before hanging indefinitely.

Fixing this issue or #87 (which would allow processing of the data in chunks) seems to be necessary for processing large data sets.

Executed with:

# Resolve the project root relative to this script and build the classpath
PROJECTPATH=$( cd "$(dirname "$0")/.." ; pwd -P )
CLASSPATH="${PROJECTPATH}/lib/*:${PROJECTPATH}/lib/dependency-jars/*"
JAVA="java -cp ${CLASSPATH} -Dlog4j.configuration=file:${PROJECTPATH}/conf/log4j.properties"
${JAVA} gov.uspto.patent.TransformerCli --input "$FILE" --stdout

FILE was tested with ipg190326.zip (which consistently hangs at record 991), ipg190402.zip (hangs at record 1041), and ipg190409.zip (hangs at record 1017). Each run used a heap size of 300MB (as I remember it).
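For reference, the 300MB cap would have been set on the JAVA line, something like the following (from memory, so treat the exact -Xmx300m value as an assumption):

# Assumed heap setting for the runs above (exact value not confirmed)
JAVA="java -Xmx300m -cp ${CLASSPATH} -Dlog4j.configuration=file:${PROJECTPATH}/conf/log4j.properties"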

zanerock commented 5 years ago

The behavior would be consistent with increasingly frequent garbage collection, though an OutOfMemoryError is never thrown (at least not before I kill the process). If the maintainers could weigh in on which they believe would be easier to address, I may be able to work on this issue or #87 in order to enable processing of arbitrarily large data sets.
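A quick way to confirm the GC theory would be to add GC logging to the invocation above; a sketch I have not run yet:

# Java 8: log GC activity to see whether pause frequency climbs near record ~1000
${JAVA} -verbose:gc -XX:+PrintGCDetails gov.uspto.patent.TransformerCli --input "$FILE" --stdout
# Java 9+ unified logging equivalent:
# ${JAVA} -Xlog:gc gov.uspto.patent.TransformerCli --input "$FILE" --stdout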

bgfeldm commented 5 years ago

Some patents are huge, and some contain many large tables that take more than Java's default heap to process. And because the bulk files are written sequentially, the larger and more complex patents tend to cluster around the same area of each bulk file.

Patents with about 100MB of text are handled within gov.uspto.patent.PatentReader, which either skips them or drops the large fields and continues. A 100MB patent, once read into a DOM, will be 3+ times that size. The description field is read into a DOM twice, which requires a lot of memory on these large patents.
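Rough arithmetic, using the figures above: 100MB of text becomes roughly 300MB as a DOM, and reading the description into a second DOM adds a large share of that again, so a single worst-case patent can blow through a 300MB heap on its own.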

I have two suggestions (a sketch against your script is below):

1) Try setting a larger max heap with -Xmx2g.
2) Try using the newer transformer, gov.uspto.bulkdata.cli.Transformer, which supports skipping.
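Applied to your script above, suggestion 1 is a one-line change (the only addition is -Xmx2g); for suggestion 2, check the Transformer class's usage output for its exact options, which I have not reproduced here:

# Suggestion 1: raise the max heap on the existing invocation
JAVA="java -Xmx2g -cp ${CLASSPATH} -Dlog4j.configuration=file:${PROJECTPATH}/conf/log4j.properties"
${JAVA} gov.uspto.patent.TransformerCli --input "$FILE" --stdout

# Suggestion 2: the newer CLI is gov.uspto.bulkdata.cli.Transformer; its options
# differ from TransformerCli, so check its help output before running.
# ${JAVA} gov.uspto.bulkdata.cli.Transformer ...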

zanerock commented 5 years ago

That makes sense; I'll give it a try.