USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

grant pba*.zip and yyyy.zip processing - TransformerCli fails #55

Open ThomasHeliase opened 7 years ago

ThomasHeliase commented 7 years ago

Attempting to process the older Greenbook-format grant_bibliographic files fails with either a Java heap space OutOfMemoryError or a java.util.NoSuchElementException.

Is this a gap in format coverage, or were these files never intended to be processed, and is there a better source?

Good examples are the 1990 and 1998 files: a single .dat file covers the whole year, and the weekly pba*.zip files that are also supplied fail to load as well.

http://patentscur.reedtech.com/downloads/GrantRedBookBib/1990/1990.zip
http://patentscur.reedtech.com/downloads/GrantRedBookBib/1998/pba19980106_wk01.zip
http://patentscur.reedtech.com/downloads/GrantRedBookBib/1998/1998.zip

(The GrantRedBookBib subfolder name at the source appears misleading: the text files, when manually extracted, are clearly APS.)

Both zips fail in TransformerCli with a NoSuchElementException, and the file appears to be misclassified as CpcMasterFile format. Log:

2017-05-02 18:03:57,678 INFO [ main] TransformerCli - --- Start ---
2017-05-02 18:03:57,709 INFO [ main] 1998.zip TransformerCli - Dump File[1]: C:\data\out\uspto\grant_bibliographic\1998\1998.zip
2017-05-02 18:03:57,709 INFO [ main] 1998.zip PatentDocFormatDetect - PatentDocFormat fromFileName: CpcMasterFile
2017-05-02 18:03:57,724 INFO [ main] 1998.zip ZipReader - Reading zip file: C:\data\out\uspto\grant_bibliographic\1998\1998.zip
Exception in thread "main" java.util.NoSuchElementException
    at gov.uspto.common.file.archive.ZipReader.next(ZipReader.java:122)
    at gov.uspto.patent.bulk.DumpFile.open(DumpFile.java:65)
    at gov.uspto.patent.bulk.DumpFileXml.open(DumpFileXml.java:31)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:166)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:301)
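One way to narrow down a NoSuchElementException like this is to check what the archive actually contains, independently of the project's own ZipReader. The following is a minimal sketch using only the JDK's java.util.zip API; the class name ZipEntryLister is hypothetical and not part of PatentPublicData, and it makes no claim about how ZipReader selects entries internally.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical diagnostic helper, not part of PatentPublicData.
class ZipEntryLister {

    // Returns the names of all non-directory entries in the given zip archive.
    static List<String> entryNames(Path zip) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (!entry.isDirectory()) {
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ZipEntryLister <archive.zip>");
            return;
        }
        // Print one entry name per line so the file extensions are easy to inspect.
        for (String name : entryNames(Paths.get(args[0]))) {
            System.out.println(name);
        }
    }
}
```

Running it against 1998.zip would show whether the archive holds a .dat file rather than the kind of entry the detected format expects, which may help explain the failure.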

Manually extracting the text file from the zip doesn't get much further. In the case of 1998, the file is too large for my VM (1.7GB):

2017-05-02 18:05:23,347 INFO [ main] TransformerCli - --- Start ---
2017-05-02 18:05:23,378 INFO [ main] 1998.dat TransformerCli - Dump File[1]: C:\data\out\uspto\grant_bibliographic\1998\1998.dat
2017-05-02 18:05:23,378 INFO [ main] 1998.dat PatentDocFormatDetect - PatentDocFormat fromFileName: CpcMasterFile
2017-05-02 18:05:23,378 INFO [ main] 1998.dat PatentDocFormatDetect - PatentType fromContent: Greenbook
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
    at java.lang.StringBuilder.append(StringBuilder.java:136)
    at gov.uspto.patent.bulk.DumpFileXml.read(DumpFileXml.java:66)
    at gov.uspto.patent.bulk.DumpFile.next(DumpFile.java:92)
    at gov.uspto.patent.bulk.DumpFile.next(DumpFile.java:1)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:173)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:301)

Or, using the 1990 file, the file size appears OK but the format can't be parsed:

2017-05-02 18:06:59,785 INFO [ main] TransformerCli - --- Start ---
2017-05-02 18:06:59,826 INFO [ main] 1990.dat TransformerCli - Dump File[1]: C:\data\out\uspto\grant_bibliographic\1990\1990.dat
2017-05-02 18:06:59,828 INFO [ main] 1990.dat PatentDocFormatDetect - PatentDocFormat fromFileName: CpcMasterFile
2017-05-02 18:06:59,831 INFO [ main] 1990.dat PatentDocFormatDetect - PatentType fromContent: Greenbook
Exception in thread "main" java.lang.NullPointerException
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:175)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:301)

bgfeldm commented 5 years ago

Try renaming the file to start with "pftaps".

Rename 1990.zip to pftaps1990.zip
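For anyone with several files to process, the rename can be scripted. The following is a minimal sketch using only java.nio.file; the class PftapsRename and its methods are hypothetical helpers, not part of PatentPublicData. It copies rather than moves, so the original download is kept. Only the 1990.zip case is covered by the suggestion above; whether the same prefix also works for the weekly pba*.zip files is an assumption that would need testing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of PatentPublicData.
class PftapsRename {

    // File-name prefix suggested in this thread for Greenbook (APS) dump files.
    static final String PREFIX = "pftaps";

    // Returns the file name with the prefix added, unless it is already present.
    static String prefixedName(String fileName) {
        return fileName.startsWith(PREFIX) ? fileName : PREFIX + fileName;
    }

    // Copies the source file next to itself under the prefixed name and
    // returns the path of the copy. The source file is left untouched.
    static Path copyWithPrefix(Path source) throws IOException {
        Path target = source.resolveSibling(prefixedName(source.getFileName().toString()));
        if (target.equals(source)) {
            return source; // already carries the prefix, nothing to do
        }
        return Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PftapsRename <file>");
            return;
        }
        System.out.println(copyWithPrefix(Paths.get(args[0])));
    }
}
```

Called with the path to 1990.zip, this writes pftaps1990.zip into the same directory and prints its path.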

In the future I may introduce an option to manually provide the patent type.

Also, try using the new transformer:

gov.uspto.bulkdata.cli.Transformer --input="./download/pftaps1990.zip" --skip=0 --limit=0 --type="json_flat" --outDir="./target/output" --bulkKV=true --outputBulkFile=true