USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

grants-bibliographic 2001-2004 transforms failing with java.io.IOException: Stream closed. #116

Open ThomasHeliase opened 3 years ago

ThomasHeliase commented 3 years ago

The current Java build cannot process the pgb (SGML) files: it produces no files, empty files, or one or two files containing a single record. Every error in the console is a 'gov.uspto.patent.PatentReaderException: java.io.IOException: Stream closed' error.

Same behaviour for 2001, 2002, 2003 and 2004. 2001 is the worst, with 0 files produced.

The issue is not present once the format changes in 2005.

Production is Ubuntu 20.04; tested on Win10 as well, with the same result. I suspect a JDK compatibility problem with SAXParser: it seems to fail at the record line terminator, but not consistently.

JDKs tested: OpenJDK 11 and 14, compiled in 1.8 compatibility mode.

Command to reproduce:

    java -Dlog4j.configuration=file:log4j.properties -jar uspto-transform.jar gov.uspto.bulkdata.cli.Transformer -f "/extract/uspto/grant-bibliographic/2003/pgb20030114_wk02.zip" --type "json_flat" --outDir "/transform/uspto/grant-bibliographic/2003/" --outBulk true --prettyPrint false

Sample of console error:

    2021-02-22 11:20:15,940 INFO [ main] :: PatentDocFormatDetect - PatentDocFormat fromFileName: Sgml
    2021-02-22 11:20:15,943 INFO [ main] :: PatentDocFormatDetect - PatentDocFormat fromFileName: Sgml
    2021-02-22 11:20:15,951 INFO [ main] pgb20040106_wk01.zip:: ZipReader - Reading zip file: /extract/uspto/grant-bibliographic/2004/pgb20040106_wk01.zip
    2021-02-22 11:20:15,990 INFO [ main] pgb20040106_wk01.zip:: ZipReader - Found 1 file[FileFilter [matchRules=[SuffixFileFilter(xml,sgm,sgml)]]]: pgb20040106.xml
    2021-02-22 11:20:15,993 INFO [ main] pgb20040106_wk01.zip:: PatentDocFormatDetect - PatentType fromContent: Sgml
    Exception in thread "main" gov.uspto.patent.PatentReaderException: java.io.IOException: Stream closed
        at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:117)
        at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
        at gov.uspto.bulkdata.tools.transformer.TransformerRecordProcessor.process(TransformerRecordProcessor.java:72)
        at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:195)
        at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:122)
        at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:85)
        at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:43)
        at gov.uspto.bulkdata.cli.Transformer.exec(Transformer.java:77)
        at gov.uspto.bulkdata.cli.Transformer.main(Transformer.java:115)
    Caused by: java.io.IOException: Stream closed
        at java.base/java.io.StringReader.ensureOpen(StringReader.java:56)
        at java.base/java.io.StringReader.reset(StringReader.java:188)
        at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:114)
        ... 8 more
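
For anyone trying to narrow this down: the "Caused by" frames show StringReader.reset being called on a reader that is already closed. The sketch below reproduces only that JDK behaviour in isolation. It is not the project's code; the class and method names (StreamClosedDemo, resetAfterClose, readTwice) are made up for illustration, and the idea that a parser closed the reader before the reset is an assumption, not something the trace confirms.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class StreamClosedDemo {

    // Closes a StringReader and then calls reset() on it, returning the
    // message of the resulting IOException (or null if none is thrown).
    static String resetAfterClose(String text) {
        StringReader reader = new StringReader(text);
        reader.close();          // stands in for whatever closed the reader (assumption)
        try {
            reader.reset();      // same JDK call as in the "Caused by" frames
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    // Reads the whole reader into a String.
    private static String drain(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    // Hypothetical workaround: keep the record as a String and open a fresh
    // reader for every pass instead of calling reset() on a shared one.
    static boolean readTwice(String text) {
        try (Reader first = new StringReader(text);
             Reader second = new StringReader(text)) {
            return drain(first).equals(drain(second));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resetAfterClose("<PATDOC/>")); // prints "Stream closed"
        System.out.println(readTwice("<PATDOC/>"));       // prints "true"
    }
}
```

If the reader in Dom4JParser is being closed by a first parse attempt, building a new StringReader from the original record text for the second attempt would avoid the reset entirely. Whether that is the right fix for this project is untested.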

ThomasHeliase commented 3 years ago

Also note: the issue occurs on the grant-fulltext 2001 pg*.zip files as well.

When compiling with tests on Linux (Java 1.8.0_281), the following SGML tests fail, also with an IOException: Stream closed error:

Tests in error: 
  multipleAssignee(gov.uspto.patent.doc.sgml.SgmlTest): java.io.IOException: Stream closed
  readSamples(gov.uspto.patent.doc.sgml.SgmlTest): java.io.IOException: Stream closed
gabriele-di-bona commented 2 years ago

I have a similar issue using gov.uspto.bulkdata.cli.View. Has anyone found a solution?