Open ThomasHeliase opened 3 years ago
Also note: the issue occurs on the grant-fulltext 2001 pg*.zip files as well.
When compiling with tests on Linux (Java 1.8.0_281), the following SGML tests also fail with an IOException: Stream closed
error:
Tests in error:
multipleAssignee(gov.uspto.patent.doc.sgml.SgmlTest): java.io.IOException: Stream closed
readSamples(gov.uspto.patent.doc.sgml.SgmlTest): java.io.IOException: Stream closed
I have a similar issue using gov.uspto.bulkdata.cli.View. Has anyone found a solution?
Unable to process pgb (SGML) files with current Java: a run produces either no files, empty files, or 1-2 files with a single record. All errors in the console are 'gov.uspto.patent.PatentReaderException: java.io.IOException: Stream closed' errors.
Same behaviour for 2001, 2002, 2003 and 2004. 2001 is worst, with 0 files produced.
The issue is not present once the format changes in 2005.
Production is Ubuntu 20.04; tested on Win10 as well, with the same result. I suspect a JDK compatibility problem with SAXParser. It seems to fail at the record line terminator, but not consistently.
JDKs tested: OpenJDK 11 and 14, compiling in 1.8 compatibility mode.
Command to reproduce:
java -Dlog4j.configuration=file:log4j.properties -jar uspto-transform.jar gov.uspto.bulkdata.cli.Transformer -f "/extract/uspto/grant-bibliographic/2003/pgb20030114_wk02.zip" --type "json_flat" --outDir "/transform/uspto/grant-bibliographic/2003/" --outBulk true --prettyPrint false
sample of console error:
2021-02-22 11:20:15,940 INFO [ main] :: PatentDocFormatDetect - PatentDocFormat fromFileName: Sgml
2021-02-22 11:20:15,943 INFO [ main] :: PatentDocFormatDetect - PatentDocFormat fromFileName: Sgml
2021-02-22 11:20:15,951 INFO [ main] pgb20040106_wk01.zip:: ZipReader - Reading zip file: /extract/uspto/grant-bibliographic/2004/pgb20040106_wk01.zip
2021-02-22 11:20:15,990 INFO [ main] pgb20040106_wk01.zip:: ZipReader - Found 1 file[FileFilter [matchRules=[SuffixFileFilter(xml,sgm,sgml)]]]: pgb20040106.xml
2021-02-22 11:20:15,993 INFO [ main] pgb20040106_wk01.zip:: PatentDocFormatDetect - PatentType fromContent: Sgml
Exception in thread "main" gov.uspto.patent.PatentReaderException: java.io.IOException: Stream closed
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:117)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.bulkdata.tools.transformer.TransformerRecordProcessor.process(TransformerRecordProcessor.java:72)
    at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:195)
    at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:122)
    at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:85)
    at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:43)
    at gov.uspto.bulkdata.cli.Transformer.exec(Transformer.java:77)
    at gov.uspto.bulkdata.cli.Transformer.main(Transformer.java:115)
Caused by: java.io.IOException: Stream closed
    at java.base/java.io.StringReader.ensureOpen(StringReader.java:56)
    at java.base/java.io.StringReader.reset(StringReader.java:188)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:114)
    ... 8 more
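The "Caused by" frames show StringReader.reset() being called on a reader that is already closed. The snippet below is a minimal JDK-only sketch of that behaviour, not the project's code: StreamClosedDemo, parseAndClose, resetAfterClose and retryWithFreshReader are hypothetical names, and the idea that a first parse attempt closes the reader before Dom4JParser calls reset() is an assumption that this thread does not confirm. It also shows one possible workaround under that assumption: keep the record text and build a fresh StringReader for a second attempt instead of resetting the old one.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class StreamClosedDemo {

    // Hypothetical stand-in for a parser that consumes its input
    // and then closes the Reader it was given.
    static void parseAndClose(Reader reader) throws IOException {
        while (reader.read() != -1) {
            // discard characters; a real parser would build a document here
        }
        reader.close();
    }

    // Calls reset() on a StringReader that has already been closed and
    // returns the resulting exception message.
    static String resetAfterClose(String text) {
        StringReader reader = new StringReader(text);
        try {
            parseAndClose(reader);
            reader.reset(); // throws: the reader was closed above
            return "no error";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    // Possible workaround: keep the String and create a new reader for the
    // second pass instead of resetting the first one.
    static String retryWithFreshReader(String text) {
        try {
            parseAndClose(new StringReader(text)); // first attempt
            StringBuilder secondPass = new StringBuilder();
            try (Reader fresh = new StringReader(text)) {
                int c;
                while ((c = fresh.read()) != -1) {
                    secondPass.append((char) c);
                }
            }
            return secondPass.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(resetAfterClose("<PATDOC></PATDOC>"));      // prints "Stream closed"
        System.out.println(retryWithFreshReader("<PATDOC></PATDOC>")); // prints "<PATDOC></PATDOC>"
    }
}
```

If the assumption holds, the inconsistent results across records would depend on which records trigger the retry path that calls reset(); that would need to be checked against Dom4JParser.java around lines 114-117.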