USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

TransformerCLI fails for two records #84

Open legolego opened 5 years ago

legolego commented 5 years ago

Hello, I found a couple more bugs: TransformerCLI failed on these patents and dropped out to the command prompt. The two source XML files are attached.

patents.zip

2019-03-15 17:49:05,394 INFO [main] TransformerCli - Record: 'US8299092B2' from D:\patents\ipg121030.zip:2659
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: begin 0, end 2, length 1
    at java.base/java.lang.String.checkBoundsBeginEnd(Unknown Source)
    at java.base/java.lang.String.substring(Unknown Source)
    at gov.uspto.patent.doc.xml.items.DocumentIdNode.read(DocumentIdNode.java:60)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.readPatCitations(CitationNode.java:144)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.read(CitationNode.java:63)
    at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:113)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:90)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:187)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:129)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:307)

and

2019-03-16 09:34:18,090 INFO [main] TransformerCli - Record: 'USPP022671P2' from D:\patents\ipg120417.zip:435
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0
    at java.base/java.lang.StringLatin1.charAt(Unknown Source)
    at java.base/java.lang.String.charAt(Unknown Source)
    at gov.uspto.common.text.StringCaseUtil.toTitleCase(StringCaseUtil.java:102)
    at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:69)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:90)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:187)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:129)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:307)

bgfeldm commented 5 years ago

I am not able to reproduce the errors above. The second one looks similar to the previously fixed issue #81.
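For reference, the second trace ends in String.charAt inside StringCaseUtil.toTitleCase with "String index out of range: 0", which is what charAt(0) throws on an empty string. Below is a minimal sketch of a guard against that case. The class TitleCaseSketch and the method toTitleCaseSafe are hypothetical names for illustration; this is not the project's StringCaseUtil code, and the word-splitting behavior is an assumption.

```java
public class TitleCaseSketch {

    /**
     * Hypothetical stand-in for a title-case helper. Returns an empty
     * string for null, empty, or whitespace-only input instead of
     * calling charAt(0) on it.
     */
    public static String toTitleCaseSafe(String text) {
        if (text == null || text.trim().isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder();
        // After trim() the input is non-empty and has no leading
        // whitespace, so split("\\s+") yields only non-empty words.
        for (String word : text.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(Character.toUpperCase(word.charAt(0)));
            out.append(word.substring(1).toLowerCase());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTitleCaseSafe("hello WORLD"));
        System.out.println(toTitleCaseSafe("").isEmpty());
    }
}
```

Whether an empty value should be returned as-is or reported as a malformed record is a design choice the sketch does not settle.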

legolego commented 5 years ago

OK, I got the latest version and tried it with the files I sent, and it didn't fail. I tried again with the large source zip files (ipg121030.zip and ipg120417.zip) and it did fail. I made small XML files containing the previously reported patents (US8299092B2 and USPP022671P2) and the patent that follows each one in the large source zip files, and they failed again. The new XML files are attached.

patents2.zip

legolego commented 5 years ago

Here's one more place where the latest transformer code fails, file attached. I think US9524869 is the file failing.

161220.zip

2019-03-28 18:30:27,349 INFO [main] TransformerCli - --- Start ---
2019-03-28 18:30:41,630 INFO [main] TransformerCli - Dump File[1]: D:\patents\161220.xml
2019-03-28 18:30:41,631 INFO [main] PatentDocFormatDetect - PatentDocFormat fromFileName: CpcMasterFile
2019-03-28 18:30:41,635 INFO [main] PatentDocFormatDetect - PatentType fromContent: RedbookGrant
2019-03-28 18:30:42,300 INFO [main] TransformerCli - Record: 'US9524868B2' from D:\patents\161220.xml:2
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: begin 0, end 2, length 1
    at java.base/java.lang.String.checkBoundsBeginEnd(Unknown Source)
    at java.base/java.lang.String.substring(Unknown Source)
    at gov.uspto.patent.doc.xml.items.DocumentIdNode.read(DocumentIdNode.java:63)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.readPatCitations(CitationNode.java:144)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.read(CitationNode.java:63)
    at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:113)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:90)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:187)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:129)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:307)

bgfeldm commented 5 years ago

Fixed the current issue: the Index Out Of Bounds error on short document numbers (early patent numbers) with length below 3.
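To illustrate the failure mode: the traces show String.substring called with begin 0 and end 2 on a string of length 1, which throws for any value shorter than two characters. The following is a minimal sketch of a length check, not the actual commit. The class CountryPrefixSketch and the method leadingCountryCode are hypothetical, and treating the first two characters as a country-code candidate is an assumption about what DocumentIdNode does at that line.

```java
public class CountryPrefixSketch {

    /**
     * Hypothetical helper: returns the first two characters of a
     * document number, or an empty string when the value is null or
     * too short for substring(0, 2) to succeed.
     */
    public static String leadingCountryCode(String docNumber) {
        if (docNumber == null) {
            return "";
        }
        String trimmed = docNumber.trim();
        // substring(0, 2) needs at least two characters; a one-digit
        // early patent number such as "7" would otherwise throw
        // StringIndexOutOfBoundsException.
        if (trimmed.length() < 2) {
            return "";
        }
        return trimmed.substring(0, 2);
    }

    public static void main(String[] args) {
        System.out.println(leadingCountryCode("US8299092"));
        System.out.println(leadingCountryCode("7").isEmpty());
    }
}
```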

I still need to look at the trailing-document issue you noted above; I believe it may be due to enclosed XML tags.