Open legolego opened 5 years ago
I am not able to reproduce the errors above. The second one looks similar to the previously fixed issue #81.
OK, I got the latest version and tried it with the files I sent, and it didn't fail. I tried again with the large source zip files (ipg121030.zip and ipg120417.zip), and it did fail. I made small XML files containing the previous patent numbers (US8299092B2 and USPP022671P2) and their respective next patents from the large source zip files, and they failed again. The new XML files are attached.
patents2.zip
Here's one more place where the latest transformer code fails, file attached.
I think US9524869 is the patent that fails.
161220.zip
2019-03-28 18:30:27,349 INFO [main] TransformerCli - --- Start ---
2019-03-28 18:30:41,630 INFO [main] TransformerCli - Dump File[1]: D:\patents\161220.xml
2019-03-28 18:30:41,631 INFO [main] PatentDocFormatDetect - PatentDocFormat fromFileName: CpcMasterFile
2019-03-28 18:30:41,635 INFO [main] PatentDocFormatDetect - PatentType fromContent: RedbookGrant
2019-03-28 18:30:42,300 INFO [main] TransformerCli - Record: 'US9524868B2' from D:\patents\161220.xml:2
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: begin 0, end 2, length 1
    at java.base/java.lang.String.checkBoundsBeginEnd(Unknown Source)
    at java.base/java.lang.String.substring(Unknown Source)
    at gov.uspto.patent.doc.xml.items.DocumentIdNode.read(DocumentIdNode.java:63)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.readPatCitations(CitationNode.java:144)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.read(CitationNode.java:63)
    at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:113)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:90)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:187)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:129)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:307)
Fixed the current issue: the StringIndexOutOfBoundsException on short document numbers (early patent numbers with a length below 3).
I still need to look at the trailing document issue you noted above; I believe it may be due to enclosed XML tags.
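For readers hitting the same trace: "begin 0, end 2, length 1" is what `String.substring(0, 2)` throws on a one-character string. The following is a minimal sketch of a bounds-safe prefix, not the actual change made in `DocumentIdNode`; the class name `SafePrefix` and the method `prefix` are hypothetical and exist only for this illustration.

```java
// Hypothetical helper for illustration; not part of the project's code base.
final class SafePrefix {
    private SafePrefix() {}

    /**
     * Returns at most the first n characters of s.
     * Short or null input yields a shorter (or empty) result instead of
     * a StringIndexOutOfBoundsException.
     */
    static String prefix(String s, int n) {
        if (s == null || n <= 0) {
            return "";
        }
        // Clamp the end index to the string length before calling substring.
        return s.substring(0, Math.min(n, s.length()));
    }

    public static void main(String[] args) {
        System.out.println(prefix("1", 2));        // one-character document number
        System.out.println(prefix("08299092", 2)); // longer document number
    }
}
```

Clamping the end index keeps very short document numbers parseable; whether a short prefix should then be treated as a country code or skipped is a separate decision for the caller.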
Hello, I found a couple more bugs. TransformerCli failed for these patents and dropped out to the command prompt. The two source XML files are attached.
patents.zip
2019-03-15 17:49:05,394 INFO [main] TransformerCli - Record: 'US8299092B2' from D:\patents\ipg121030.zip:2659
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: begin 0, end 2, length 1
    at java.base/java.lang.String.checkBoundsBeginEnd(Unknown Source)
    at java.base/java.lang.String.substring(Unknown Source)
    at gov.uspto.patent.doc.xml.items.DocumentIdNode.read(DocumentIdNode.java:60)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.readPatCitations(CitationNode.java:144)
    at gov.uspto.patent.doc.xml.fragments.CitationNode.read(CitationNode.java:63)
    at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:113)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:90)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:187)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:129)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:307)
and
2019-03-16 09:34:18,090 INFO [main] TransformerCli - Record: 'USPP022671P2' from D:\patents\ipg120417.zip:435
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 0
    at java.base/java.lang.StringLatin1.charAt(Unknown Source)
    at java.base/java.lang.String.charAt(Unknown Source)
    at gov.uspto.common.text.StringCaseUtil.toTitleCase(StringCaseUtil.java:102)
    at gov.uspto.patent.doc.xml.GrantParser.parse(GrantParser.java:69)
    at gov.uspto.parser.dom4j.Dom4JParser.parse(Dom4JParser.java:90)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:82)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:187)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:129)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:307)
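The second trace fails inside `StringCaseUtil.toTitleCase` with "String index out of range: 0", which means `charAt` was called on an empty string. A rough sketch of the kind of guard that avoids this is shown here. It is only an illustration under assumptions: `TitleCaseGuard` and `toTitleCaseSafe` are hypothetical names, the method handles a single token only, and it is not the project's real `toTitleCase` implementation.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not the project's StringCaseUtil.
final class TitleCaseGuard {
    private TitleCaseGuard() {}

    /**
     * Title-cases a single token. Null or empty input returns an empty
     * string instead of reaching charAt(0).
     */
    static String toTitleCaseSafe(String word) {
        if (word == null || word.isEmpty()) {
            return "";
        }
        // Upper-case the first character, lower-case the remainder.
        return Character.toUpperCase(word.charAt(0))
                + word.substring(1).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toTitleCaseSafe("pLANT"));
        System.out.println(toTitleCaseSafe("").isEmpty());
    }
}
```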