USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

TransformerCli throws an NPE and stops processing on pftaps19790619_wk25.zip #39

Closed: pedagogly closed this issue 7 years ago

pedagogly commented 7 years ago
2017-01-18 14:00:24,305 INFO  [main] TransformerCli - Record: 'US4158475A' from pftaps19790619_wk25.zip:297
2017-01-18 14:00:24,309 WARN  [main] DocumentIdNode - Invalid document-id, field 'WKU' not found
Exception in thread "main" java.lang.NullPointerException
    at gov.uspto.patent.doc.greenbook.Greenbook.parse(Greenbook.java:92)
    at gov.uspto.parser.dom4j.keyvalue.KvParser.parse(KvParser.java:49)
    at gov.uspto.patent.PatentReader.read(PatentReader.java:70)
    at gov.uspto.patent.TransformerCli.processDumpFile(TransformerCli.java:178)
    at gov.uspto.patent.TransformerCli.process(TransformerCli.java:122)
    at gov.uspto.patent.TransformerCli.main(TransformerCli.java:296)

bgfeldm commented 7 years ago

I have a fix for the NPE, but I am also trying to fix the underlying cause.

The document has an extra line after PATN:

PATN

WKU  041584767
SRC  5
APN  8611106

This ends up giving PATN a blank value (note the space):

KeyValue [key=PATN, value= ], 
KeyValue [key=WKU, value=041584767], 
KeyValue [key=SRC, value=5], 
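
For illustration only, here is a minimal sketch of a line reader that tolerates the stray blank line. It is not the project's KvParser: the class name `GreenbookLineSketch` and the `parse` method are hypothetical, it assumes the key is the first whitespace-delimited token on each line, and it ignores continuation lines entirely.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the project's KvParser.
class GreenbookLineSketch {

    // Turns raw lines into key/value pairs, dropping blank lines so that
    // a section header such as PATN ends up with an empty value, not " ".
    static List<Map.Entry<String, String>> parse(List<String> lines) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip stray blank lines such as the one after PATN
            }
            // Assumption: the key is the first whitespace-delimited token.
            String[] parts = line.trim().split("\\s+", 2);
            String value = parts.length > 1 ? parts[1].trim() : "";
            out.add(new SimpleImmutableEntry<>(parts[0], value));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "PATN", "", "WKU  041584767", "SRC  5", "APN  8611106");
        for (Map.Entry<String, String> kv : parse(lines)) {
            System.out.println("key=" + kv.getKey() + ", value=" + kv.getValue());
        }
    }
}
```

With the sample lines above, PATN comes out with an empty value and the blank line produces no entry of its own.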

The blank PATN value then gets mapped to incorrect XML; PATN should wrap all the fields:

<DOCUMENT><PATN> </PATN><WKU>041584767</WKU>
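
To make the intended nesting concrete, here is a rough sketch of the wrapping behaviour. It is not the project's implementation: `SectionWrapSketch`, `toXml`, and the set of section names passed in are all hypothetical, and it builds a plain string rather than using dom4j. It treats a section key whose value is empty or whitespace-only as a section header and nests the following fields inside it.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; the real mapping lives in the project's own parser.
class SectionWrapSketch {

    static String toXml(List<Map.Entry<String, String>> kvs, Set<String> sections) {
        StringBuilder xml = new StringBuilder("<DOCUMENT>");
        String open = null; // name of the currently open section, if any
        for (Map.Entry<String, String> kv : kvs) {
            String key = kv.getKey();
            String value = kv.getValue().trim(); // " " is treated like ""
            if (sections.contains(key) && value.isEmpty()) {
                if (open != null) {
                    xml.append("</").append(open).append('>');
                }
                xml.append('<').append(key).append('>');
                open = key; // following fields are nested in this section
            } else {
                xml.append('<').append(key).append('>')
                   .append(escape(value))
                   .append("</").append(key).append('>');
            }
        }
        if (open != null) {
            xml.append("</").append(open).append('>');
        }
        return xml.append("</DOCUMENT>").toString();
    }

    // Minimal escaping of XML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> kvs = Arrays.asList(
                new SimpleImmutableEntry<>("PATN", " "),
                new SimpleImmutableEntry<>("WKU", "041584767"),
                new SimpleImmutableEntry<>("SRC", "5"));
        System.out.println(toXml(kvs, Collections.singleton("PATN")));
    }
}
```

For the three key-values shown earlier this yields `<DOCUMENT><PATN><WKU>041584767</WKU><SRC>5</SRC></PATN></DOCUMENT>`, with PATN wrapping its fields.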

I should have a fix soon.