USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

Transformer Unknown output type: xml #95

Closed by sotnikov-s 4 years ago

sotnikov-s commented 4 years ago

The documentation at https://github.com/USPTO/PatentPublicData/blob/master/Tools.md#transform-read-normalize-transform says that xml can be passed as an argument to the --type flag:

--type [String: types options: [raw,xml,json,json_flat,patft,object,text]]

But when I do so, I get an error:

java -cp "target/BulkDownloader-0.0.1-SNAPSHOT.jar:target/dependency-jars/*" gov.uspto.bulkdata.cli.Transformer -f="pftaps20010102_wk01.zip" --type="xml" --outDir="." --outBulk=false --kv=true
log4j:WARN No appenders could be found for logger (gov.uspto.patent.PatentDocFormatDetect).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" java.lang.RuntimeException: Unknown output type: xml
    at gov.uspto.bulkdata.tools.transformer.TransformerRecordProcessor.writeOutputType(TransformerRecordProcessor.java:167)
    at gov.uspto.bulkdata.tools.transformer.TransformerRecordProcessor.process(TransformerRecordProcessor.java:90)
    at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:195)
    at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:122)
    at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:85)
    at gov.uspto.bulkdata.RecordReader.read(RecordReader.java:43)
    at gov.uspto.bulkdata.cli.Transformer.exec(Transformer.java:77)
    at gov.uspto.bulkdata.cli.Transformer.main(Transformer.java:115)

Could you please let me know which is wrong: my usage or the documentation?

bgfeldm commented 4 years ago

Thank you for pointing out the discrepancy. I will update the documentation.

Currently most of the output formats are JSON-based (json, json_flat, patft, and solr). I could provide an abstraction that allows the document builder to support different output document formats.
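To illustrate the kind of abstraction mentioned above, here is a minimal sketch in which each output type maps to its own serializer. This is not code from PatentPublicData: the names `OutputFormatSketch`, `DocumentSerializer`, `SERIALIZERS`, and `render` are hypothetical, the document is reduced to a flat map of string fields, and escaping of special characters is left out for brevity.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical sketch only; none of these names exist in PatentPublicData.
final class OutputFormatSketch {

    /** Assumed abstraction: one serializer per supported output type. */
    interface DocumentSerializer {
        String serialize(Map<String, String> fields);
    }

    /** Registry of supported output types, keyed by the --type value. */
    static final Map<String, DocumentSerializer> SERIALIZERS = new LinkedHashMap<>();

    static {
        // Naive JSON output; values are not escaped in this sketch.
        SERIALIZERS.put("json", fields -> {
            StringJoiner joiner = new StringJoiner(",", "{", "}");
            fields.forEach((k, v) -> joiner.add("\"" + k + "\":\"" + v + "\""));
            return joiner.toString();
        });
        // Naive XML output; values are not escaped in this sketch.
        SERIALIZERS.put("xml", fields -> {
            StringBuilder sb = new StringBuilder("<doc>");
            fields.forEach((k, v) ->
                sb.append('<').append(k).append('>')
                  .append(v)
                  .append("</").append(k).append('>'));
            return sb.append("</doc>").toString();
        });
    }

    /** Looks up the serializer for a type and fails with a list of supported types. */
    static String render(String type, Map<String, String> fields) {
        DocumentSerializer serializer = SERIALIZERS.get(type);
        if (serializer == null) {
            throw new IllegalArgumentException(
                "Unknown output type: " + type + "; supported: " + SERIALIZERS.keySet());
        }
        return serializer.serialize(fields);
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", "US1234");
        fields.put("title", "Widget");
        System.out.println(render("json", fields));
        System.out.println(render("xml", fields));
    }
}
```

With a registry like this, the list of types shown in the CLI help could be generated from the registered keys, so the advertised options and the supported ones would not drift apart.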