USPTO / PatentPublicData

Utility tools to help download and parse patent data made available to the public

Cannot read bulk #46

Closed (patricknee closed this issue 7 years ago)

patricknee commented 7 years ago

The current version of TransformerCli generates .bulk files that cannot be iterated over with Python 3.5.

Previously, the Python code below could iterate over a bulk file generated with TransformerCli. With a bulk file generated by the currently pulled version of the Java code, the script crashes with the following message:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1164: ordinal not in range(128)

The error does not occur at the start of the file: in the files I tested, as many as 100 patents were processed successfully before the crash.
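
For context, byte 0xe2 is the lead byte of many three-byte UTF-8 sequences (curly quotes, dashes, and similar punctuation), so the message is consistent with UTF-8 text being decoded as ASCII. Below is a minimal sketch of that failure mode. The sample record, its `documentId` value, and the choice of U+2019 (right single quotation mark) are assumptions for illustration, not data taken from the bulk files:

```python
# Hypothetical sample record containing U+2019, encoded as UTF-8 bytes.
raw = '{"documentId": "US1234567", "title": "Inventor\u2019s device"}'.encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return "error: {}".format(err)


ascii_result = try_decode(raw, "ascii")  # fails on the first non-ASCII byte
utf8_result = try_decode(raw, "utf-8")   # decodes the whole record

print(ascii_result)
print(utf8_result)
```

This would also explain why the crash appears only some way into a file: records that contain only ASCII characters decode fine under either codec.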

This has been shown to occur on the following files:

pftaps19780103_wk01.bulk
pftaps19840103_wk01.bulk
ipg100112.bulk

I am using Python 3.5. To run the script (on my Anaconda setup), type:

python3 fileName.py /Users/patrick/pftaps19780103_wk01.bulk

Put the following code into fileName.py:

import os
import sys
import json

class objIterate(object):

    def IterateOverFile(self, fileName):
        with open(fileName) as file:
            for patentJson in file:
                patent = json.loads(patentJson)
                documentID = patent['documentId']
                print("{}".format(documentID))

if __name__ == '__main__' and __package__ is None:

    inserter = objIterate()
    fileName = sys.argv[1]
    inserter.IterateOverFile(fileName)

patricknee commented 7 years ago

In case it helps, the version I have that works is based on commit 194b5db from 2016-11-11, so the change was introduced sometime after that. Because of this, I'm not able to easily pull in the recent fixes.

bgfeldm commented 7 years ago

I made changes to ensure Unicode characters are kept.

Your Python code can be updated for Unicode:

with open('unicode.txt', encoding='utf-8') as file:

https://docs.python.org/3/howto/unicode.html
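
Applied to the script from the original report, that suggestion might look like the sketch below. The function name `iterate_document_ids` and the blank-line check are assumptions added for illustration; the `documentId` key and the one-JSON-record-per-line layout are taken from the script above.

```python
import json


def iterate_document_ids(file_name):
    """Yield the documentId of each JSON record in a .bulk file (one record per line)."""
    # Passing encoding explicitly avoids relying on the platform default,
    # which is ASCII on some setups.
    with open(file_name, encoding="utf-8") as bulk_file:
        for line in bulk_file:
            if not line.strip():
                continue  # assumption: skip blank lines, if there are any
            patent = json.loads(line)
            yield patent["documentId"]


# Example use (the path is the one from the original report):
# for document_id in iterate_document_ids("/Users/patrick/pftaps19780103_wk01.bulk"):
#     print(document_id)
```

Without the `encoding` argument, `open()` uses the locale's preferred encoding, which would account for the same script behaving differently across machines.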

patricknee commented 7 years ago

Thank you for debugging my code! ;-)