Closed by patricknee 7 years ago
In case it helps: the version I have that works is based on 194b5db from 2016-11-11, so the change was introduced sometime after that. That also means I'm not able to easily pull in the recent fixes.
I made changes to ensure Unicode characters are kept.
Your Python code can be updated for Unicode:
with open('unicode.txt', encoding='utf-8') as file:
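The one-line suggestion above can be expanded into a small, runnable sketch. It assumes the file is line-oriented text encoded as UTF-8; the helper name `iter_lines` and the temporary demo file are hypothetical and not part of the original script.

```python
import os
import tempfile


def iter_lines(path, encoding="utf-8"):
    """Yield lines from a text file, decoding with an explicit encoding.

    Hypothetical helper for illustration only.
    """
    # Passing encoding explicitly avoids relying on the locale default,
    # which can be ASCII in some environments.
    with open(path, encoding=encoding) as file:
        for line in file:
            yield line.rstrip("\n")


# Demo input: a temporary file containing one ASCII line and one line
# with non-ASCII characters (U+2019, right single quotation mark).
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as tmp:
    tmp.write("plain ascii\n")
    tmp.write("curly \u2019quote\u2019\n")
    demo_path = tmp.name

lines = list(iter_lines(demo_path))
os.remove(demo_path)
print(lines)
```

To try this against a real bulk file, pass its path to `iter_lines` in place of the temporary file.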
Thank you for debugging my code! ;-)
The current version of TransformerCli generates .bulk files that cannot be iterated over with Python 3.5.
Previously the following Python code was able to iterate over a bulk file generated with TransformerCli. The currently pulled version of the Java code generates a bulk file that crashes with the following message:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1164: ordinal not in range(128)
This generally occurs well into the file, after as many as 100 patents have been processed in the files I tested.
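As a rough illustration of the error above: byte 0xe2 is the lead byte of many three-byte UTF-8 sequences, such as the right single quotation mark U+2019 (encoded as e2 80 99). The sample string below is made up, and whether the bulk files contain this particular character is an assumption. The sketch shows that ASCII decoding fails only once the first non-ASCII byte is reached, which would be consistent with the crash appearing many patents into a file.

```python
# Hypothetical sample: ASCII text followed by U+2019, encoded as UTF-8.
raw = "inventor\u2019s claim".encode("utf-8")

# Decoding as ASCII works for the leading bytes but fails at the first
# byte outside the 0-127 range.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding the same bytes as UTF-8 succeeds.
decoded = raw.decode("utf-8")

print(error)
print(decoded)
```

In Python 3.5, `open()` without an `encoding` argument uses the locale's preferred encoding, so the same script can work on one machine and fail on another.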
This has been shown to occur on the following files:
pftaps19780103_wk01.bulk, pftaps19840103_wk01.bulk, ipg100112.bulk
Python 3.5; to run, type (on my Anaconda setup):
python3 fileName.py /Users/patrick/pftaps19780103_wk01.bulk
Put the following code into fileName.py: