AdamMeyers / The_Termolator

Terminology Extraction Program (English Version)

encoding issue #1

Open craigpfeifer opened 9 years ago

craigpfeifer commented 9 years ago

For the last known issue you posted : UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)

You mention that it is a clash of NLTK versions. Do you know which version(s) work?

I can say for a fact that v3.0.0, v3.0.1, and v3.0.4 (installed via pip ... nltk== ) have the issue on OSX 10.10.5 w/ Python 2.7.10 [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin.
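For reference, the error message itself can be reproduced without NLTK. The sketch below decodes UTF-8 bytes with the ASCII codec, which is roughly what Python 2 does implicitly when byte strings and unicode strings are mixed. The byte string and the helper name try_ascii_decode are made up for illustration and are not taken from the Termolator code or data.

```python
# Hypothetical sample: 7 ASCII bytes followed by a UTF-8 encoded
# non-breaking space (0xC2 0xA0). Not taken from the Termolator data.
raw = b"example\xc2\xa0text"


def try_ascii_decode(data):
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


message = try_ascii_decode(raw)
print(message)

# The same bytes decode cleanly once the correct codec is named.
print(repr(raw.decode("utf-8")))
```

The failure is reported at the first byte outside the 7-bit range, which is why the position in the message depends on the input text.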

AdamMeyers commented 9 years ago

Hi Craig,

I am testing a solution to this problem and will update github when I have finished the test. It worked OK on at least one initial test.

If you would like to try the solution, please edit $Termolator/term_extraction_v3/Filter.py (I see I cannot attach files to comments). Replace the function "stem" with the following:

def bad_unicode(string):
    for char in string:
        if ord(char)>127:
            print(char)
            return(True)

def stem(string):
    """Stem a phrase"""
    global stemmer
    if not stemmer:
        stemmer = Stemmer()

    words = string.split()

    #for i in range(len(words)):
    #    words[i] = self.stemmer.stem(words[i])
    # stemming last word only
    #string = self._reGlue(words)
    #
    #string2 = stemmer.stem(string)
    #if string2 not in stemdict:
    #    stemdict[string2] = string
    # FIX ME
    if string not in stemdict:
        if bad_unicode(string):
            ## added A. Meyers 8/28/15
            temp = False
        else:
            temp = stemmer.stem(string)
        if temp:
            stemdict[string] = temp
        if not temp:
            pass
        elif temp not in unstemdict:
            unstemdict[temp] = [string]
        elif string not in unstemdict[temp]:
            unstemdict[temp].append(string)
    else:
        temp = stemdict[string]
    return temp

Apparently, this was a common issue that others were having with nltk's Porter Stemmer.
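The guard described above (skip the stemmer whenever a phrase contains a character above code point 127) can be tried in isolation. This is only a minimal sketch: is_ascii, toy_stem, stem_cache, and safe_stem are hypothetical names, toy_stem is a stand-in rather than NLTK's Porter stemmer, and unlike the function above it also caches the skipped phrases.

```python
def is_ascii(text):
    """True if every character in text is in the 7-bit ASCII range."""
    return all(ord(char) < 128 for char in text)


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") else word


# phrase -> stem, or False when the phrase was skipped (assumption: the
# real code keeps separate stem and unstem dictionaries instead).
stem_cache = {}


def safe_stem(phrase):
    """Stem a phrase, skipping any phrase with non-ASCII characters."""
    if phrase not in stem_cache:
        stem_cache[phrase] = toy_stem(phrase) if is_ascii(phrase) else False
    return stem_cache[phrase]


print(safe_stem("parsers"))
print(safe_stem("na\u00efve parsers"))
```

Returning False for skipped phrases mirrors the function above, so callers can treat "no stem" and "not stemmed" the same way.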

Best,

Adam

On 08/24/2015 04:47 PM, craig pfeifer wrote:

For the last known issue you posted : UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)

You mention that it is a clash of NLTK versions. Do you know which version(s) work?

I can say for a fact that v3.0.4 has this issue.


craigpfeifer commented 9 years ago

Now I'm getting a different error:

Traceback (most recent call last):
  File "code/The_Termolator/filter_term_output.py", line 26, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "code/The_Termolator/filter_term_output.py", line 24, in main
    filter_terms(input_file,output_file,abbr_full_file,full_abbr_file,use_web_score,numeric_cutoff=max_term_number,reject_file=reject_file)
  File "code/The_Termolator/filter_terms.py", line 911, in filter_terms
    lines = instream.readlines()
  File "/usr/local/Cellar/python3/3.4.3/Frameworks/Python.framework/Versions/3.4/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte

The documents I am working with are MS Office (.docx) that have been converted to text by Apache Tika 1.9 (https://tika.apache.org/)

One solution to the above problem is to change line 911 in filter_terms.py from

instream = open(infile)

to

instream = open(infile,encoding='utf-8',errors='ignore')

But I'm not sure this is the optimal solution.
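To weigh the options, it helps to see what each error handler does to the same malformed input. The sketch below uses a made-up byte string (not one of the Tika-converted documents) and compares the default strict mode with errors='ignore' and errors='replace'; the variable names are illustrative only.

```python
# Hypothetical sample: a stray 0xC2 lead byte that is not followed by a
# valid continuation byte, so the data is not well-formed UTF-8.
raw = b"\xc2abc"

ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD

try:
    raw.decode("utf-8")  # default is errors="strict"
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

print(repr(ignored))
print(repr(replaced))
print(strict_error)
```

With 'ignore' the damage disappears silently, while 'replace' leaves a visible marker in the text that later steps can detect.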

AdamMeyers commented 9 years ago

OK,

I have another fix, and at the same time a fix of a different problem.

Please let me know if this fixes it. It seems that the 3.0 version of the Porter stemmer corrupts non-ASCII characters, so later components designed to deal with UTF-8 have trouble with them. This version therefore simply replaces non-ASCII characters with spaces. Given the filters already in place, I don't think this will be a bad thing.
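Replacing non-ASCII characters with spaces can be sketched in a couple of lines. This is an illustration under assumptions, not the committed code: NON_ASCII and blank_non_ascii are hypothetical names, and the sample string is made up.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7f]")


def blank_non_ascii(text):
    """Replace each non-ASCII character with one space."""
    return NON_ASCII.sub(" ", text)


sample = "caf\u00e9 menu"
cleaned = blank_non_ascii(sample)
print(repr(cleaned))
```

Substituting one space per character keeps the string length unchanged, so any character offsets computed earlier still line up.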

In any event, please let me know if this does or does not let you finish your run. If it does not fix it, I would like to see some input files or even intermediate files (like the .all_terms file generated by the program).

Thanks

Best,

Adam Meyers


craigpfeifer commented 9 years ago

Okay, what is the fix? I don't see anything in the text of your comment, and I don't see anything in the commit log?

AdamMeyers commented 9 years ago

Oh, sorry. Once again, I forgot that GitHub does not allow one to attach files. I committed the 2 changed files.

AdamMeyers commented 9 years ago

OK, I committed a new filter_terms.py file, using

instream = open(infile,errors='replace')

instead of

instream = open(infile)

Perhaps this will correct the problem, but if not, copies of the problematic input would help.

The default encoding here is already UTF-8 (open() uses the locale's preferred encoding when none is given, and the traceback shows the utf-8 codec), so it did not need to be included in the call. Also, rather than ignoring UTF-8 decoding errors, it seemed better to replace each undecodable byte with a replacement character, which will cause other parts of the code to filter that word out anyway.
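The effect of errors='replace' on a file read can be checked end to end with a throwaway file. In this sketch the file contents, REPLACEMENT, and the final filtering step are assumptions made for illustration (the real filters in filter_terms.py are not shown here), and encoding='utf-8' is passed explicitly only so that the example does not depend on the machine's locale.

```python
import os
import tempfile

REPLACEMENT = "\ufffd"  # U+FFFD, inserted for each undecodable byte

# Write a small file containing one line with a stray 0xC2 byte.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as handle:
    handle.write(b"good term\nbad\xc2 term\n")
    path = handle.name

try:
    with open(path, encoding="utf-8", errors="replace") as instream:
        lines = instream.readlines()
finally:
    os.remove(path)

# Hypothetical downstream filter: drop any line that carries the marker.
kept = [line.strip() for line in lines if REPLACEMENT not in line]
print(kept)
```

Because the marker survives in the decoded text, a later filter can reject the damaged term explicitly instead of accepting a silently shortened one.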

craigpfeifer commented 9 years ago

Agreed, this does solve my issue! Thanks!