WheatonCS / Lexos

Python/Flask-based website for text analysis workflow. Previous (stable) release is live at:
http://lexos.wheatoncollege.edu
MIT License

Force MultiCloud file uploads to UTF-8 #498

Closed scottkleinman closed 7 years ago

scottkleinman commented 7 years ago

For topic clouds, a word-topic-counts.txt file that is not encoded in UTF-8 will cause encoding errors when MultiCloud tries to send a JSON string back to the client. The workaround is to upload the file in the normal file uploader, download it from Lexos, and re-upload it in MultiCloud. That seems to indicate that we simply skipped forcing the file to UTF-8 in the MultiCloud upload, and it should be easy to fix.
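This is an illustration of the failure class described above, not Lexos code: bytes in a non-UTF-8 encoding (Latin-1 here) blow up when decoded as UTF-8, which is what surfaces once the JSON response is built. Re-encoding to UTF-8 first, as the normal uploader does, avoids it.

```python
# Latin-1 bytes for 'café'; 0xe9 is not valid UTF-8 on its own.
latin1_bytes = u'café'.encode('latin-1')   # b'caf\xe9'

try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError:
    print('decode failed')                  # this branch is taken

# Decode with the correct encoding, then re-encode as UTF-8.
utf8_bytes = latin1_bytes.decode('latin-1').encode('utf-8')
assert utf8_bytes.decode('utf-8') == u'café'
```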

czhang03 commented 7 years ago

Or can we handle decoding in the back end? Are you working with the py3 branch?

scottkleinman commented 7 years ago

Upon consideration, I think this is a bug in Lexos 3.0 that should be fixed in the master branch. As for the Python 3 version, I'm not sure what happens when you upload a file that is not in UTF-8.

But maybe we can decode right before the response is sent to the client. I'll try that when I get a chance.

czhang03 commented 7 years ago

I implemented a function to deal with request.files encoding, see this

Maybe this will help?

scottkleinman commented 7 years ago

Yes, I wondered if that was relevant. Or we may just need to call the chardet routine that we use for normal file uploads. But I haven't looked at that code in a very long time...

czhang03 commented 7 years ago

This is actually the "chardet routine" you are talking about.

I just wrapped it into a function.

scottkleinman commented 7 years ago

OK, then we just have to make sure the Multicloud file upload calls that routine.

czhang03 commented 7 years ago

See this: https://github.com/WheatonCS/Lexos/blob/py3/managers/file_manager.py#L286

It is the same function; I just took it out and put it in general_functions, since it is useful for handling request.files.

czhang03 commented 7 years ago

But if you are going to fix it on master, then I suggest using a temporary fix (like copying and pasting the routine there) and recording that in the roadmap. Otherwise it may cause merge conflicts when master and the py3 branch are merged later.

scottkleinman commented 7 years ago

Yes, I had thought of that.

scottkleinman commented 7 years ago

@chantisnake, I'm having trouble injecting your fix. From what I can tell, the uploaded file is saved at https://github.com/WheatonCS/Lexos/blob/master/managers/utility.py#L957. But I'm not sure whether chardet gets called in the save() or whether I need to add a line before this one to do the encoding change if a file is not UTF-8. Can you have a look at it?

czhang03 commented 7 years ago

Hmm, this code looks like something stupid I wrote in my youth... (not that stupid, it is "okay" stupid)

The code saves the file and then reads the content back out, but the content can be read directly from request.files.

I will investigate this for a while and get back to you soon, because my girlfriend needs to use my workstation to watch a movie :(
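The point above, that the save-then-read round trip is unnecessary, can be sketched as follows. Flask's FileStorage objects are file-like, so the uploaded bytes can be read directly; io.BytesIO stands in for request.files['optuploadname'] here so the sketch runs without Flask.

```python
import io

# Stand-in for request.files['optuploadname']: Flask's FileStorage wraps a
# file-like stream, so .read() returns the uploaded bytes without touching
# the disk at all.
upload = io.BytesIO(b'#doc source pos typeindex type topic\n')

raw = upload.read()            # instead of .save(contentPath) + open()
content = raw.decode('utf-8')  # decode once the encoding is known
assert content.startswith('#doc source pos typeindex type topic')
```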

scottkleinman commented 7 years ago

The little things that stand in the way of progress, forever preserved on GitHub....

czhang03 commented 7 years ago

I have gone back and forth on the solution. I think the most appropriate approach is to get the encoding first and then open the file with that encoding.

The code would look like this. Change:

if topicString != '':
    request.files['optuploadname'].save(contentPath)
with open(contentPath, 'r') as f:
    content = f.read()  # reads content from the upload
if content.startswith('#doc source pos typeindex type topic'):
    # --- begin converting a Mallet file into the file d3 can understand ---
    tuples = []
    # Read the output_state file
    with open(contentPath) as f:
        # Skip the first three lines

to

if topicString != '':
    request.files['optuploadname'].save(contentPath)
    content = request.files['optuploadname']
    file_encoding = get_encoding_using_chardet(content)
with codecs.open(contentPath, 'r', encoding=file_encoding) as f:
    content = f.read()  # reads content from the upload
if content.startswith('#doc source pos typeindex type topic'):
    # --- begin converting a Mallet file into the file d3 can understand ---
    tuples = []
    # Read the output_state file
    with codecs.open(contentPath, encoding=file_encoding) as f:
        # Skip the first three lines

There is probably some minor error in this code, because I am not able to test it.

This is still not optimized, so add a TODO or note it in the roadmap.

scottkleinman commented 7 years ago

get_encoding_using_chardet() does not seem to be a pre-existing function. The way we do it in filemanager.py is

encodingDetect = chardet.detect(File[:constants.MIN_ENCODING_DETECT])
encodingType = encodingDetect['encoding']
fileString = File.decode(encodingType)

I think the last line might actually have to be fileString = File.decode(encodingType).encode('utf-8') for this scenario.

It's worth considering whether we should also replace \r and \r\n with \n, as we do for normal file uploads.
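For reference, that normalization amounts to something like the sketch below (normalize_line_endings is a hypothetical name for illustration, not an existing Lexos function):

```python
# Convert Windows (\r\n) and old Mac (\r) line endings to Unix (\n).
# Order matters: replace \r\n first so the lone-\r pass doesn't turn
# each \r\n into two newlines.
def normalize_line_endings(text):
    return text.replace('\r\n', '\n').replace('\r', '\n')

assert normalize_line_endings('a\r\nb\rc\n') == 'a\nb\nc\n'
```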

czhang03 commented 7 years ago

Yes, that is not a pre-existing function; just copy and paste the code from filemanager.

I don't think the line endings are important, because we use readline to get the content.

scottkleinman commented 7 years ago

This works:

if topicString != '':
    request.files['optuploadname'].save(contentPath)
    with open(contentPath, 'r') as f:
        content = f.read()  # reads content from the upload file
        # Coerce to UTF-8, if necessary
        import chardet
        encodingDetect = chardet.detect(content)
        encodingType = encodingDetect['encoding']
        if encodingType != 'utf-8':
            content = content.decode(encodingType).encode('utf-8')

But importing chardet here is pretty ugly. We probably want to create a function that returns the encoding type with chardet in general_functions.py.
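A sketch of such a helper, under stated assumptions: the real version would call chardet.detect on the first MIN_ENCODING_DETECT bytes, but since chardet is a third-party dependency, this stand-in detects by trial decoding, which is cruder but self-contained. The names get_encoding and to_utf8 are illustrative, not existing Lexos functions.

```python
# Stand-in for a chardet-based helper in general_functions.py.
# The real routine would be roughly:
#     chardet.detect(raw_bytes[:constants.MIN_ENCODING_DETECT])['encoding']
def get_encoding(raw_bytes):
    # Try encodings in order; latin-1 maps every byte, so it always succeeds
    # and acts as the catch-all fallback.
    for candidate in ('utf-8', 'latin-1'):
        try:
            raw_bytes.decode(candidate)
            return candidate
        except UnicodeDecodeError:
            continue

def to_utf8(raw_bytes):
    # Coerce the input bytes to UTF-8, whatever they started as.
    return raw_bytes.decode(get_encoding(raw_bytes)).encode('utf-8')
```

Keeping the detection in one function also means the MultiCloud upload and the normal uploader cannot drift apart in how they handle encodings.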

czhang03 commented 7 years ago

Why not add MIN_ENCODING_DETECT?

scottkleinman commented 7 years ago

We could do that. I couldn't decide whether it was a good idea for MALLET data, so I left it off for simplicity. Since I didn't work on the chardet implementation, I only have a vague idea of the background, but I seem to recall that limiting detection to the first N characters sometimes failed to catch non-UTF-8 characters later in the file. Maybe I didn't quite get what the issue was. If you think it's safe to add MIN_ENCODING_DETECT, or even to use a value different from the constant, that's easy to add.

czhang03 commented 7 years ago

How large is a MALLET file? Since chardet is slow, we may need a little testing and profiling.

scottkleinman commented 7 years ago

Let's put that question on hold. I am working with another group that will be generating a lot of MALLET files in the next month or so. I'll collect them and try to work out an average profile.

scottkleinman commented 7 years ago

I've pushed the fix with the MIN_ENCODING_DETECT constant. We can always remove it. To call the function, use encoding = general_functions.get_encoding(str). I have also made note of where any new commits will have to be merged into the Python 3 branch.