Closed: scottkleinman closed this issue 7 years ago.
Or, can we handle decoding in the back end? Are you working with the py3 branch?
Upon consideration, I think this is a bug in Lexos 3.0 that should be fixed in the master branch. As for the Python 3 version, I'm not sure what happens when you upload a file that is not in UTF-8.
But maybe we can decode right before the response is sent to the client. I'll try that when I get a chance.
I implemented a function to deal with request.file encoding; see this. Maybe it will help?
Yes, I wondered if that was relevant. Or we may just need to call the chardet routine that we use for normal file uploads. But I haven't looked at that code in a very long time...
this is actually the "chardet routine" that you are talking about.
I just wrapped that into a function.
OK, then we just have to make sure the Multicloud file upload calls that routine.
See this: https://github.com/WheatonCS/Lexos/blob/py3/managers/file_manager.py#L286
It is the same function; I just took it out and put it in general_functions, since it is useful in handling request.file.
But if you're going to fix it on master, then I suggest using a temp fix (like copying and pasting the routine there) and recording that in the roadmap. Otherwise it may cause merge conflicts when master and py3 are merged later.
Yes, I had thought of that.
@chantisnake, I'm having trouble injecting your fix. From what I can tell, the uploaded file is saved at https://github.com/WheatonCS/Lexos/blob/master/managers/utility.py#L957, but I'm not sure whether chardet gets called in save(), or whether I need to add a line before this one to do the encoding change if a file is not UTF-8. Can you have a look at it?
Hum, this code looks like something stupid I wrote in my youth... (not that stupid, it is "okay" stupid).
The code saves the file and then reads the content back out, but the content can be read directly from request.file.
I will investigate this for a while and get back to you soon, because my girlfriend needs to use my workstation to watch a movie :(
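To illustrate the point about reading directly rather than saving first, here is a minimal standard-library sketch. `io.BytesIO` stands in for the file object Flask exposes in `request.files` (the field name and MALLET header line are the ones used in this thread; the sample data row is made up):

```python
from io import BytesIO

# BytesIO stands in for request.files['optuploadname'] -- both expose .read().
upload = BytesIO(b'#doc source pos typeindex type topic\n0 file.txt 0 1 2 3\n')

# Read the uploaded bytes directly; no need to save to disk and re-open.
content = upload.read()
print(content.startswith(b'#doc source pos typeindex type topic'))  # True
```

The same `.read()` call works on Flask's upload objects, which is why the save-then-reopen round trip is avoidable.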
The little things that stand in the way of progress, forever preserved on GitHub....
I've gone back and forth on the solution. I think the most appropriate approach is to get the encoding first and then open the file with that encoding.
The code will look like this. Change:

    if topicString != '':
        request.files['optuploadname'].save(contentPath)
        with open(contentPath, 'r') as f:
            content = f.read()  # reads content from the upload
        if content.startswith('#doc source pos typeindex type topic'):
            # --- begin converting a Mallet file into the file d3 can understand ---
            tuples = []
            # Read the output_state file
            with open(contentPath) as f:
                # Skip the first three lines

to:
    if topicString != '':
        request.files['optuploadname'].save(contentPath)
        content = request.files['optuploadname']
        file_encoding = get_encoding_using_chardet(content)
        with codecs.open(contentPath, 'r', encoding=file_encoding) as f:
            content = f.read()  # reads content from the upload
        if content.startswith('#doc source pos typeindex type topic'):
            # --- begin converting a Mallet file into the file d3 can understand ---
            tuples = []
            # Read the output_state file
            with codecs.open(contentPath, encoding=file_encoding) as f:
                # Skip the first three lines
There is probably some minor error in this code, because I am not able to test it. This is still not optimized, so add a TODO or put it in the roadmap.
get_encoding_using_chardet() does not seem to be a pre-existing function. The way we do it in filemanager.py is:

    encodingDetect = chardet.detect(File[:constants.MIN_ENCODING_DETECT])
    encodingType = encodingDetect['encoding']
    fileString = File.decode(encodingType)

I think the last line might actually have to be fileString = File.decode(encodingType).encode('utf-8') for this scenario.
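The decode-then-re-encode step can be sketched with a hardcoded encoding. In Lexos the encoding would come from chardet; `'latin-1'` below is just a stand-in for whatever chardet detects:

```python
# Simulate a non-UTF-8 upload: 'café' encoded as Latin-1 bytes (b'caf\xe9').
File = 'café'.encode('latin-1')

# Stand-in for chardet's detection result.
encodingType = 'latin-1'

# Decode with the detected encoding, then re-encode as UTF-8.
fileString = File.decode(encodingType).encode('utf-8')
print(fileString == 'café'.encode('utf-8'))  # True
```

The round trip through `str` is what coerces arbitrary single-byte encodings into UTF-8 bytes.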
It's worth considering whether we should also replace \r and \r\n with \n, as we do for normal file uploads.
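That normalization amounts to two `replace` calls; the order matters, since `\r\n` must be handled before bare `\r` or each Windows ending would turn into two newlines:

```python
content = 'line1\r\nline2\rline3\n'

# Replace Windows endings first, then lone carriage returns.
normalized = content.replace('\r\n', '\n').replace('\r', '\n')
print(normalized == 'line1\nline2\nline3\n')  # True
```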
Yes, that is not a pre-existing function; just copy and paste the code from filemanager.
I don't think the line endings are important, because we use readline to get the content.
This works:
    if topicString != '':
        request.files['optuploadname'].save(contentPath)
        with open(contentPath, 'r') as f:
            content = f.read()  # reads content from the upload file
        # Coerce to UTF-8, if necessary
        import chardet
        encodingDetect = chardet.detect(content)
        encodingType = encodingDetect['encoding']
        if encodingType != 'utf-8':
            content = content.decode(encodingType).encode('utf-8')
But importing chardet here is pretty ugly. We probably want to create a function in general_functions.py that returns the encoding type with chardet.
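A minimal sketch of such a helper, assuming the chardet package is installed (the exact name and signature here are hypothetical; a `get_encoding` function is what was eventually mentioned later in this thread):

```python
import chardet


def get_encoding(input_bytes):
    """Return the most likely encoding of a byte string, per chardet.

    Hypothetical helper for general_functions.py; no length cap applied here.
    """
    detect_result = chardet.detect(input_bytes)
    return detect_result['encoding']


# Decoding with the detected encoding should round-trip the original text.
raw = b'plain ascii text'
print(raw.decode(get_encoding(raw)))
```

Capping detection at `constants.MIN_ENCODING_DETECT` bytes, as filemanager.py does, would be a one-line change to the `detect` call.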
Why not add MIN_ENCODING_DETECT?
We could do that. I couldn't decide whether it was a good idea for MALLET data, so I left it off for simplicity. Since I didn't work on the chardet implementation, I only have a vague idea of the background, but I seem to recall that limiting detection to the first N characters sometimes failed to catch non-UTF-8-encoded characters later in the file. Maybe I didn't quite get what the issue was. If you think it's safe to add MIN_ENCODING_DETECT, or even to use a value different from the constant, that's easy to do.
How large is a MALLET file? Since chardet is slow, we may need a little testing and profiling.
Let's put that question on hold. I am working with another group that will be generating a lot of MALLET files in the next month or so. I'll collect them and try to work out an average profile.
I've pushed the fix with the MIN_ENCODING_DETECT constant. We can always remove it. To call the function, use encoding = general_functions.get_encoding(str). I have also made a note of where any new commits will have to be merged into the Python 3 branch.
For topic clouds, a word-topic-counts.txt file that is not encoded in UTF-8 will cause encoding errors when MultiCloud tries to send a JSON string back to the client. The workaround is to upload the file in the normal file uploader, then download it from Lexos and re-upload it in MultiCloud. That seems to indicate that we simply skipped forcing the file to UTF-8 during the MultiCloud upload, so it should be easy to fix.