Closed wkloucek closed 1 year ago
cc @micbar I encountered this in a project where a user seems to have a file name containing the ◌̈
character :laughing:
Interesting, we got a similar case with a Activitės
in the filename (not the usual ` accent), and with a similar traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/werkzeug/serving.py", line 324, in run_wsgi
execute(self.server.app)
File "/usr/local/lib/python3.6/site-packages/werkzeug/serving.py", line 316, in execute
write(data)
File "/usr/local/lib/python3.6/site-packages/werkzeug/serving.py", line 277, in write
self.send_header(key, value)
File "/usr/lib64/python3.6/http/server.py", line 515, in send_header
("%s: %s\r\n" % (keyword, value)).encode('latin-1', 'strict'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0117' in position 59: ordinal not in range(256)
To be noted that the exception is thrown within the application server (Flask or Waitress), before the wopiserver can do anything about it, and it is independent on the storage interface.
Not sure how to address this...
We tracked it down to MacOS users: https://www.reddit.com/r/MacOS/comments/jhjv41/psa_beware_of_umlauts_and_other_accented/
Good finding, but how to mitigate that on our servers?
Actually your log seems to be more helpful then mine:
File "/usr/lib64/python3.6/http/server.py", line 515, in send_header
("%s: %s\r\n" % (keyword, value)).encode('latin-1', 'strict'))
Could be this line: https://github.com/cs3org/wopiserver/blob/295f5194f1b23c84056978d7ee0ac12a6d7fca6b/src/core/wopi.py#L150
EDIT: at least for oCIS, the filename
in the token looks like this:
"filename": "0f3753e8-40a3-48b8-98e0-3ab50975d4eb/◌̈.docx",
Could be this line:
How do you determine that it's that line from a stack trace that does not include the getFile()
function?
Exception while serving /wopi/iop/download
(see my log in the issue text)
calls getFile()
, which is defined here:
note, that it includes the line
OK, so actually your log is more useful - I didn't see the /wopi/iop/download
endpoint in my logs!
Then yes, it's likely the place that triggers the error, though still in a way that does not expose the stack trace. I can see if there's anything to "sanitize" a filename, but I'm afraid that any manipulation of such an invalid Unicode string would trigger this exception anyway.
According to https://datatracker.ietf.org/doc/html/rfc7230#section-3.2 headers need to be VCHAR, which is defined as any visible USASCII character. But there also seems an option to have an urlencoded value in there (more about that later).
When looking at the code and WOPI specification I wonder, why /wopi/iop/download
even gets called, because the WOPI protocol only does know about /wopi/files/(file_id)/contents
!?
Anyways... Both routes call the getFile method that returns the problematic header.
Also https://learn.microsoft.com/en-us/microsoft-365/cloud-storage-partner-program/rest/files/getfile#getfile does not specify a Content-Disposition
header, do you know where it comes from? I guess because the /wopi/iop/download
is supposed to allow users to download that file via the WOPI server and it gives the file a nice name?
If it's really not used / specified by the WOPI protocol, we could use URL encoding for the Content-Disposition
header because all major browsers seem to support it: https://stackoverflow.com/questions/2543584/unicode-in-content-disposition-header
Thanks for this investigation. Concerning why /wopi/iop/download
even gets called, the answer is in https://github.com/cs3org/wopiserver/blob/master/src/core/wopi.py#L77-L78
And yes, it looks like url-encoding the header may fix it. I'll give it a try.