cs3org / wopiserver

A vendor-neutral application gateway compatible with the WOPI specifications.
Apache License 2.0
52 stars 27 forks source link

"special" file names with extended Unicode characters can lead to traceback #124

Closed wkloucek closed 1 year ago

wkloucek commented 1 year ago
wopi-wopiserver-7c79c74b7c-6vksp wopiserver {"time": "2023-06-23T15:51:03.816", "host": "wopi-wopiserver-7c79c74b7c-6vksp", "level": "ERROR", "process": "waitress", Exception while serving /wopi/iop/download}
wopi-wopiserver-7c79c74b7c-6vksp wopiserver Traceback (most recent call last):
wopi-wopiserver-7c79c74b7c-6vksp wopiserver   File "/usr/local/lib/python3.11/site-packages/waitress/channel.py", line 428, in service
wopi-wopiserver-7c79c74b7c-6vksp wopiserver     task.service()
wopi-wopiserver-7c79c74b7c-6vksp wopiserver   File "/usr/local/lib/python3.11/site-packages/waitress/task.py", line 168, in service
wopi-wopiserver-7c79c74b7c-6vksp wopiserver     self.execute()
wopi-wopiserver-7c79c74b7c-6vksp wopiserver   File "/usr/local/lib/python3.11/site-packages/waitress/task.py", line 472, in execute
wopi-wopiserver-7c79c74b7c-6vksp wopiserver     self.write(chunk)
wopi-wopiserver-7c79c74b7c-6vksp wopiserver   File "/usr/local/lib/python3.11/site-packages/waitress/task.py", line 306, in write
wopi-wopiserver-7c79c74b7c-6vksp wopiserver     rh = self.build_response_header()
wopi-wopiserver-7c79c74b7c-6vksp wopiserver          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
wopi-wopiserver-7c79c74b7c-6vksp wopiserver   File "/usr/local/lib/python3.11/site-packages/waitress/task.py", line 279, in build_response_header
wopi-wopiserver-7c79c74b7c-6vksp wopiserver     return res.encode("latin-1")
wopi-wopiserver-7c79c74b7c-6vksp wopiserver            ^^^^^^^^^^^^^^^^^^^^^
wopi-wopiserver-7c79c74b7c-6vksp wopiserver UnicodeEncodeError: 'latin-1' codec can't encode character '\u0308' in position 80: ordinal not in range(256)
wkloucek commented 1 year ago

cc @micbar I encountered this in a project where a user seems to have a file name containing the ◌̈character :laughing:

glpatcern commented 1 year ago

Interesting, we got a similar case with a Activitės in the filename (not the usual ` accent), and with a similar traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/werkzeug/serving.py", line 324, in run_wsgi
    execute(self.server.app)
  File "/usr/local/lib/python3.6/site-packages/werkzeug/serving.py", line 316, in execute
    write(data)
  File "/usr/local/lib/python3.6/site-packages/werkzeug/serving.py", line 277, in write
    self.send_header(key, value)
  File "/usr/lib64/python3.6/http/server.py", line 515, in send_header
    ("%s: %s\r\n" % (keyword, value)).encode('latin-1', 'strict'))
UnicodeEncodeError: 'latin-1' codec can't encode character '\u0117' in position 59: ordinal not in range(256)

To be noted that the exception is thrown within the application server (Flask or Waitress), before the wopiserver can do anything about it, and it is independent on the storage interface.

Not sure how to address this...

wkloucek commented 1 year ago

We tracked it down to MacOS users: https://www.reddit.com/r/MacOS/comments/jhjv41/psa_beware_of_umlauts_and_other_accented/

glpatcern commented 1 year ago

Good finding, but how to mitigate that on our servers?

wkloucek commented 1 year ago

Actually your log seems to be more helpful then mine:

  File "/usr/lib64/python3.6/http/server.py", line 515, in send_header
    ("%s: %s\r\n" % (keyword, value)).encode('latin-1', 'strict'))

Could be this line: https://github.com/cs3org/wopiserver/blob/295f5194f1b23c84056978d7ee0ac12a6d7fca6b/src/core/wopi.py#L150

EDIT: at least for oCIS, the filename in the token looks like this:

"filename": "0f3753e8-40a3-48b8-98e0-3ab50975d4eb/◌̈.docx",
glpatcern commented 1 year ago

Could be this line:

How do you determine that it's that line from a stack trace that does not include the getFile() function?

wkloucek commented 1 year ago

Exception while serving /wopi/iop/download (see my log in the issue text)

https://github.com/cs3org/wopiserver/blob/295f5194f1b23c84056978d7ee0ac12a6d7fca6b/src/wopiserver.py#L370-L385

calls getFile(), which is defined here:

https://github.com/cs3org/wopiserver/blob/295f5194f1b23c84056978d7ee0ac12a6d7fca6b/src/core/wopi.py#L133-L162

note, that it includes the line

https://github.com/cs3org/wopiserver/blob/295f5194f1b23c84056978d7ee0ac12a6d7fca6b/src/core/wopi.py#L150

glpatcern commented 1 year ago

OK, so actually your log is more useful - I didn't see the /wopi/iop/download endpoint in my logs!

Then yes, it's likely the place that triggers the error, though still in a way that does not expose the stack trace. I can see if there's anything to "sanitize" a filename, but I'm afraid that any manipulation of such an invalid Unicode string would trigger this exception anyway.

wkloucek commented 1 year ago

According to https://datatracker.ietf.org/doc/html/rfc7230#section-3.2 headers need to be VCHAR, which is defined as any visible USASCII character. But there also seems an option to have an urlencoded value in there (more about that later).

When looking at the code and WOPI specification I wonder, why /wopi/iop/download even gets called, because the WOPI protocol only does know about /wopi/files/(file_id)/contents!?

Anyways... Both routes call the getFile method that returns the problematic header.

Also https://learn.microsoft.com/en-us/microsoft-365/cloud-storage-partner-program/rest/files/getfile#getfile does not specify a Content-Disposition header, do you know where it comes from? I guess because the /wopi/iop/download is supposed to allow users to download that file via the WOPI server and it gives the file a nice name?

If it's really not used / specified by the WOPI protocol, we could use URL encoding for the Content-Disposition header because all major browsers seem to support it: https://stackoverflow.com/questions/2543584/unicode-in-content-disposition-header

glpatcern commented 1 year ago

Thanks for this investigation. Concerning why /wopi/iop/download even gets called, the answer is in https://github.com/cs3org/wopiserver/blob/master/src/core/wopi.py#L77-L78

And yes, it looks like url-encoding the header may fix it. I'll give it a try.