bottlepy / bottle

bottle.py is a fast and simple micro-framework for python web-applications.
http://bottlepy.org/
MIT License
8.4k stars 1.46k forks source link

FileUploader: What does "unsafe charachters" mean? #1285

Closed oz123 closed 3 years ago

oz123 commented 3 years ago

The docs for FileUploader say:

Raw filename as sent by the client (may contain unsafe characters)

This may eventually deserve one more line explanation. Unsafe in what sense? Security? I found out that if files are uploaded with Unicode chars in them, then that property will have the correct file name, whereas filename completely removed all the non-ascii letters...

defnull commented 3 years ago

A (malicious) client can send arbitrary bytes and raw_filename may including dots, slashes, null-bytes or control characters. Using it directly to create a file on disk is dangerous. Thus, the warning and the much safer alternative. I find the description appropriate.

oz123 commented 3 years ago

Thanks for the quick response. The current filename property is very aggressive with it's security. Consider a filename which is not dangerous, simply with Hebrew characters:

In [5]:  normalize('NFKD', filename)
Out[5]: '_איני יכול עוד להרגיש הזדהות עם הקולקטיב הזה. תמחקו אותי_ - סוף שבוע - הארץ.pdf'

In [6]: fname = filename

In [7]:  fname.encode('ASCII', 'ignore').decode('ASCII')
Out[7]: '_       .  _ -   - .pdf'

This completely remove all non ascii letters. Also, German umlauts are mercilessly removed:

In [8]: fname = "Steuererklärung.pdf"

In [9]:  fname.encode('ASCII', 'ignore').decode('ASCII')
Out[9]: 'Steuererklrung.pdf'

How should I process file names with non ascii chars? Shall I always assume that these files are malicious? Would it not make sense to remove only slashes, null-bytes or control characters.?

Jwink3101 commented 3 years ago

Just a thought but what about seeing it can be safely encoded as UTF8? Then it's considered safe?

defnull commented 3 years ago

You can encode /etc/passwd or some<script src="">file.txt or control characters as utf-8 just fine. Alowing utf-8 means allowing all of unicode, including strange whitespace, controls and emojis.

defnull commented 3 years ago

The idea is that filename returns something absolutely safe, that can be used as a temp file name without any second thought. It is lossy, yes, but that's documented and intended.

If you want to store files with user-defined filenames, you need to think about acceptable names anyway and be very careful about the characters to allow. Just an example: by changing the script direction (\u200e and \u200f) you an create a file that visually ends in .pdf but actually ends in .exe. There are a whole group of security issues with allowing (malicious) users to name files on a server.

Bottle takes the as-safe-as-possible approach. If you need more, and know what you are doing, you can always use raw_filename and sanitise it yourself.

oz123 commented 3 years ago

I really appreciate how mindful you are to this issue. Is the following a good approximation for removing those malicious chars you mentioned? Also, I have not benchmarked this, but it might be faster than the current usage of re....

In [33]: s
Out[33]: 'this\x00isat\x00hes\n\r    \\d'

In [34]: s.translate(str.maketrans({'\x00': "", "\n": "", "\r": "", " ": "", "\\": ""}))
Out[34]: 'thisisathesd'

This of course does not solve the issue you pointed out with disguising exe files as pdf.

defnull commented 3 years ago

Blacklists do not work. Define a set of allowed characters and strip anything else.

For example, if similar looking character attacks are not an issue, and if you know that your file-system supports it, you could allow specific unciode categories (https://www.compart.com/en/unicode/category). That way control characters would be removed, but non-ascii letters and digits are still allowed. No idea how to implement that in Python < 3.8 though.

oz123 commented 3 years ago

Once again, thanks for the detailed reply. There are a lot of issues where you gave such wonderful answers. I think it's worth sieving through past issues and add these pearls to a section in the documentation. Would you accept such PRs?