Closed oz123 closed 3 years ago
A (malicious) client can send arbitrary bytes and raw_filename
may including dots, slashes, null-bytes or control characters. Using it directly to create a file on disk is dangerous. Thus, the warning and the much safer alternative. I find the description appropriate.
Thanks for the quick response. The current filename property is very aggressive with it's security. Consider a filename which is not dangerous, simply with Hebrew characters:
In [5]: normalize('NFKD', filename)
Out[5]: '_איני יכול עוד להרגיש הזדהות עם הקולקטיב הזה. תמחקו אותי_ - סוף שבוע - הארץ.pdf'
In [6]: fname = filename
In [7]: fname.encode('ASCII', 'ignore').decode('ASCII')
Out[7]: '_ . _ - - .pdf'
This completely remove all non ascii letters. Also, German umlauts are mercilessly removed:
In [8]: fname = "Steuererklärung.pdf"
In [9]: fname.encode('ASCII', 'ignore').decode('ASCII')
Out[9]: 'Steuererklrung.pdf'
How should I process file names with non ascii chars? Shall I always assume that these files are malicious? Would it not make sense to remove only slashes, null-bytes or control characters.
?
Just a thought but what about seeing it can be safely encoded as UTF8? Then it's considered safe?
You can encode /etc/passwd
or some<script src="">file.txt
or control characters as utf-8 just fine. Alowing utf-8 means allowing all of unicode, including strange whitespace, controls and emojis.
The idea is that filename
returns something absolutely safe, that can be used as a temp file name without any second thought. It is lossy, yes, but that's documented and intended.
If you want to store files with user-defined filenames, you need to think about acceptable names anyway and be very careful about the characters to allow. Just an example: by changing the script direction (\u200e
and \u200f
) you an create a file that visually ends in .pdf
but actually ends in .exe
. There are a whole group of security issues with allowing (malicious) users to name files on a server.
Bottle takes the as-safe-as-possible approach. If you need more, and know what you are doing, you can always use raw_filename
and sanitise it yourself.
I really appreciate how mindful you are to this issue.
Is the following a good approximation for removing those malicious chars you mentioned?
Also, I have not benchmarked this, but it might be faster than the current usage of re
....
In [33]: s
Out[33]: 'this\x00isat\x00hes\n\r \\d'
In [34]: s.translate(str.maketrans({'\x00': "", "\n": "", "\r": "", " ": "", "\\": ""}))
Out[34]: 'thisisathesd'
This of course does not solve the issue you pointed out with disguising exe
files as pdf
.
Blacklists do not work. Define a set of allowed characters and strip anything else.
For example, if similar looking character attacks are not an issue, and if you know that your file-system supports it, you could allow specific unciode categories (https://www.compart.com/en/unicode/category). That way control characters would be removed, but non-ascii letters and digits are still allowed. No idea how to implement that in Python < 3.8 though.
Once again, thanks for the detailed reply. There are a lot of issues where you gave such wonderful answers. I think it's worth sieving through past issues and add these pearls to a section in the documentation. Would you accept such PRs?
The docs for FileUploader say:
This may eventually deserve one more line explanation. Unsafe in what sense? Security? I found out that if files are uploaded with Unicode chars in them, then that property will have the correct file name, whereas
filename
completely removed all the non-ascii letters...