Prevent user from using illegal characters in output filename

bnewc commented 3 weeks ago

This should resolve issue #362 by validating that output filenames contain no illegal characters.

I've used the Python re module for validation, and created IllegalOutputFilenameException to be raised in the case of an invalid filename. I also updated test_document.py to include a unit test for this new exception type.

Filenames are checked against a universal set of illegal characters: <>:\"|?*. I believe it's best practice to make the character set system-agnostic, though someone with more knowledge of the security implications could update the feature to derive the character set directly from the host system using sanitize_filename from the pathvalidate library.

My branch is open to edits if there are any changes that need to be made.

almet commented 2 weeks ago

Hi, and thanks for your work on this :-)

After a quick check on the python docs, I found that by default, sanitize_pathname targets all platforms, so we can probably let the library take care of the checks directly, and remove the regexp:

The default target platform is universal. i.e. the sanitized file name is valid for any platform.

Also, the CI is currently failing because of our linting rules. It asks to apply the following diff (replacing double quotes with simple ones):

--- dangerzone/errors.py    2024-06-12 02:21:28.668833+00:00
+++ dangerzone/errors.py    2024-06-12 02:21:57.251937+00:00
@@ -44,11 +44,11 @@

 class IllegalOutputFilenameException(DocumentFilenameException):
     """Exception for when the output file contains illegal characters."""

     def __init__(self) -> None:
-        super().__init__("Filename must not contain the following characters: <>:\"|?*")        
+        super().__init__('Filename must not contain the following characters: <>:"|?*')

In case it helps, you can reproduce the linter work by running this command locally:

poetry run make lint

almet commented 2 weeks ago

After a quick check on the python docs, I found that by default, sanitize_pathname targets all platforms, so we can probably let the library take care of the checks directly, and remove the regexp:

I just realized that pathvalidate is not in the standard python library, and this changes my thoughts.

I'm wondering if we should take the same approach that's the Django slugify() is taking, which seem to cope well with unicode chars by using unicodedata.normalize("NFKC", text).

bnewc commented 2 weeks ago

After making the linter's suggested changes, it still takes issue with that line, suggesting I change it to be... exactly the same?

--- /root/project/dangerzone/errors.py  2024-06-12 15:36:20.384450+00:00
+++ /root/project/dangerzone/errors.py  2024-06-12 15:36:50.243414+00:00
@@ -44,11 +44,11 @@

 class IllegalOutputFilenameException(DocumentFilenameException):
     """Exception for when the output file contains illegal characters."""

     def __init__(self) -> None:
-        super().__init__('Filename must not contain the following characters: <>:"|?*')        
+        super().__init__('Filename must not contain the following characters: <>:"|?*')

bnewc commented 2 weeks ago

I just realized that pathvalidate is not in the standard python library, and this changes my thoughts.

That's why I was also uncertain about using it :)

Could you clarify what you're thinking about the slugify() function? I'm not sure how checking if the text is Unicode or ASCII would catch illegal characters a user may enter, since realistically they'd be a subset of Unicode or ASCII.

bnewc commented 2 weeks ago

After syncing my branch with the main dev branch, two new test failures are occurring.

A conversion error in Bookworm for the sample-hwp.hwp test:

ERROR    dangerzone.isolation_provider.base:base.py:170 [doc z1s5Yj] 0% \ 
Some unexpected error occurred while converting the document

A Dangerzone build error in Focal:

Error: error building at STEP "RUN apt-get update     && apt-get install -y \
--no-install-recommends dh-python make build-essential git fakeroot libqt5gui5 \
libxcb-cursor0 pipx python3 python3-dev python3-venv python3-stdeb \
python3-all     && rm -rf /var/lib/apt/lists/*": error while running runtime: exit status 100

I believe this is unrelated to my additions, but I wanted to point it out.

almet commented 2 weeks ago

-        super().__init__('Filename must not contain the following characters: <>:"|?*')        
+        super().__init__('Filename must not contain the following characters: <>:"|?*')

This is actually fixing the extra space on the top line. You can run the black formatter locally with poetry run make black .

Could you clarify what you're thinking about the slugify() function? I'm not sure how checking if the text is Unicode or ASCII would catch illegal characters a user may enter, since realistically they'd be a subset of Unicode or ASCII.

My thinking was about taking a different way than the one you provided, and rather than displaying an error, replace the "unsafe" chars with safer ones.

After some more thinking, I believe the approach you provide here makes more sense, because some of the chars that we would consider "unsafe" can actually meaningful (think accents, for instance). Sorry for bothering you with this :roll_eyes:

Let me know when you've had the time to include the linting changes, and we should be good to go!

It's possible that the tests failures you're been seeing are now fixed in the main branch. If it's the case, rebasing on top of the latest main branch would make them go away (thanks for reporting)

bnewc commented 2 weeks ago

Currently, the commit is throwing an exception on the following test in Windows:

 filename = 'D:\\a\\dangerzone\\dangerzone\\something_else.pdf'

This indicates a challenge with the validation structure. validate_output_filename(), which I modified, checks not only the user extension but also the full filepath, which on Windows will include a colon and backslashes. On Linux it will include forward slashes. These are all illegal characters for filenames, but not for filepaths. Interestingly, on the Windows GUI build, if you include a backslash in your safe extension name, the document will successfully convert and saved under only the extension name in the selected directory. For instance, choosing the safe extension \safe for questionable.docx will output filepath save_dir\safe.pdf, rather than save_dir\questionable\safe.pdf.

For now, I can remove the colon from the illegal character set to make the code functional, and effectively warn users about other illegal characters, but this will not fully address issue #362 . We need to either use something like sanitize_filename, or restructure the validation process.

apyrgio commented 2 weeks ago

Hey @bnewc, thanks a lot for you work on this!

You know what, a quick fix for the above would be to get just the filename (e.g., with pathlib.Path.name), and perform your check for just that part. This way, you can keep the : and other characters (such as / and \) in the illegal character set.

I'm afraid though that even this check has its caveats. Linux filesystems allow the use of virtually every character, and users may already have such files. It would not be nice to error on these files, when they are actually perfectly acceptable. So, my suggestion would be to use a different illegal character set, per OS. For Windows, we can target NTFS, and for macOS we can target HFS+. You can read more here.

Finally, note that the main idea of #362 was to not let the user proceed, until the safe extension contains legal characters. So, the best way to tackle this would be to also add check for illegal chars in this method:

https://github.com/freedomofpress/dangerzone/blob/e81ecbc288d56f4f01d058b011e60156ca5d66e6/dangerzone/gui/main_window.py#L768-L780

In any case, the PR is in a good direction, so I would consider this part as a good-to-have, only if you have time to work on it.

apyrgio commented 1 week ago

Hi @bnewc. We're in the process of releasing 0.7.0, so I wanted to ask if you still have time to work on this issue. If not, that's totally fine, we can continue from where you left off and merge it.

Thanks again for all your work :slightly_smiling_face:

bnewc commented 1 week ago

I'll see what I can do!

bnewc commented 4 days ago

The GUI should now inform users of illegal characters in the output safe extension, and prevent conversion until the characters are removed.

However, the commit is still failing a number of checks. The linter check issue appears to be related to the Ubuntu Focal workaround for PySide2. I see how to fix that and will do so when I have a bit more time.

Additionally, it fails checks that expect an IllegalOutputFilenameException where none is raised. This is due to a discrepancy between the characters considered illegal by the test, and those considered illegal by the standards @apyrgio linked for Linux (where only \ is illegal). If someone would advise me on how to proceed, I can either make the tests match the HFS+ standard, or make the validation stricter to comply with the tests.

apyrgio commented 1 day ago

Hi @bnewc. Just writing here quick that we're currently in the middle of releasing Dangerzone 0.7.0, and I haven't found the time to look into the latest changes in your PR. Once I manage to find some time though, I'll definitely get back to this. Thanks for your patience :slightly_smiling_face:

freedomofpress / dangerzone

Prevent user from using illegal characters in output filename #834