freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.36k stars 153 forks source link

UnicodeEncodeError when printing a filename with invalid UTF-8 after conversion #768

Closed naglis closed 2 months ago

naglis commented 2 months ago

If a filename contains invalid UTF-8 sequences, dangerzone-cli fails to print the filename of the successfully converted/failed document due to an UnicodeEncodeError.

Steps to reproduce (use bash for ANSI-C quoting):

Printing filenames of successfully converted documents:

curl -sSL 'https://raw.githubusercontent.com/freedomofpress/dangerzone/main/tests/test_docs/sample-pdf.pdf' -o $'sample\xf0.pdf'
dangerzone-cli $'sample\xf0.pdf'
Output ``` ╭──────────────────────────╮ │ ▄██▄ │ │ ██████ │ │ ███▀▀▀██ │ │ ███ ████ │ │ ███ ██████ │ │ ███ ▀▀▀▀████ │ │ ███████ ▄██████ │ │ ███████ ▄█████████ │ │ ████████████████████ │ │ ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀ │ │ │ │ Dangerzone v0.6.0 │ │ https://dangerzone.rocks │ ╰──────────────────────────╯ Assigning ID 'D-TevQ' to doc '/tmp/sample_.pdf' Converting document to safe PDF > /usr/bin/podman run --network none -u dangerzone --log-driver none --security-opt no-new-privileges --userns keep-id --cap-drop all --rm -i dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels [doc D-TevQ] 0% Converting page 1/4 to pixels [doc D-TevQ] 12% Converting page 2/4 to pixels [doc D-TevQ] 24% Converting page 3/4 to pixels [doc D-TevQ] 36% Converting page 4/4 to pixels [doc D-TevQ] 49% Converted document to pixels > /usr/bin/podman run --network none -u dangerzone --log-driver none --security-opt no-new-privileges --userns keep-id --cap-drop all --rm -i -v /tmp/tmphot68xih:/safezone:Z -e OCR=0 -e OCR_LANGUAGE=None dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.pixels_to_pdf [doc D-TevQ] 50% Converting page 1/4 from pixels to PDF [doc D-TevQ] 61% Converting page 2/4 from pixels to PDF [doc D-TevQ] 72% Converting page 3/4 from pixels to PDF [doc D-TevQ] 83% Converting page 4/4 from pixels to PDF [doc D-TevQ] 100% Safe PDF created Safe PDF(s) created successfully Traceback (most recent call last): File "/usr/bin/dangerzone-cli", line 33, in sys.exit(load_entry_point('dangerzone==0.6.0', 'console_scripts', 'dangerzone-cli')()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) ^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/dangerzone/errors.py", line 103, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/dangerzone/cli.py", line 101, in cli_main click.echo(document.output_filename) File "/usr/lib/python3.11/site-packages/click/utils.py", line 318, in echo file.write(out) # type: ignore ^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 47, in write self.__convertor.write(text) File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 179, in write self.wrapped.write(text) UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 11: surrogates not allowed ```

Printing filenames of documents that failed to convert:

curl -sSL 'https://raw.githubusercontent.com/freedomofpress/dangerzone/main/tests/test_docs/sample_bad_pdf.pdf' -o $'sample-bad\xf0.pdf'
dangerzone-cli $'sample-bad\xf0.pdf'
Output ``` ╭──────────────────────────╮ │ ▄██▄ │ │ ██████ │ │ ███▀▀▀██ │ │ ███ ████ │ │ ███ ██████ │ │ ███ ▀▀▀▀████ │ │ ███████ ▄██████ │ │ ███████ ▄█████████ │ │ ████████████████████ │ │ ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀ │ │ │ │ Dangerzone v0.6.0 │ │ https://dangerzone.rocks │ ╰──────────────────────────╯ Assigning ID 'UWTbXN' to doc '/tmp/sample-bad_.pdf' Converting document to safe PDF > /usr/bin/podman run --network none -u dangerzone --log-driver none --security-opt no-new-privileges --userns keep-id --cap-drop all --rm -i dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels ERROR [doc UWTbXN] 0% The document format is not supported Failed to convert document(s) Traceback (most recent call last): File "/usr/bin/dangerzone-cli", line 33, in sys.exit(load_entry_point('dangerzone==0.6.0', 'console_scripts', 'dangerzone-cli')()) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) ^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/dangerzone/errors.py", line 103, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/dangerzone/cli.py", line 111, in cli_main click.echo(document.input_filename) File "/usr/lib/python3.11/site-packages/click/utils.py", line 318, in echo file.write(out) # type: ignore ^^^^^^^^^^^^^^^ File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 47, in write self.__convertor.write(text) File "/usr/lib/python3.11/site-packages/colorama/ansitowin32.py", line 179, in write self.wrapped.write(text) UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf0' in position 15: surrogates not allowed ```

Tested on v0.6.0 on Arch Linux. I was unable to reproduce it via GUI as I was not able to select files with invalid UTF-8 sequences in the filename in the file dialog (such files are not displayed - possible Qt issue?). Update: GUI is not affected if passing the filename when starting dangerzone.

Since the filenames are printed at the end after all conversions, IIUC it does not impact the conversions, but the exception prevents the filename and and any subsequent filenames from being printed, and in case all documents were converted successfully it also impacts the exit code.

apyrgio commented 2 months ago

(such files are not displayed - possible Qt issue?)

Ok, that's unexpected.

Since the filenames are printed at the end after all conversions, IIUC it does not impact the conversions, but the exception prevents the filename and and any subsequent filenames from being printed, and in case all documents were converted successfully it also impacts the exit code.

Nice catch! Thanks a lot for reporting this. That was actually a security concern of ours, and we had added a special function to handle any character in a filename that is not printable (see https://github.com/freedomofpress/dangerzone/pull/491/commits/cfa0c01d8fe87d2f3ff187a93a5e984d64352067). So, it seems that we didn't catch everything.

naglis commented 2 months ago

and we had added a special function to handle any character in a filename that is not printable (see cfa0c01). So, it seems that we didn't catch everything.

Indeed, I saw replace_control_chars and was wondering why it was not used in all cases where the filenames are printed, just in announce_id and was meaning to ask it in the PR, but ran into a few test failures. IIUC the issue here is writing surrogate escapes to a stream that does not support it, the circumstances are quite nicely summed up in click.format_filename documentation.