freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.37k stars 153 forks source link

QA and Release 0.6.0 #704

Closed deeplow closed 4 months ago

deeplow commented 4 months ago

Pre-release

Before making a release, all of these should be complete:

QA

To ensure that new releases do not introduce regressions, and support existing and newer platforms, we have to do the following:

(see also the previous QA for 0.5.1)

Publishing the Release

deeplow commented 4 months ago

Fedora 39

QA passed on Fedora 39. However, I noticed that a document failure (sample_bad) yields a different message now due to the page streaming. It says that the conversion was interrupted rather than some file format issue. Not sure if this was intentional. I'd have to check the code. But we may want try and catch this error on the server side and pass it along to the user.

deeplow commented 4 months ago

While testing on ubuntu, I noticed that interrupted conversions (sample_bad...) show a traceback instead of simply an error message. We should show an error message:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/dangerzone/isolation_provider/base.py", line 74, in convert
    self.doc_to_pixels(document, t)
  File "/usr/lib/python3/dist-packages/dangerzone/isolation_provider/base.py", line 106, in doc_to_pixels
    n_pages = read_int(p.stdout)
  File "/usr/lib/python3/dist-packages/dangerzone/isolation_provider/base.py", line 38, in read_int
    raise errors.InterruptedConversionException()
dangerzone.conversion.errors.InterruptedConversionException: Something interrupted the conversion and it could not be completed.
apyrgio commented 4 months ago

Make sure that the tip of the main branch passes the CI tests.

The latest released container image (v0.5.1) does not actually pass security scanning, due to CVE-2023-5841. This will be fixed once we release the new image, but for all intents and purposes, our CI tests now pass.

deeplow commented 4 months ago

Qubes (Fedora 39 template)

Works well but we also need to add our repo before installing the locally built RPM. This is to be expected with the new PySide6 dependency from the dangerzone repos.

Notes:

apyrgio commented 4 months ago

Windows

I've tested building a Docker image on Windows, and we have a minor issue. The end result is a corrupted image. More specifically, the tar archive starts with the following string:

failed to get console mode for stdout: The handle is invalid.

This string is returned by docker save (see https://github.com/docker/for-win/issues/13891). It seems to be a regression in Docker that was introduced recently, and will be fixed soon.

In the meantime, we are not super impacted by this error, because we build this image on a different environment.


EDIT: Turns out that this is not the case. Even if we build the image in a separate environment, the user needs to ingest it. Unfortunately, the Docker call that fetches the current image always includes the faulty string. this means that Dangerzone will always believe that the container image is not installed, and will install it every time.

Note that this is something that affects existing installations as well, that update to the latest Docker Desktop.

deeplow commented 4 months ago

Note that this is something that affects existing installations as well, that update to the latest Docker Desktop.

This sucks. We can tackle this in code, but this adds one more reason to implement https://github.com/freedomofpress/dangerzone/issues/693

apyrgio commented 4 months ago

The macOS QA has finished successfully, and with that, we're done with the first round of the QA. On slightly related note, I found out that we couldn't select any .svg and .bpm files via the file browser. Turns out that we had omitted those by mistake. I sent a PR for this: https://github.com/freedomofpress/dangerzone/pull/722

deeplow commented 4 months ago

Large tests resuts

The large test set is a large set of documents. I ran the smallest of the sets, containing a total of 4962.

Comparing v0.5.0 to v0.6.0 we have the following

commit_v0.5.0-15-g8d70393 commit_v0.6.0-rc2-1-gd1000f8
full test time 9740.880 33984.981 *
test failures 66 49 **

* This means that some tests are taking and absurd amount of time. And because we removed timeouts some documents just took forever.

** The number of failures are the exact same when we consider that we removed timeouts. In v0.5.0 17 tests would just timeout while now they pass but take a long time. See


Why the time difference?

Digging into the data we have the following:

results

The files are sorted by the time they took and these were the ones that took the longest. As you can see v0.6.0 (with PyMuPDF) has 17 files that take 1000+ seconds. So much time so that without these slow tests these would be 4737, which is about half of what v0.5.0. This is consistent with some results we did previously, it's just that when we did those we still had timeouts and so they'd be terminated and fail after 65 seconds, while they now pass but take forever

Conclusion

PyMuPDF doesn't seem to add more time to the conversions nor increase the failures. Quite the contrary.

deeplow commented 4 months ago

:white_check_mark: Update the Installing Dangerzone page [nothing changed]

@apyrgio I checked this one because I don't think we needed to change anything there. Pinging you to double-check when you have availability.

GWeck commented 4 months ago

Upgrading Dangerzone 0.5.0 to 0.6.0 on Fedora 38 may / will break the OCR component. This can be easily fixed by appending the line

    export TESSDATA_PREFIX=/usr/share/tesseract/tessdata

to the file .bash_profile in the disposable template used for the Dangerzone dispVM.

apyrgio commented 4 months ago

Great catch! We had tested on Qubes a dev build for Fedora 38 templates, and a production build for Fedora 39 templates. And yet we missed it :grimacing: . The reason we missed it is:

  1. The PyMuPDF version on Fedora 39 is 1.23.3, which accepts the Tesseract data path as a separate argument. Our code checks for the PyMuPDF version and does pass the correct path:

    https://github.com/freedomofpress/dangerzone/blob/d35eb56b4b53644bff78d4c79cf0cb6d08fae9c9/dangerzone/conversion/common.py#L24-L28

    https://github.com/freedomofpress/dangerzone/blob/d35eb56b4b53644bff78d4c79cf0cb6d08fae9c9/dangerzone/conversion/pixels_to_pdf.py#L60-L72

  2. The dev script on Qubes has this line, so that's why the issue did not manifest on our local tests:

    https://github.com/freedomofpress/dangerzone/blob/d35eb56b4b53644bff78d4c79cf0cb6d08fae9c9/dev_scripts/dangerzone#L7-L9

deeplow commented 4 months ago

I have moved the above issue to its own issue https://github.com/freedomofpress/dangerzone/issues/737

GWeck commented 4 months ago

I checked now with Qubes R4.2.0 and Fedora 39. There everything is o.k.

GWeck commented 4 months ago

Now I checked the new version with Fedora 38. It's o.k., too. Thank you for correcting this so quickly!