freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.59k stars, 170 forks

Support Remaining File Formats that PyMuPDF Supports (MOBI, FB2, CBZ, TXT, PGM, PSD) #660

Open Scripter17 opened 9 months ago

Scripter17 commented 9 months ago

I was a bit surprised to see Dangerzone doesn't support epub files, and even more surprised to see there's not a single issue/PR about it. Unless Dangerzone/Freedom of the Press has some kind of anti-piracy policy (similar to youtube-dl), I see no real reason not to have this.

It may be possible to simply include calibre in the sandbox to support every format it supports that Dangerzone doesn't.

deeplow commented 9 months ago

Thanks for the suggestion! Yes, we do intend to add more formats to Dangerzone. We are currently in the process of replacing our core conversion component with one that supports many other file formats, so adding them will be trivial to implement once that is done.

We'll essentially be able to add the following file formats once this is complete:

[Screenshot: Features Comparison table, PyMuPDF 1.23.8 documentation]

deeplow commented 8 months ago

.cbz we probably won't be able to include at the moment, since it's a zip archive by its mime type and in the container we currently don't have access to the original file extension. According to Wikipedia, this may also be application/vnd.comicbook+zip sometimes. So let's add that for now.

deeplow commented 8 months ago

I couldn't find a mime type for the PAM image format. I'll drop it here. The creators of a library that parses it explain (or add to) this confusion:

The Confusing Universe of Netpbm Formats

It is easy to get confused about the relationship between the PAM format and PBM, PGM, PPM, and PNM. Here is a little enlightenment:

"PNM" is not really a format. It is a shorthand for the PBM, PGM, and PPM formats collectively. It is also the name of a group of library functions that can each handle all three of those formats.

"PAM" is in fact a fourth format. But it is so general that you can represent the same information in a PAM image as you can in a PBM, PGM, or PPM image. And in fact a program that is designed to read PBM, PGM, or PPM and does so with a recent version of the Netpbm library will read an equivalent PAM image just fine and the program will never know the difference.

To confuse things more, there is a collection of library routines called the "pam" functions that read and write the PAM format, but also read and write the PBM, PGM, and PPM formats. They do this because the latter formats are much older and more popular, so even a new program must work with them. Having the library handle all the formats makes it convenient to write programs that use the newer PAM format as well.
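
For reference, the Netpbm formats can at least be told apart by their two-byte magic numbers, even without a registered mime type for PAM. The sketch below is only an illustration of that lookup; `netpbm_kind` is a hypothetical helper, not something Dangerzone or PyMuPDF provides.

```python
from typing import Optional

# Two-byte magic numbers used by the Netpbm family.
# "plain" is the ASCII variant, "raw" the binary one.
NETPBM_MAGIC = {
    b"P1": "PBM (plain)",
    b"P2": "PGM (plain)",
    b"P3": "PPM (plain)",
    b"P4": "PBM (raw)",
    b"P5": "PGM (raw)",
    b"P6": "PPM (raw)",
    b"P7": "PAM",
}


def netpbm_kind(data: bytes) -> Optional[str]:
    """Return the Netpbm variant named by the first two bytes, or None."""
    return NETPBM_MAGIC.get(data[:2])
```

A header check like this only says which variant a file claims to be; it does not validate the rest of the image.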

deeplow commented 8 months ago

Also, MuPDF contains some references to office formats (including .hwp). I wonder what those are for, since MuPDF does not support these formats. One debug comment seems to hint that they convert these files to HTML, but HTML is also not supported :thinking:.

No references point to this actually being supported. So I'll stop digging here. But I found it curious.

deeplow commented 8 months ago

.cbz we probably won't be able to include at the moment, since it's a zip archive by its mime type and in the container we currently don't have access to the original file extension. According to Wikipedia, this may also be application/vnd.comicbook+zip sometimes. So let's add that for now.

I am running into similar issues with the .xps file format. Guessing it from the file contents alone reveals application/zip, similar to what we had experienced with LibreOffice files. And this isn't because of some odd tool doing a bad job: I just converted our sample-docx.docx to .xps with Microsoft Office and it showed application/zip as the mime type.
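
To illustrate why content sniffing alone cannot separate these formats, here is a minimal sketch using only the Python standard library. It builds two in-memory zip archives standing in for a .cbz and an .xps file (the member names and payloads are made up for the example) and shows that both begin with the same zip signature. The helper names are hypothetical and this is not how Dangerzone detects file types.

```python
import io
import zipfile

# Signature of a zip local file header, found at the start of a
# non-empty zip archive.
ZIP_MAGIC = b"PK\x03\x04"


def make_archive(member_name: str, payload: bytes) -> bytes:
    """Build a one-file zip archive in memory and return its raw bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(member_name, payload)
    return buffer.getvalue()


def looks_like_zip(data: bytes) -> bool:
    """True if the data starts with the zip local file header signature."""
    return data.startswith(ZIP_MAGIC)


# Stand-ins for a comic book archive and an XPS package.
comic = make_archive("page-001.png", b"placeholder image bytes")
xps_like = make_archive("[Content_Types].xml", b"<Types/>")

# Both are indistinguishable by their leading bytes.
same_signature = comic[:4] == xps_like[:4]
```

Only the archive members differ, so without the original file extension a detector has to either open the archive or fall back to application/zip.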

deeplow commented 8 months ago

This is proving to be a bit more challenging than I originally anticipated because we have different PyMuPDF versions running, particularly in Qubes OS.
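
One way to cope with mixed versions would be to gate each format on the installed version. The sketch below is an assumption about how that could look, not what Dangerzone does: the minimum versions are copied from the table in this comment (release-candidate suffixes dropped, and .txt set to 1.23.7, the version that worked in practice), and `parse_version` and `supported_formats` are hypothetical helpers. The installed version is passed in as a string so the example does not depend on any particular PyMuPDF attribute.

```python
from typing import List, Tuple

# Minimum versions from the table in this comment. Formats with unknown
# or failing support (.pgm, .jxr) are deliberately left out.
MIN_VERSION = {
    ".psd": (1, 23, 0),
    ".txt": (1, 23, 7),
    ".mobi": (1, 21, 0),
    ".fb2": (1, 10, 0),
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.23.7' or '1.21.0-rc1' into a comparable 3-tuple of ints."""
    core = version.split("-", 1)[0]
    parts = [int(part) for part in core.split(".")]
    # Pad short versions such as '1.10' so tuple comparison is fair.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def supported_formats(installed: str) -> List[str]:
    """List the extensions whose minimum version is met by `installed`."""
    have = parse_version(installed)
    return sorted(ext for ext, need in MIN_VERSION.items() if have >= need)
```

For example, `supported_formats("1.22.5")` would leave out .psd and .txt, which is the kind of per-platform difference described above.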

| File Format | MuPDF Min Supported Version | Notes |
| --- | --- | --- |
| .psd | 1.23.0 (2023-08-22) | |
| .txt | 1.23.6 (according to the changelog) | In practice only works in 1.23.7 |
| .jxr | 1.10-rc1 | Server fails due to missing codec: "code=2: JPEG-XR codec is not available" (on Fedora, installing jxrlib, jxrlib-devel or openjpeg-libs didn't help) |
| .pgm | ? (couldn't find) | |
| .mobi | MuPDF 1.21.0-rc1 | |
| .fb2 | MuPDF 1.10-rc1 | In practice this file cannot be detected by the mimetype alone |

deeplow commented 8 months ago

With .jpx, I couldn't find any documentation on how to convert to this file type on Linux. Supposedly the convert tool can generate such a file, and the magic number of the result matches the one listed in the respective Wikipedia article.

However, PyMuPDF still rejected this file. So I won't be adding it for now.

deeplow commented 7 months ago

Renamed the issue to cover the remaining file formats. See https://github.com/freedomofpress/dangerzone/pull/697