freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.39k stars 155 forks

XML/ZIP/OctetStream File Mime Type Disambiguation #688

Open deeplow opened 5 months ago

deeplow commented 5 months ago

Many files are in fact zip files with a particular internal structure, which is then interpreted by the file viewer. This makes the MIME type of these file types particularly difficult to identify, because the same zip container can hold multiple file formats. Examples of files that have the application/zip MIME type:

Then we also have files which are XML-based and are thus identified as text/xml:

My suggestion would be to somehow expose the file extension to the conversion process to help identify the file type.
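
A minimal sketch of what that could look like, assuming the conversion process receives the original filename next to the detected MIME type. The function name `disambiguate_mime_type`, the set of ambiguous types, and the extension table are all hypothetical and only illustrate the idea; they are not Dangerzone's actual code or its supported-format list.

```python
from pathlib import PurePath

# Assumption: these are the detected types that are too generic to act on.
AMBIGUOUS_MIME_TYPES = {
    "application/zip",
    "text/xml",
    "application/octet-stream",
}

# Assumption: an illustrative (not exhaustive) extension-to-MIME-type table.
EXTENSION_HINTS = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".svg": "image/svg+xml",
}


def disambiguate_mime_type(detected: str, filename: str) -> str:
    """Refine an ambiguous detected MIME type using the file extension.

    The extension is only consulted when content-based detection returned
    a generic container type; a specific detection result is never overridden.
    """
    if detected not in AMBIGUOUS_MIME_TYPES:
        return detected
    extension = PurePath(filename).suffix.lower()
    # Fall back to the detected type when the extension is unknown.
    return EXTENSION_HINTS.get(extension, detected)


if __name__ == "__main__":
    print(disambiguate_mime_type("application/zip", "report.DOCX"))
    print(disambiguate_mime_type("text/xml", "logo.svg"))
    print(disambiguate_mime_type("application/pdf", "renamed.docx"))
```

Since the extension comes from the untrusted document, this sketch treats it as a hint only: it narrows an already ambiguous result and never replaces a specific type found by content inspection.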