DH-IT-Portal-Development / ethics

Ethical Committee web application in Django
http://fetc.hum.uu.nl
MIT License

Add application/zip as mimetype for docx files #678

Open miggol opened 3 months ago

The following can be tested with the file FETC-LK-23-053-03-RedactedLastname-Extra4-Informed_Consent.docx.

Although docx2txt can extract text from this file perfectly well, extraction is never attempted because magic detects the file's mimetype as application/zip. Instead, main.utils.get_document_contents() returns "No text found".
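For context: a .docx file is a zip package, which is presumably why magic sometimes reports application/zip for it. The sketch below is not part of this change and uses only the standard library. It shows one way to tell a Word document apart from an arbitrary zip archive by looking for the word/document.xml member. The helper name looks_like_docx is hypothetical.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Return True if data is a zip archive that contains word/document.xml.

    Hypothetical helper for illustration; it does not exist in main.utils.
    """
    buffer = io.BytesIO(data)
    # Anything that is not a zip archive at all cannot be a .docx package.
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    try:
        with zipfile.ZipFile(buffer) as archive:
            # The main body of a Word document lives in this member.
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        # Truncated or corrupt archives are treated as "not a docx".
        return False
```

A check like this could complement the mimetype list, but it only inspects the container and does not guarantee that text extraction will succeed.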

This adds another entry to the list of mimetypes that Word documents occasionally masquerade as. We might consider just letting docx2txt have a go at anything that isn't explicitly a PDF, rather than keeping this list of mimetypes up to date.
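The alternative suggested above could look roughly like the following. This is a minimal sketch, not the actual implementation of main.utils.get_document_contents(): the function name, its parameters, and the injected extractor callables are all assumptions made for illustration. In the real code the docx extractor would presumably be docx2txt.process.

```python
PDF_MIMETYPE = "application/pdf"
NO_TEXT = "No text found"


def get_text_with_fallback(path, mimetype, extract_pdf, extract_docx):
    """Extract text, trying the docx extractor on anything that is not a PDF.

    Hypothetical sketch: extract_pdf and extract_docx are callables that
    take a path and return a string (for example docx2txt.process).
    """
    if mimetype == PDF_MIMETYPE:
        return extract_pdf(path)
    try:
        # No mimetype allow-list: let the docx extractor have a go.
        text = extract_docx(path)
    except Exception:
        # The file was not a readable Word document after all.
        return NO_TEXT
    return text or NO_TEXT
```

The trade-off is that non-Word files now cost one failed extraction attempt, in exchange for no longer maintaining the list of mimetypes by hand.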

Marking this as low-hanging fruit.