The following can be tested with the file FETC-LK-23-053-03-RedactedLastname-Extra4-Informed_Consent.docx.
docx2txt can extract text from this file perfectly well, but extraction is never attempted because magic detects the file's mimetype as application/zip. Instead, main.utils.get_document_contents() returns "No text found".
This adds another entry to the list of mimetypes that Word documents occasionally masquerade as. We might consider just letting docx2txt have a go at anything that isn't explicitly a PDF, rather than keeping this list of mimetypes up to date.
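
A rough sketch of what that "try anything that isn't a PDF" approach might look like. The function name, the constants, and the idea of passing the extractor in as a parameter are all hypothetical and not taken from the current code. In the real code the extractor would presumably be docx2txt.process, and the mimetype would come from the existing magic-based detection.

```python
PDF_MIMETYPE = "application/pdf"
NO_TEXT = "No text found"


def extract_non_pdf_text(path, mimetype, docx_extractor):
    """Try the docx extractor on anything not detected as a PDF.

    docx_extractor is any callable that takes a file path and returns
    text (assumption: docx2txt.process in the real code).
    Returns None for PDFs, which are assumed to be handled elsewhere.
    """
    if mimetype == PDF_MIMETYPE:
        return None
    try:
        text = docx_extractor(path)
    except Exception:
        # The file was not a readable Word document after all.
        return NO_TEXT
    # Treat empty or whitespace-only output the same as a failure.
    return text if text and text.strip() else NO_TEXT
```

With this shape, a .docx detected as application/zip would still reach the extractor, and a file that is genuinely not a Word document would fall through to "No text found" rather than raising. The broad except is deliberate in the sketch but may be worth narrowing once we know which exceptions docx2txt actually raises on bad input.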
Marking this as low-hanging fruit.