digital-preservation / droid

DROID (Digital Record and Object Identification)

Microsoft Word form identified as XML #430

Open elbre opened 4 years ago

elbre commented 4 years ago

Good day. While testing our files, we came across an interesting case. Microsoft Office can create a form that is meant to be filled in later. Such a document is saved as an ordinary .doc file. However, when such a file is run through DROID, it is identified as XML. An LTP system could therefore try to open the file as XML, which would be a serious problem. I am not sure whether this is fixable, or even a bug, but I felt it was important to report it.

sparkhi commented 4 years ago

Hi @elbre, if you have the sample file that you used, and it does not contain any sensitive information, would you be kind enough to attach it to the issue as well? It will really help when someone looks at the issue in the future. Many thanks.

Dclipsham commented 4 years ago

Unable to replicate with form templates on Office 2016. There is a Word XML format (via Save As) that outputs a raw XML file, but it also gives the file an XML extension. That file re-opens as a form in Word but not in any other application (LibreOffice just shows the XML, for instance).

This Word XML format looks like it would be pretty easy to assign a signature to (has a very clear tag '<?mso-application progid="Word.Document"?>' immediately after the XML header) so we'll probably do that anyway, but I'm not sure if this is what's happened here without seeing a sample, or understanding the steps to reproduce...
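For illustration only (this is not DROID or PRONOM code): a minimal Python sketch of the check described above, looking for the mso-application processing instruction shortly after the XML declaration. The function names, the 1024-byte window, and the leniency about leading whitespace are assumptions made for the example, and it only handles ASCII-compatible encodings such as UTF-8, not UTF-16.

```python
import re

# Assumption: the processing instruction sits close to the start of the file,
# right after the XML declaration. The 1024-byte window is an arbitrary choice.
WORD_XML_PI = re.compile(rb'<\?mso-application\s+progid="Word\.Document"\s*\?>')
XML_DECL = re.compile(rb'^\s*<\?xml\b')
UTF8_BOM = b'\xef\xbb\xbf'


def looks_like_word_xml(data: bytes, window: int = 1024) -> bool:
    """Return True if data opens with an XML declaration followed by the Word PI."""
    head = data[:window]
    # Skip a UTF-8 byte order mark if one is present.
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if not XML_DECL.match(head):
        return False
    return WORD_XML_PI.search(head) is not None


def looks_like_word_xml_file(path: str) -> bool:
    """Hypothetical helper: apply the same check to the first bytes of a file."""
    with open(path, 'rb') as fh:
        return looks_like_word_xml(fh.read(1024))
```

Running looks_like_word_xml_file on a file such as the attached sample would show whether it carries the tag, whatever its extension says. A real signature would express the same byte sequence and offset in the signature file rather than in code.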

elbre commented 4 years ago

Hi again, I am sorry it took me a while to get back with a sample. I needed to clear out the sensitive data and checked that the file is still identified as XML by DROID. Attached: text-doc-xml.zip

Dclipsham commented 4 years ago

Thank you. Can you describe the process for creating this file? Do you know which version of MS Office was used?

elbre commented 4 years ago

I do not know the details of how it was created. It is a document from 2010 that is to be archived. It was created in Microsoft Word 97-2003.

jcharlet commented 4 years ago

This Word XML format looks like it would be pretty easy to assign a signature to (has a very clear tag '<?mso-application progid="Word.Document"?>' immediately after the XML header) so we'll probably do that anyway, but I'm not sure if this is what's happened here without seeing a sample, or understanding the steps to reproduce...

So is this something to handle in PRONOM rather than in DROID, @Dclipsham?