digital-preservation / droid

DROID (Digital Record and Object Identification)
BSD 3-Clause "New" or "Revised" License
Particular XLSX file not recognised #20

Closed davidread closed 11 years ago

davidread commented 11 years ago

We have a government Excel spreadsheet that is not recognised by Droid.

The file is: http://data.defra.gov.uk/QDS/defra-qds-1204.xlsx and we have a cached copy linked from here: http://data.gov.uk/dataset/defra-business-plan-quarterly-data-summary/resource/2dc039d2-8da4-4ca9-95d8-1ba65e23815b

When using Droid 6.1 with signature file version 65, it reports that the file is OLE2, but not that it is an Excel document. It opens fine in LibreOffice, but not in oletools or xlrd. It contains images, so maybe that is foxing them?

oletools: https://bitbucket.org/decalage/oletools
xlrd: http://pypi.python.org/pypi/xlrd

Dclipsham commented 11 years ago

Hello David, Thanks for this submission. The part that is foxing DROID is the password protection. Protecting the workbook seems to (perhaps unsurprisingly) totally alter the internal structure of the file, so the bit that DROID looks for in order to ID the file either isn't there or is obfuscated. I am able to recreate this issue by creating a new .xlsx file and password-protecting it.

I'll need to investigate further with a view to creating a new container signature to accommodate password protected Excel files. I'll also take the opportunity to observe differences with the rest of the MS Office format family.

David

davidread commented 11 years ago

David, Very interesting to hear about the password protection! Unfortunately I don't have a copy of Excel to try this with myself. Thanks for digging into this, and let me know if you have any luck deciphering the format in this case. David

Dclipsham commented 11 years ago

Hi David,

I've spent the day investigating. Encrypting Word and PowerPoint documents results in exactly the same internal structure as the example you have provided, and the same results via DROID (i.e. identification as OLE2 only).

The encrypted document conforms to the MS-OFFCRYPTO specification, as found here: http://msdn.microsoft.com/en-us/library/cc313071(v=office.12).aspx

Our traditional method of identifying and distinguishing specific kinds of OOXML document (i.e. Word, Excel, PowerPoint) relies on seeking and analysing specific elements within the document. Encryption obfuscates this information, so it does not currently seem possible to determine whether an encrypted file is an Excel, Word, or PowerPoint document, other than from the file extension.
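As a rough illustration of the container-level distinction, here is a minimal Python sketch. It assumes that an unencrypted OOXML file is a ZIP package, and that an encrypted one is an OLE2 compound file carrying streams named "EncryptionInfo" and "EncryptedPackage", as MS-OFFCRYPTO describes. The function name and the returned labels are hypothetical, and scanning the raw bytes for stream names is only a heuristic: it is not how DROID container signatures work, and a robust check would parse the compound file directory properly.

```python
# Well-known magic numbers for the two container types.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
ZIP_MAGIC = b"PK\x03\x04"

# OLE2 directory entries store stream names as UTF-16LE.
ENCRYPTION_INFO = "EncryptionInfo".encode("utf-16-le")
ENCRYPTED_PACKAGE = "EncryptedPackage".encode("utf-16-le")


def classify_office_container(data: bytes) -> str:
    """Return a coarse, hypothetical label for the container type.

    Heuristic only: the stream names are searched for anywhere in the
    bytes rather than being read from the OLE2 directory.
    """
    if data.startswith(ZIP_MAGIC):
        return "zip"  # plain OOXML packages are ZIP files
    if data.startswith(OLE2_MAGIC):
        if ENCRYPTION_INFO in data and ENCRYPTED_PACKAGE in data:
            return "ole2-encrypted-ooxml"
        return "ole2"
    return "unknown"
```

Even when this returns "ole2-encrypted-ooxml", it says nothing about whether the payload is Word, Excel, or PowerPoint, which is the limitation described above.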

For the next signature release then (probably due early Feb), we'll be adding this in as a new format and modifying the container signature. If we discover a way to identify the underlying file type beyond the file extension, we can add this in at a later date.

Thanks again for raising this. It is important for us to recognise when an Office document has encryption in place, so this find has real value for us.

David

davidread commented 11 years ago

Great to hear it! It would be a bit rubbish if MS demands you decrypt the data before even telling you what program to open it in, but there'd not be much Droid can do about it in that case.

I'm just a little confused since when I open that example file in LibreOffice it opens a sheet fine. Are there other sheets in the file which are protected that LO isn't telling me about? Or is LO somehow decrypting it automatically?! (I don't have access to a recent version of Excel to compare.)

Dclipsham commented 11 years ago

Remember that in most cases the file extension is correct. In this instance the file has the .xlsx extension and your machine has LibreOffice associated with that extension, so you don't encounter the difficulty. If the file extension were .xyz and you didn't have a compatible program associated with it, you'd probably struggle. That is why DROID is useful in these instances, if only to give you a clue as to which software to try.

It appears that, in this instance, elements of the workbook are protected from editing, rather than the whole document being encrypted.

In protecting the workbook, Excel seems to encrypt the entire document in the background, yet you are only prompted for a password should you attempt to change anything, at least in Excel. LibreOffice (3.6) just tells me that a cell is protected and allows me to remove the protection and change content without a password prompt!

However, if you encrypt the entire document in Excel with password protection, LibreOffice does then prompt you for a password when you attempt to open it.

To confuse matters more, if you change the extension of this QDS file to .docx, then Word prompts you for the password before it will open, yet LibreOffice just opens the file (in LO Writer, rather than Calc).

Word seems to behave differently. Protecting sections of a Word document does not seem to have the same effect as with Excel (i.e. altering the entire internal structure); encrypting the document, however, does.

It's a weird inconsistency, but very useful to encounter and attempt to understand.

davidread commented 11 years ago

Fantastic analysis here, David! Although we only have one file with this problem, no doubt this improvement will benefit others too.

At data.gov.uk we're using Droid to check files that are downloaded, and we are wary of filename extensions. Sometimes these don't come through, because the publisher uses a CMS or has server header issues. Or sometimes an error HTML page is returned, but with status 200 OK, so you receive "file.csv" which contains the error HTML. So Droid is proving invaluable for checking the actual content type.
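For the "error page served as file.csv" case specifically, a crude pre-check can be sketched in a few lines of Python. This is only an illustration, not what data.gov.uk or Droid actually does: the function name, the 1024-byte probe size, and the two markers tested for are all assumptions.

```python
def looks_like_html(data: bytes, probe: int = 1024) -> bool:
    """Guess whether downloaded bytes are really an HTML page.

    Only the first `probe` bytes are inspected, after dropping a UTF-8
    byte order mark and leading whitespace. Heuristic only.
    """
    head = data[:probe].lstrip(b"\xef\xbb\xbf").lstrip().lower()
    return head.startswith((b"<!doctype html", b"<html"))
```

A download named "file.csv" for which this returns True is almost certainly not a CSV, whatever the status code or extension says. It will miss HTML fragments that start with some other tag, so it complements content identification rather than replacing it.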