duke-libraries / dul-hydra

Duke Digital Repository Administrative Hydra Head
BSD 3-Clause "New" or "Revised" License

encrypted Office files detection #900

Closed: laissezfarrell closed 8 years ago

laissezfarrell commented 10 years ago

As per Winston's link, Apache Tika or a similar application could detect when an encrypted Office file is submitted.

Students submitting ETDs to DukeSpace have submitted encrypted Excel documents in the past, and it is possible that born-digital archival holdings may include encrypted files.

coblej commented 10 years ago

@laissezfarrell What is the desired behavior when an encrypted Office file is detected? Reject the submission? Set a property indicating that the file is encrypted? Or ...?

laissezfarrell commented 10 years ago

That's a good question.

For students self-submitting papers, I'd like the submission to be rejected. With born-digital archival holdings, I'd like to know that a file is encrypted but would still want to preserve it. There's been some discussion in Rubenstein about whether we are within our rights to decrypt files if we have not been granted specific permission by the creator.
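To make the two cases concrete, here is a minimal sketch of that policy in Python. It assumes the caller already knows whether the file is encrypted and where it came from; the names `Source`, `Action`, and `decide` are hypothetical and are not part of dul-hydra.

```python
from enum import Enum


class Source(Enum):
    """Where the file came from (hypothetical categories for illustration)."""
    SELF_SUBMISSION = "self_submission"   # e.g. a student submitting an ETD
    ARCHIVAL = "archival"                 # e.g. born-digital archival holdings


class Action(Enum):
    """What to do with the file (hypothetical outcomes for illustration)."""
    ACCEPT = "accept"
    REJECT = "reject"
    PRESERVE_AND_FLAG = "preserve_and_flag"


def decide(encrypted: bool, source: Source) -> Action:
    """Map an encryption finding to the behavior described in this comment."""
    if not encrypted:
        return Action.ACCEPT
    if source is Source.SELF_SUBMISSION:
        # Self-submitted papers that are encrypted are rejected.
        return Action.REJECT
    # Archival files are kept, but recorded as encrypted; no decryption is attempted.
    return Action.PRESERVE_AND_FLAG
```

For example, `decide(True, Source.ARCHIVAL)` yields `Action.PRESERVE_AND_FLAG`, which leaves the question of whether to decrypt to a separate, human decision.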

coblej commented 10 years ago

http://tika.apache.org

dchandekstark commented 10 years ago

Tika detects encrypted docs, but throws an exception:

Exception in thread "main" org.apache.tika.exception.EncryptedDocumentException: Unable to process: document is encrypted
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:245)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)

dchandekstark commented 9 years ago

I would suggest wrapping this into a comprehensive file characterization regime. @jt112

jimtuttle commented 9 years ago

Sounds reasonable. Do you happen to know if any of the tools that FITS integrates with can identify encrypted files?

dchandekstark commented 9 years ago

I believe that recent versions of FITS incorporate tools with this capability, but I need to confirm.

WinstonAtkins commented 9 years ago

In addition to Tika, you might be interested in "Tools for identifying obfuscated files, specifically password protected and encrypted formats?" from Digital Preservation Q&A: http://qanda.digipres.org/588/identifying-obfuscated-specifically-protected-encrypted.

dchandekstark commented 8 years ago

@laissezfarrell As of this date, catching the Tika EncryptedDocumentException still seems to be the best option.
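As a rough illustration of that approach, the Python sketch below does not catch the exception in-process. Instead it runs the Tika command-line app and looks for the exception class name in its error output, as in the stack trace earlier in this thread. The jar location `tika-app.jar` and the function names are assumptions; Java code calling Tika directly would catch `EncryptedDocumentException` itself.

```python
import subprocess

# Fully qualified exception name, as it appears in the stack trace above.
ENCRYPTED_MARKER = "org.apache.tika.exception.EncryptedDocumentException"


def output_indicates_encryption(stderr_text: str) -> bool:
    """Return True if Tika's error output reports an encrypted document."""
    return ENCRYPTED_MARKER in stderr_text


def is_encrypted(path: str, tika_jar: str = "tika-app.jar") -> bool:
    """Run the Tika CLI on `path` and report whether it raised the exception.

    Assumes `java` is on the PATH and `tika_jar` points to a Tika app jar
    (hypothetical location). Text extraction is requested only to force a
    full parse; the extracted text itself is discarded.
    """
    result = subprocess.run(
        ["java", "-jar", tika_jar, "--text", path],
        capture_output=True,
        text=True,
    )
    return output_indicates_encryption(result.stderr)
```

A caller could combine `is_encrypted(path)` with whatever rejection or flagging rule is chosen for each kind of submission. Other Tika failures, such as a corrupt file, are not distinguished here and would need their own handling.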

dchandekstark commented 8 years ago

Moved to Jira.