Closed by laissezfarrell 8 years ago
@laissezfarrell What is the desired behavior when an encrypted Office file is detected? Reject the submission? Set a property indicating that the file is encrypted? Or ...?
That's a good question.
For students self-submitting papers, I'd like the submission to be rejected. With born-digital archival holdings, I'd like to know that a file is encrypted but would still want to preserve it. There's been some discussion in Rubenstein about whether we are within our rights to decrypt files if we have not been granted specific permission by the creator.
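The two cases above can be sketched as a small decision table. This is only an illustration of the stated preference (reject encrypted student self-submissions; keep encrypted archival files but record that they are encrypted). Every name here (`SubmissionSource`, `Action`, `EncryptedFilePolicy`) is hypothetical and does not come from any existing code, and how the `encrypted` flag gets computed is left open.

```java
// Hypothetical types for illustration only; none of these exist in a real codebase.
enum SubmissionSource { STUDENT_SELF_SUBMISSION, BORN_DIGITAL_ARCHIVE }

enum Action { ACCEPT, REJECT, PRESERVE_AND_FLAG }

final class EncryptedFilePolicy {

    // Map (where the file came from, whether it was detected as encrypted)
    // to the handling described in the comment above.
    static Action decide(SubmissionSource source, boolean encrypted) {
        if (!encrypted) {
            // Unencrypted files go through the normal workflow in both cases.
            return Action.ACCEPT;
        }
        switch (source) {
            case STUDENT_SELF_SUBMISSION:
                // Students can be asked to resubmit an unencrypted copy.
                return Action.REJECT;
            case BORN_DIGITAL_ARCHIVE:
                // Archival files are kept as received; the encryption is only recorded.
                return Action.PRESERVE_AND_FLAG;
            default:
                throw new IllegalArgumentException("Unknown source: " + source);
        }
    }
}
```

Note that the archival branch never attempts decryption, which keeps the open rights question out of the ingest step.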
Tika detects encrypted docs, but throws an exception:
Exception in thread "main" org.apache.tika.exception.EncryptedDocumentException: Unable to process: document is encrypted
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:245)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
    at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:142)
    at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:418)
    at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:112)
I would suggest wrapping this into a comprehensive file characterization regime. @jt112
Sounds reasonable. Do you happen to know if any of the tools that FITS integrates with can identify encrypted files?
I believe that recent versions of FITS incorporate tools with this capability, but I need to confirm.
In addition to Tika, you might be interested in "Tools for identifying obfuscated files, specifically password protected and encrypted formats?" from Digital Preservation Q&A: http://qanda.digipres.org/588/identifying-obfuscated-specifically-protected-encrypted.
@laissezfarrell As of this date, catching the Tika EncryptedDocumentException still seems to be the best option.
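When Tika is embedded as a library, this means catching the exception around the parse call. Where Tika is instead run as a standalone command-line jar, as in the trace quoted earlier in this thread, a rough equivalent is to look for the exception's class name in the process's error output. The sketch below takes the second route using only the JDK (Java 9 or later). The class `TikaCliEncryptionCheck`, its method names, and the `tikaAppJar` path are hypothetical, and it assumes a `java` executable on the PATH and a Tika app jar that accepts a file path as its argument.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for illustration; not part of Tika or of any existing repository.
final class TikaCliEncryptionCheck {

    // Fully qualified exception name, as it appears in the trace quoted in this thread.
    static final String ENCRYPTED_MARKER =
            "org.apache.tika.exception.EncryptedDocumentException";

    // True if the captured error output mentions the encrypted-document exception.
    static boolean reportsEncrypted(String cliErrorOutput) {
        return cliErrorOutput != null && cliErrorOutput.contains(ENCRYPTED_MARKER);
    }

    // Run the Tika app jar on one file and inspect its stderr.
    // tikaAppJar is an assumed location of a locally available Tika app jar.
    static boolean isEncrypted(Path tikaAppJar, Path file)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(
                List.of("java", "-jar", tikaAppJar.toString(), file.toString()))
                // Discard extracted content so document text cannot trigger a false match
                // and an unread stdout pipe cannot block the child process.
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        String stderr = new String(
                process.getErrorStream().readAllBytes(), StandardCharsets.UTF_8);
        process.waitFor();
        return reportsEncrypted(stderr);
    }
}
```

Matching on the exception name is brittle if the message format changes between Tika versions, so an in-process catch is preferable wherever Tika is already on the classpath.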
Moved to Jira.
As per Winston's link, Apache Tika or a similar application would detect when an encrypted Office file is submitted.
Students submitting ETDs to DukeSpace have submitted encrypted Excel documents in the past, and it is possible that born-digital archival holdings may include encrypted files.