Open mbjones opened 1 year ago
tl;dr - can o' worms; needs prioritizing for further investigation.
At first glance, it seems like it would be ok to loosen sanitization restrictions, because speedbagit doesn't access files on disk. However, given the security implications, I'd be uncomfortable doing this without more in-depth investigation to determine whether this is true in all cases.
Since there are disallowed characters (and even disallowed filenames) that differ across operating systems, it would also take some time to determine which additional characters we could allow (or what subset we should still restrict). See this stackoverflow thread for context.
From a security perspective, the Bagit RFC says:
[5.1]. Special Directory Characters
The paths specified in the payload manifests, tag manifests, and
fetch files do not prohibit special directory characters that have
special meaning on some operating systems. Implementers MUST ensure
that files outside the bag directory structure are not accessed when
reading or writing files based on paths specified in a bag.
All implementations SHOULD have a test suite to guard against special
directory characters.
For example, a maliciously crafted "tagmanifest-sha512.txt" file
might contain entries that begin with a path character such as "/",
"..", or a "~username" home directory reference in an attempt to
cause a naive implementation to leak or overwrite targeted files on a
POSIX operating system.
Windows implementations SHOULD test their implementations to ensure
that safety checks prevent use of drive letters and the less commonly
used namespace sequences (e.g., "\\?\C:\...") described in [MSFNAM].
To assist implementers, the Library of Congress conformance suite
[LC-CONFORMANCE-SUITE] has some tests for invalid bags that are
expected to fail on POSIX or Windows clients.
The above-mentioned Library Of Congress bagit conformance suite looks like it might be useful in creating a more-robust test suite if we go down this path.
(the above-mentioned [MSFNAM] is a broken link.)
@mbjones - tagging you to make sure you saw this
Thanks, I did. My major concern is stripping out commonly used path characters like spaces, and standard Unicode characters used in other languages, such that paths become unrecognizable. If someone were able to process a directory hierarchy that they create via python or R on their machine, we should be able to preserve those paths so their scripts don't break. That's not to say that we don't have to sanitize, but I think our current approach is far too aggressive on what it strips out. As @rushirajnenuji finishes his work on displaying these paths in MetacatUI, this issue will become much more prominent (whereas for now its buried in an ORE doc or in a BagIt download).
adding for reference: The weird world of Windows file paths
It appears the
PackageDownloaderV2.sanitizeFilename
method and thePackageDownloaderV2.sanitizeFilepath
get used to only allow alphanumeric characters (and hyphens and periods) in paths and names in bags. This seems far too overly prescriptive, and is likely to lead to confusion and broken scripts if dataset authors expect names that might contain other punctuation or whitespace and it gets stripped out. @artntek as you are familiar with this code at the moment, what is your thought on this? Is there a security reason to sanitize? how about an illegal character reason for certain filesystems? Can we sanitize less?