NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0
27 stars 13 forks source link

check if sanitize is appropriate in package downloader V2 (and V1?) #1617

Open mbjones opened 1 year ago

mbjones commented 1 year ago

It appears the PackageDownloaderV2.sanitizeFilename method and the PackageDownloaderV2.sanitizeFilepath get used to only allow alphanumeric characters (and hyphens and periods) in paths and names in bags. This seems far too overly prescriptive, and is likely to lead to confusion and broken scripts if dataset authors expect names that might contain other punctuation or whitespace and it gets stripped out. @artntek as you are familiar with this code at the moment, what is your thought on this? Is there a security reason to sanitize? how about an illegal character reason for certain filesystems? Can we sanitize less?

artntek commented 1 year ago

tl;dr - can o' worms; needs prioritizing for further investigation.

At first glance, it seems like it would be ok to loosen sanitization restrictions, because speedbagit doesn't access files on disk. However, given the security implications, I'd be uncomfortable doing this without more in-depth investigation to determine whether this is true in all cases.

Since there are disallowed characters (and even disallowed filenames) that differ across operating systems, it would also take some time to determine which additional characters we could allow (or what subset we should still restrict). See this stackoverflow thread for context.

From a security perspective, the Bagit RFC says:

[5.1].  Special Directory Characters

   The paths specified in the payload manifests, tag manifests, and
   fetch files do not prohibit special directory characters that have
   special meaning on some operating systems.  Implementers MUST ensure
   that files outside the bag directory structure are not accessed when
   reading or writing files based on paths specified in a bag.

   All implementations SHOULD have a test suite to guard against special
   directory characters.

   For example, a maliciously crafted "tagmanifest-sha512.txt" file
   might contain entries that begin with a path character such as "/",
   "..", or a "~username" home directory reference in an attempt to
   cause a naive implementation to leak or overwrite targeted files on a
   POSIX operating system.

   Windows implementations SHOULD test their implementations to ensure
   that safety checks prevent use of drive letters and the less commonly
   used namespace sequences (e.g., "\\?\C:\...") described in [MSFNAM].

   To assist implementers, the Library of Congress conformance suite
   [LC-CONFORMANCE-SUITE] has some tests for invalid bags that are
   expected to fail on POSIX or Windows clients.

The above-mentioned Library Of Congress bagit conformance suite looks like it might be useful in creating a more-robust test suite if we go down this path.

(the above-mentioned [MSFNAM] is a broken link.)

artntek commented 1 year ago

@mbjones - tagging you to make sure you saw this

mbjones commented 1 year ago

Thanks, I did. My major concern is stripping out commonly used path characters like spaces, and standard Unicode characters used in other languages, such that paths become unrecognizable. If someone were able to process a directory hierarchy that they create via python or R on their machine, we should be able to preserve those paths so their scripts don't break. That's not to say that we don't have to sanitize, but I think our current approach is far too aggressive on what it strips out. As @rushirajnenuji finishes his work on displaying these paths in MetacatUI, this issue will become much more prominent (whereas for now its buried in an ORE doc or in a BagIt download).

artntek commented 1 year ago

adding for reference: The weird world of Windows file paths