NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0
26 stars 12 forks source link

getPackage implementation doesn't handle duplicate filenames #1443

Closed amoeba closed 1 year ago

amoeba commented 4 years ago

While doing testing on https://github.com/NCEAS/metacatui/issues/1397, I noticed some inconsistent behavior with the getPackage implementation: If two objects in a package share the same fileName property in sysmeta, you don't see both objects in the data folder or in the manifest-{algo}.txt file. But you do see both in the pid-mapping.txt file. A quick glance at the code backs this up.

AFAIK there's nothing codified in the DataONE API as to what the expected behavior is. This is definitely an edge case but I think we can at least do something like myfile, myfile (1), etc or myfile-{pid1}, myfile-{pid2}, etc when we have duplicates.

mbjones commented 4 years ago

Thanks @amoeba for the report. This is a special case of the larger problem of file naming consistency in user-supplied data packages. We also need to be watching out for cases where the filename is not legal on a given platform, or even is maliciously crafted to exploit security bugs. So, all package naming in both our get() and getPackage() operations needs to use a filename mangling routine that deals with these use cases consistently. At a minimum, deal with:

1) filename uses illegal characters 2) filename is too long for platform 3) filename is not unique 4) filename is/isn't unique on case-insensitive filesystems 5) filename has malicious payload designed to exploit weaknesses

Probably others I am not thinking of. At least some of these I think are already handled in get() when setting Content-Disposition headers I think.

ThomasThelen commented 3 years ago

I can definitely handle this case in SpeedBagIt, which currently throws java.util.zip.ZipException: duplicate entry when two files are written to the same location. This behavior is largely undefined when it comes to the BagIt standard and I don't think that the library can do much else.

Solution 1: Don't fix it

Catch the exception from SpeedBagIt and re-throw to the user letting them know that they have a package that is skewed (they're making two conflicting claims about a file). The problem with this is that it's a hassle/burden on the user to have to fix the sysmeta and issues arise if one of the files is used in another package where the path is perfectly valid in that context.

Solution 2: Check the filenames before adding to SpeedBagIt

As @amoeba mentioned we can rename the files to something that we think is sensible. Renaming the file will break consistency with the sysmeta document that's also stored in the bag which might be considered a bug, but at least the user is actually getting the data with slightly misleading metadata. This may be an issue if bags are ever consumed by DataONE because there would be that mismatch of filenames on disk and filenames claimed by the sysmeta documents (The resource map is of course connected to the sysmeta annd potentially connected to the file via prov:atLocation. The sysmeta is potentially connected to the file via the filename property).

If we agree on a convention for renaming the files as they're placed in the bag I can add that check. It sounds like the rules are

If the filenames match, take the second and

  1. Remove the extension
  2. Add a space
  3. Use (1)...(n)
  4. Append the extension
amoeba commented 3 years ago

It's pretty common for clients in these scenarios to receive a suggested file name but do something else based upon context (ie web browsers). I think the same reasoning applies here. Plus, fileName is defined as a "value providing a suggested file name for the object" and nothing more.

The authoritative mapping between a file in a Bag and its system metadata is its PID and the respective line in the pid-mapping.txt file so differences between any object's fileName property and its location on disk can be handled there. Plus, fileName could change after Bag creation so our approach should be tolerant of this.

I favor Solution 2 with sub-option (3).

amoeba commented 2 years ago

While doing some testing for https://github.com/NCEAS/metacatui/issues/1901, I ran into what looks like a regression in getPackage since the changes that went into to support SpeedBagIt as the underlying BagIt implementation. When a Data Package contains members with identical fileName properties in System Metadata, calls togetPackage hang indefinitely.

I haven't done a lot of testing here but I can reproduce this on test.arcticdata.io, which is running 2.15.1, by creating a package from R like:

library(dataone)
library(arcticdatautils)
# remember to set your auth token
testarctic <- MNode("https://test.arcticdata.io/metacat/d1/mn/v2")

# create the package
dummy_package <- create_dummy_package(testarctic, size = 3)

# make the data files all have the same fileName of "data"
vapply(dummy_package$data, function(id) { set_file_name(testarctic, id, "data")}, TRUE)

Then, a call like https://test.arcticdata.io/metacat/d1/mn/v2/packages/application%2Fbagit-097/urn%3Auuid%3A25029db7-fac8-4f00-a7bb-bab306685680 just hangs indefinitely. When I run it with curl, the response never starts.

ThomasThelen commented 2 years ago

Hmm interesting. I know I let java.util.zip.ZipException throw when there's a duplicate (this is built into the zip library). I just created a new branch with a unit test that checks this here.

What's weird is that the test fails

Screen Shot 2021-10-18 at 5 36 09 PM

What I'm confused about is that I'm catching IOException here, and ZipException is just a type of IOException,

Screen Shot 2021-10-18 at 5 39 23 PM

Any ideas why that might be? At the end of the day, Metacat should be responding to this exception

edit: After writing this post I noticed I was still failing the unit test on IOExeption-I pushed a second commit fixing that (but it still fails)

ThomasThelen commented 2 years ago

As an update, @amoeba noticed the unit test above is actually hanging, possibly a deadlock with the piped stream. I've filed an issue here to track the progress of the fix.

mbjones commented 1 year ago

Speed-bagit 1.0.3 now throws an exception when a request to add a duplicate filename is made -- see https://github.com/DataONEorg/speed-bagit/issues/27 and https://github.com/DataONEorg/speed-bagit/issues/23. So now, it is Metacat's responsibility to handle the issue when a duplicate occurs. For example, when Metacat is creating a bagit file, an exception on addFile will need to be addressed, either by aborting the request altogether and sending back an appropriate error, or by modifying the filename and adding it to the bag (possibly with some sort of an error indication in the bag). I think the latter case is more useful to end users but its not a real representation of the data package. Still. @taojing2002 let's discuss getting this in the next metacat release please.

mbjones commented 1 year ago

I started a new branch, feature-1443-duplicate-filenames, for this work that uses the speed-bagit 1.0.3 release. I think I caught the places where files are added to speedbagit. It should now attempt to rename the files once before giving up on duplicates. If for some reason the bag creation still fails on duplicates, we should now throw a service failure, so we should no longer be in a continuous loop, or fail silently.

@taojing2002 @artntek I did not have an adequate build and test setup to test these metacat changes adequately, so consider this branch a starting point for testing and documentation. Could you run the tests? I started PR #1614 as a place to record issues and comments on the code changes.

artntek commented 1 year ago

See PR #1614