NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

MNodeService.getPackage() takes too long for large packages #1167

Closed mbjones closed 3 years ago

mbjones commented 6 years ago

Author Name: Chris Jones (Chris Jones). Original Redmine Issue: 7178, https://projects.ecoinformatics.org/ecoinfo/issues/7178. Original Date: 2017-03-29. Original Assignee: Chris Jones.


When users click the Download All button in MetacatUI, we call MN.getPackage() to zip up the package members, create HTML and PDF versions of the metadata, etc. For very large packages, the packaging time is far too long for a decent user experience. To address this, we may need to provide some sort of progress API call that lets the client get an estimated packaging time and show a progress bar to the user.

Also, getPackage() uses File.createTempFile() to copy contents into a single directory tree for zipping (really BagIt bagging). This doesn't scale well for large packages (GBs). We can explore a few strategies to mitigate this; one that comes to mind is using hard or symbolic links to the original data files in the directory tree rather than copying them. This needs some thought, but ultimately we need to speed up the packaging process for large packages.

mbjones commented 6 years ago

Original Redmine Comment. Author Name: ben leinfelder (ben leinfelder). Original Date: 2017-03-29T16:18:36Z.


If you can figure out how to reference files rather than copy them, I FULLY support that! I suppose the MN.getPackage() method just needs to pull directly from the Metacat filesystem, and instead of File.createTempFile() we'd use Files.createSymbolicLink() or Files.createLink(), as described here: https://docs.oracle.com/javase/tutorial/essential/io/links.html

The one drawback is that we wouldn't benefit from the access rule checking that is inherent in the current implementation, which just repeatedly calls MN.get() to fetch the bytes of the data.
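A minimal sketch of that link-instead-of-copy idea, assuming the bag is still assembled in a staging directory; the addToBag helper and the example paths here are hypothetical, not Metacat's actual code:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch only: place a package member into the bag's data directory without
// copying its bytes, falling back to a copy when linking isn't possible.
public class LinkInsteadOfCopy {

    static Path addToBag(Path metacatDataFile, Path bagDataDir, String memberName) throws IOException {
        Files.createDirectories(bagDataDir);
        Path target = bagDataDir.resolve(memberName);
        try {
            return Files.createLink(target, metacatDataFile);             // hard link (same filesystem only)
        } catch (UnsupportedOperationException | IOException e) {
            try {
                return Files.createSymbolicLink(target, metacatDataFile); // fall back to a symlink
            } catch (UnsupportedOperationException | IOException e2) {
                return Files.copy(metacatDataFile, target);               // last resort: copy as before
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical locations, for illustration only.
        Path source = Paths.get("/var/metacat/data/autogen.2017.1");
        Path bagData = Paths.get("/tmp/package-bag/data");
        System.out.println("Placed: " + addToBag(source, bagData, "autogen.2017.1"));
    }
}
```

A hard link avoids a second copy of the bytes but requires the bag directory and the data store to sit on the same filesystem; a symlink relaxes that, but the zipping step then has to follow links. Either way, as noted above, this bypasses the access checks that MN.get() would normally enforce.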

ThomasThelen commented 4 years ago

The real issue is that we have a dependency on the Library of Congress BagIt Library, which only supports files on disk (so no passing streams to it). There's also the BagIt requirement that hashes are computed for each file (for the tag and data manifest files), which requires a full read at some point, so as long as BagIt is in the picture there's going to be at least one performance hit.

The solution that @csjx and I discussed involved forking the LoC BagIt Library and adding methods (or overloading the existing ones) that take streams. I haven't done a deep dive into the LoC code base, but it's a large enough change that it warrants a separate pull request outside of the atLocation changes.
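For illustration only, a stream-oriented entry writer could look roughly like this, hashing each member in the same pass that writes it into the zip; this is a sketch, not the LoC BagIt Library's or SpeedBagIt's actual API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch only: write one payload entry straight into a zip stream while
// computing its SHA-256 in the same pass, so each file is read exactly once.
public class StreamingEntrySketch {

    /** Streams 'in' into the zip as data/<name> and returns the hex SHA-256 for the manifest. */
    static String addDataEntry(ZipOutputStream zip, String name, InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        zip.putNextEntry(new ZipEntry("data/" + name));
        try (DigestInputStream digesting = new DigestInputStream(in, sha256)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = digesting.read(buf)) != -1) {
                zip.write(buf, 0, n);   // the digest updates as a side effect of reading
            }
        }
        zip.closeEntry();
        StringBuilder hex = new StringBuilder();
        for (byte b : sha256.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();          // one manifest-sha256.txt line: "<hex>  data/<name>"
    }
}
```

That keeps the one unavoidable read needed for the checksum but drops the extra copy into a temp directory; and per the next comment, even that read can be skipped when a usable checksum is already recorded in system metadata.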

mbjones commented 4 years ago

We already have checksums in place in our system metadata, so you shouldn't technically have to recalculate those to make the BagIt manifests, and it is really inefficient to recalculate them.
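As a sketch of that, assuming the stored checksums can be looked up per package member (the map below stands in for whatever lookup Metacat would actually use):

```java
import java.io.IOException;
import java.io.Writer;
import java.util.Map;

// Sketch only: build the BagIt payload manifest from checksums that are
// already recorded in system metadata, instead of re-reading every file.
public class ManifestFromSystemMetadata {

    /** Writes one manifest-sha256.txt line per member: "<checksum>  data/<fileName>". */
    static void writePayloadManifest(Writer manifest, Map<String, String> sha256ByFileName)
            throws IOException {
        for (Map.Entry<String, String> member : sha256ByFileName.entrySet()) {
            manifest.write(member.getValue() + "  data/" + member.getKey() + "\n");
        }
    }
}
```

One wrinkle: the algorithm recorded in system metadata varies by object (often MD5 or SHA-1 rather than SHA-256), so the manifest file would need to be named for whichever algorithm is on record, or the value recomputed when it doesn't match the chosen manifest algorithm.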

I've been watching the work on contentid and hash-archive for DataONE packages as well, and I think natively supporting those would be great for DataONE -- in particular, keeping multiple checksum values in our metadata to support retrieval and cross-site replication detection. For example, here's the replica list for the Vostok ice core data file showing its location in many repositories (including some defunct locations):

https://hash-archive.org/sources/hash://sha256/9412325831dab22aeebdd674b6eb53ba6b7bdd04bb99a4dbb21ddff646287e37

mbjones commented 3 years ago

@ThomasThelen I assume your work on speedbagit will take care of this issue, right?

ThomasThelen commented 3 years ago

I think it should. It certainly fixes all of the file copying that's going on. I'm making some last refactors to the unit tests to support Java 8, and then it's good to go (expecting to have Travis working by the end of today).

ThomasThelen commented 3 years ago

Actually, I'm going to say that the solution to this problem is to replace the current Metacat download implementation (let's call it v1) with SpeedBagIt. Once that's complete, this should no longer be an issue. From there, hierarchical package work can be included for a functioning v1 & v2 format.

ThomasThelen commented 3 years ago

This issue should be fixed in #1486. I think it would be okay to close this issue and revisit it if people still have issues. With the linked PR, the filesystem writes are no longer happening (the system metadata PDF is technically still written, but it shouldn't create a bottleneck).
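For context, the shape of a no-temp-files download is roughly the following: the zip is written straight to whatever OutputStream goes back to the client, so no copy of the package is staged on disk. The entry names and the members map are placeholders, not the code from the linked PR:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch only: stream the whole package zip to the caller's OutputStream
// (e.g. the HTTP response), so nothing is written to a temp directory.
public class StreamPackageToClient {

    static void writePackage(OutputStream clientOut, Map<String, InputStream> members) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(clientOut)) {
            // A minimal bagit.txt tag file, written straight into the zip.
            zip.putNextEntry(new ZipEntry("bagit.txt"));
            zip.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n"
                    .getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();

            byte[] buf = new byte[8192];
            for (Map.Entry<String, InputStream> member : members.entrySet()) {
                zip.putNextEntry(new ZipEntry("data/" + member.getKey()));
                try (InputStream in = member.getValue()) {
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        zip.write(buf, 0, n);
                    }
                }
                zip.closeEntry();
            }
        }
    }
}
```

Since ZipOutputStream deflates entries by default, the output is also compressed on the way out, which matches the behavior confirmed in the closing comment below.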

ThomasThelen commented 3 years ago

Closing this now that we've confirmed that downloads are being compressed. For large packages, this is a big performance boost.