Closed mbjones closed 3 years ago
Original Redmine Comment Author Name: ben leinfelder (ben leinfelder) Original Date: 2017-03-29T16:18:36Z
If you can figure out how to reference files rather than copy them, I FULLY support that! I suppose the MN.getPackage() method just needs to pull directly from the Metacat filesystem and instead of File.createTempFile() we'd use Files.createSymbolicLink() or Files.createLink() as described here: https://docs.oracle.com/javase/tutorial/essential/io/links.html
The one drawback is that we wouldn't get to benefit from the access rule checking that is inherent in the current implementation that just repeatedly calls MN.get() to fetch the bytes of the data.
The real issue is that we have a dependency on the Library of Congress BagIt Library, which only supports files on disk (so no passing streams to it). There's also the BagIt requirement that hashes are made from each file (tag and data manifest files), which requires a full read at some point, so as long as BagIt is in the picture, there's going to be at least one performance hit.
The solution that @csjx and I discussed involved forking the LoC BagIt Library and adding overloads of the existing methods that take streams. I haven't done a deep dive into the LoC code base, but it's a large enough change that it warrants a separate pull request outside of the atLocation changes.
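A stream-taking overload would still have to satisfy BagIt's manifest requirement; one way a fork could do that is to hash the bytes as they pass through, e.g. with `java.security.DigestInputStream`, so the payload write and the checksum happen in a single read. A minimal sketch of that idea (this is not the LoC library's actual API):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StreamHashDemo {
    // Hash a stream while it is being consumed, so a stream-aware bagging
    // library could write the payload file and its manifest entry in one pass.
    static String sha256Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (DigestInputStream dis = new DigestInputStream(in, md)) {
            byte[] buf = new byte[8192];
            while (dis.read(buf) != -1) {
                // In a real bagging pass, the bytes read here would be
                // written into the bag/zip at the same time.
            }
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha256Hex(new ByteArrayInputStream("hello".getBytes())));
    }
}
```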
We already have checksums in place in our system metadata, so you shouldn't technically have to recalculate those to make the BagIt manifests, and it is really inefficient to recalculate them.
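To illustrate why the recalculation is avoidable: a BagIt payload manifest line is just a checksum followed by a relative path, so entries could be emitted straight from stored values. A sketch assuming a hypothetical map of paths to checksums pulled from system metadata (the MD5 shown is only an example value):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ManifestDemo {
    // Build BagIt payload-manifest lines from checksums we already hold
    // (in Metacat these would come from system metadata, not a recompute).
    static String manifest(Map<String, String> checksumsByPath) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : checksumsByPath.entrySet()) {
            // BagIt manifest line format: <checksum> <whitespace> <relative path>
            sb.append(e.getValue()).append("  ").append(e.getKey()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("data/vostok.csv", "d41d8cd98f00b204e9800998ecf8427e"); // example MD5
        System.out.print(manifest(m));
    }
}
```

The catch is that the manifest algorithm would have to match whatever algorithm the stored checksum uses, which is another argument for keeping multiple checksum values in the metadata.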
I've been watching the work on contentid and hash-archive for DataONE packages as well, and I think natively supporting those would be great for DataONE -- in particular, keeping multiple checksum values in our metadata to support retrieval and cross-site replication detection. For example, here's the replica list for the Vostok ice core data file showing its location in many repositories (including some defunct locations):
@ThomasThelen I assume your work on speedbagit will take care of this issue, right?
I think it should. It certainly fixes all of the file copying that's going on. I'm making some last refactors to the unit tests to support Java 8, and then it's good to go (expecting to have Travis working by the end of today).
Actually, I'm going to say that the solution to this problem is to replace the current Metacat download implementation (let's call it v1) with SpeedBagIt. Once that's complete, this should no longer be an issue. From there, hierarchical package work can be included for a functioning v1 & v2 format.
This issue should be fixed in #1486. I think it would be okay to close this issue and revisit it if people still have issues. With the linked PR, the filesystem writes are no longer happening (the system metadata pdf technically still is but it shouldn't be creating a bottleneck).
Closing this now that we've confirmed that downloads are now being compressed. For large packages, this is a big performance boost.
Author Name: Chris Jones (Chris Jones) Original Redmine Issue: 7178, https://projects.ecoinformatics.org/ecoinfo/issues/7178 Original Date: 2017-03-29 Original Assignee: Chris Jones
When users click on the "Download All" button in MetacatUI, we call `MN.getPackage()` to zip up the members, create HTML and PDF metadata, etc. For very large packages, the packaging time is far too long for a decent user experience. To address this, we may need to provide some sort of progress API call that allows the client to get an estimated packaging time and show a progress bar to the user. Also, `getPackage()` uses `File.createTempFile()` to copy contents into a single directory tree for zipping (really BagIt bagging). This doesn't scale well for large packages (GBs). We can explore a few strategies to mitigate this. One that comes to mind is using hard links or symbolic links to the original data files in the directory tree rather than copying them. This needs some thought, but ultimately we need to speed up the packaging process for large packages.