Open gmh04 opened 7 years ago
A breakdown of the `bagInPlace` method of `BagCreator` (https://github.com/LibraryOfCongress/bagit-java/blob/v5.0.3/src/main/java/gov/loc/repository/bagit/creator/BagCreator.java#L120-L137) shows that 99% (134 seconds) of the time in this function is spent in `createPayloadManifests`. This creates the checksums of the files, and it compares with 110 seconds for creating an md5 checksum manually on the command line.
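For a like-for-like comparison with the command line, the per-file work can be approximated with a streaming digest. This is a minimal sketch using only the JDK's `MessageDigest`; it is not bagit-java's own hashing code, and the class and method names (`Md5Timing`, `md5Hex`) are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Timing {

    // Streams the input through MD5 in fixed-size chunks, so memory use
    // stays constant regardless of file size.
    static String md5Hex(InputStream in) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[1 << 16];
            int read;
            while ((read = in.read(buffer)) != -1) {
                md.update(buffer, 0, read);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Md5Timing <file>");
            return;
        }
        Path file = Paths.get(args[0]);
        long start = System.nanoTime();
        String digest;
        try (InputStream in = Files.newInputStream(file)) {
            digest = md5Hex(in);
        }
        long seconds = (System.nanoTime() - start) / 1_000_000_000L;
        // Prints the digest, the file name and the elapsed whole seconds.
        System.out.println(digest + "  " + file + " (" + seconds + " s)");
    }
}
```

Running this against one of the large payload files gives a rough lower bound for a single checksum pass on the same disk.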
Really the place for improvement here is to remove the unnecessary call to `isValid` (https://github.com/DataVault/datavault/blob/5a2a7e097e20fe39cc3120d7051047c4b39202e0/datavault-worker/src/main/java/org/datavaultplatform/worker/operations/Packager.java#L51), which is taking another 130 seconds (it is redoing the lengthy checksums!). Any issue in `bagInPlace` should throw an exception, so we can assume the bag is valid if a `Bag` object is returned. This would halve the time of `createBag`.
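To make the cost concrete, here is a small, self-contained sketch of the idea. It does not use bagit-java: `fakeBagInPlace`, `fakeIsValid` and `CountingHasher` are hypothetical stand-ins, and `String.hashCode()` stands in for the real checksum. It only counts how many times each file gets hashed with and without the validation step.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SkipRevalidationSketch {

    // Counts how often content is hashed; stands in for the expensive
    // checksum pass over the payload.
    static final class CountingHasher {
        int passes = 0;

        int checksum(String content) {
            passes++;
            return content.hashCode();
        }
    }

    // Hypothetical stand-in for bagInPlace: hashes every file once and
    // throws if it cannot build a manifest.
    static Map<String, Integer> fakeBagInPlace(Map<String, String> files, CountingHasher hasher) {
        if (files.isEmpty()) {
            throw new IllegalArgumentException("no payload files to bag");
        }
        Map<String, Integer> manifest = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : files.entrySet()) {
            manifest.put(e.getKey(), hasher.checksum(e.getValue()));
        }
        return manifest;
    }

    // Hypothetical stand-in for isValid: re-hashes every file and compares
    // the result with the manifest entry.
    static boolean fakeIsValid(Map<String, Integer> manifest, Map<String, String> files,
                               CountingHasher hasher) {
        for (Map.Entry<String, Integer> e : manifest.entrySet()) {
            String content = files.get(e.getKey());
            if (content == null || hasher.checksum(content) != e.getValue()) {
                return false;
            }
        }
        return true;
    }

    // Returns the number of hash passes for one "create bag" run.
    static int hashPasses(Map<String, String> files, boolean validateAfterCreate) {
        CountingHasher hasher = new CountingHasher();
        Map<String, Integer> manifest = fakeBagInPlace(files, hasher);
        if (validateAfterCreate && !fakeIsValid(manifest, files, hasher)) {
            throw new IllegalStateException("bag failed validation");
        }
        return hasher.passes;
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.dat", "first");
        files.put("b.dat", "second");
        files.put("c.dat", "third");
        System.out.println("with validation:    " + hashPasses(files, true) + " hash passes");
        System.out.println("without validation: " + hashPasses(files, false) + " hash passes");
    }
}
```

With three files this reports six hash passes when validation follows creation and three when it is skipped, which mirrors the roughly doubled wall-clock time measured above.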
Here are some times for creating bags with a small number of files (generally 1-3 files) on my workstation:
1GB - 00:26
5GB - 01:20
10GB - 02:60
50GB - 20:08
100GB - 42:34