DataVault / datavault

DataVault Project
MIT License

Performance issue in packager #392

Open gmh04 opened 6 years ago

gmh04 commented 6 years ago

Here are some times for creating bags with a small number of files (generally 1-3 files) on my workstation:

1GB - 00:26
5GB - 01:20
10GB - 02:60
50GB - 20:08
100GB - 42:34

gmh04 commented 6 years ago

A breakdown of the bagInPlace method of BagCreator https://github.com/LibraryOfCongress/bagit-java/blob/v5.0.3/src/main/java/gov/loc/repository/bagit/creator/BagCreator.java#L120-L137 shows that 99% of the time in this function (134 seconds) is spent in createPayloadManifests. That method computes the checksums of the files. For comparison, creating an md5 checksum manually on the command line takes 110 seconds.
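For anyone wanting to reproduce the baseline from Java rather than the command line, here is a minimal sketch that streams a file through the JDK's MessageDigest and times it. The class name Md5Timing and the 64 KiB buffer size are my own choices for illustration; this is not how bagit-java computes its manifests internally, just a rough equivalent of running md5sum on one file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of DataVault or bagit-java.
class Md5Timing {

    // Streams the file through MessageDigest and returns the lowercase hex MD5.
    static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[1 << 16]; // 64 KiB read buffer (arbitrary choice)
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        Path file = Paths.get(args[0]); // path to a payload file
        long start = System.nanoTime();
        String digest = md5Hex(file);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(digest + "  " + file + "  (" + elapsedMs + " ms)");
    }
}
```

Running this against the same payload file used for the bag should give a number comparable to the command-line figure, since both are dominated by reading the file once.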

The real opportunity for improvement here is to remove the unnecessary call to isValid https://github.com/DataVault/datavault/blob/5a2a7e097e20fe39cc3120d7051047c4b39202e0/datavault-worker/src/main/java/org/datavaultplatform/worker/operations/Packager.java#L51 which takes another 130 seconds (it redoes the lengthy checksums!). Any problem in bagInPlace should throw an exception, so we can assume the bag is valid if a Bag object is returned. This would halve the time of createBag.
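To illustrate why dropping the validation step roughly halves the work, here is a self-contained sketch using stand-ins rather than the real bagit-java classes. Everything in it (PackagerSketch, the bagInPlace and isValid stand-ins, the validate flag, the checksumPasses counter) is hypothetical and only mimics the shape of the Packager code: one method that checksums every file, and a validator that checksums every file again.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins; these are NOT the bagit-java or DataVault APIs.
class PackagerSketch {

    // Counts how many times a file was read end to end for checksumming.
    static int checksumPasses = 0;

    static byte[] md5(Path file) throws IOException, NoSuchAlgorithmException {
        checksumPasses++;
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[1 << 16];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buffer)) != -1) {
                md.update(buffer, 0, n);
            }
        }
        return md.digest();
    }

    // Stand-in for bagInPlace: builds the payload manifest, throws on any I/O problem.
    static Map<Path, byte[]> bagInPlace(List<Path> files)
            throws IOException, NoSuchAlgorithmException {
        Map<Path, byte[]> manifest = new LinkedHashMap<>();
        for (Path file : files) {
            manifest.put(file, md5(file));
        }
        return manifest;
    }

    // Stand-in for isValid: recomputes every checksum and compares with the manifest.
    static boolean isValid(Map<Path, byte[]> manifest)
            throws IOException, NoSuchAlgorithmException {
        for (Map.Entry<Path, byte[]> entry : manifest.entrySet()) {
            if (!Arrays.equals(entry.getValue(), md5(entry.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // With validate=false, a returned manifest is trusted because failures throw.
    static Map<Path, byte[]> createBag(List<Path> files, boolean validate)
            throws IOException, NoSuchAlgorithmException {
        Map<Path, byte[]> manifest = bagInPlace(files);
        if (validate && !isValid(manifest)) {
            throw new IllegalStateException("bag failed validation");
        }
        return manifest;
    }
}
```

With validate set to true each file is read twice, and with it set to false each file is read once, which matches the 134 seconds plus 130 seconds seen in the breakdown. The trade-off is that a file changing on disk between manifest creation and a later step would no longer be caught at this point.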