During preprocessing, this PR checks whether a bag already exists in AP Trust and in the Wasabi S3 bucket. If the item version has already been preserved, the hash of the current item version being prepared for bagging is compared with that version's hash in AP Trust. If the hashes match, the article version is skipped; otherwise its bag is updated. All activities are logged.
NOTE: Because metadata is sorted while the metadata hash is computed, this feature may sometimes put a name other than the first author's name in the eventual preservation package file.
PROPOSED SOLUTION: Ignore the authors' list when sorting metadata for the hash computation. This is not included in this PR.
See #93
Documentation Update
[X] I have updated README.md and other relevant documentation
[ ] No documentation update is needed
Implementation Notes
This PR includes Utils.py in the figshare directory, which houses the following utility functions:
standardize_api_result: Standardizes results from APIs. It replaces 'null' with an empty string so that empty fields always have the same value in API results.
sorter_api_result: Recursively sorts results from APIs. Sorting gives the fields a consistent arrangement, so the same result always produces the same hash.
get_preserved_version_hash_and_size: Connects to AP Trust and extracts the hash and size of an already preserved article version, if one exists.
compare_hash: Compares the hash of the current article version in preprocessing with the hash of that article version in AP Trust.
check_wasabi: Checks whether an article version has already been preserved in the Wasabi S3 bucket.
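As a rough illustration of how standardizing and sorting lead to a stable hash, here is a minimal sketch. It is not the code in Utils.py: the function names `standardize`, `sort_recursively` and `metadata_hash` are hypothetical, SHA-256 is only a placeholder for whatever algorithm the PR uses, and treating both JSON null (`None`) and the string 'null' as empty is an assumption.

```python
import hashlib
import json


def standardize(value):
    """Recursively replace None and the string 'null' with an empty string (assumed behavior)."""
    if value is None or value == "null":
        return ""
    if isinstance(value, dict):
        return {k: standardize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [standardize(v) for v in value]
    return value


def sort_recursively(value):
    """Sort dict keys and list items so equal content always has the same layout."""
    if isinstance(value, dict):
        return {k: sort_recursively(value[k]) for k in sorted(value)}
    if isinstance(value, list):
        items = [sort_recursively(v) for v in value]
        # Sort by JSON text so nested or mixed-type items can be compared safely.
        return sorted(items, key=lambda v: json.dumps(v, sort_keys=True))
    return value


def metadata_hash(api_result):
    """Hash the standardized, sorted result; SHA-256 is a placeholder choice."""
    canonical = json.dumps(sort_recursively(standardize(api_result)), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

In this sketch list items are sorted as well as dictionary keys, so a list of authors would be reordered alphabetically. That is the kind of side effect described in the NOTE above.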
Bag checks are carried out in Article.py and Collection.py inside the figshare directory. Logging is done in app.py.
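The skip-or-update decision made during a bag check could look roughly like the sketch below. The function `decide_action`, its parameters, the logger name and the returned labels are all hypothetical. The Wasabi check is left out because this description does not say how its result is combined with the AP Trust comparison.

```python
import logging
from typing import Optional

logger = logging.getLogger("preprocessing_sketch")  # hypothetical logger name


def decide_action(article_id: int, version: int,
                  current_hash: str, preserved_hash: Optional[str]) -> str:
    """Return 'bag', 'skip' or 'update' for one article version.

    preserved_hash is None when no preserved copy was found (an assumption
    about how a missing version would be represented).
    """
    if preserved_hash is None:
        logger.info("Article %s v%s: not preserved yet, bagging.", article_id, version)
        return "bag"
    if preserved_hash == current_hash:
        logger.info("Article %s v%s: hash matches, skipping.", article_id, version)
        return "skip"
    logger.info("Article %s v%s: hash differs, updating bag.", article_id, version)
    return "update"
```

Returning a label instead of acting directly keeps the decision easy to log and to test on its own.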