Compare an accruals location to an AIP store

This commit introduces an accruals->aips comparison capability.
Digital objects in an accruals folder can now be compared to the
contents of an AIP store.
Where file paths, checksums, and dates all match, the object is
considered identical (a true duplicate). Where they don't, users
can use modulo (%) to identify where the object is not in fact
identical.
Much of the benefit of this work is derived from the nature of the
AIP structure imposed on a digital transfer.
Once the comparison is complete, three reports are output in CSV
format:

* True-duplicates.
* Near-duplicates (checksums match, but other components might not).
* Non-duplicates.

Additionally, a summary report is output in JSON.
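
The three categories above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this commit: the `FileRecord` class, its field names, and the `classify` function are hypothetical, and the real comparison may weigh other components as well.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical record for one digital object (field names are assumptions)."""
    filepath: str
    checksum: str
    date: str


def classify(accrual: FileRecord, aip_files: Sequence[FileRecord]) -> str:
    """Return 'true_duplicate', 'near_duplicate', or 'non_duplicate'."""
    # Only AIP entries sharing the checksum are candidates for any kind of match.
    candidates = [f for f in aip_files if f.checksum == accrual.checksum]
    if not candidates:
        return "non_duplicate"
    # A true duplicate also agrees on file path and date.
    for candidate in candidates:
        if candidate.filepath == accrual.filepath and candidate.date == accrual.date:
            return "true_duplicate"
    # The checksum matches, but the path or the date differs.
    return "near_duplicate"
```

Checking the checksum first means a file that has only been renamed or re-dated still surfaces as a near-duplicate rather than as a new object.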
Connected to archivematica/issues#448
Configuration
API configuration and the transfer source location are set via this configuration file. Note that the "accruals_transfer_source" parameter describes a transfer source in the storage service with the Description 'accruals', but it could equally be any other value more appropriate to your institution.
The primary script will also accept a value for this transfer source on the command line, e.g.

    python3 -m duplicates.accruals <my_transfer_source_description>
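
One way a command-line value might take precedence over the configured one is sketched below. This is an assumption about the behaviour, not the script's actual implementation: `resolve_transfer_source` is a hypothetical helper, and only the "accruals_transfer_source" key comes from the configuration described above.

```python
import json


def resolve_transfer_source(config_text, argv):
    """Pick the transfer source description: command line first, then config."""
    config = json.loads(config_text)
    # argv[0] is the program name; a first positional argument, if present,
    # overrides the value held in the configuration.
    if len(argv) > 1:
        return argv[1]
    # Fall back to the configured description (e.g. "accruals").
    return config["accruals_transfer_source"]
```

Called with `sys.argv` and the text of the configuration file, this returns the description to look up in the storage service.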
With everything configured correctly, successful output on the command line may look as follows:

    $ python3 -m duplicates.accruals
    INFO 2019-07-08 17:11:16 duplicates.py:171 No result for algorithm: md5
    INFO 2019-07-08 17:11:16 duplicates.py:171 No result for algorithm: sha1
    INFO 2019-07-08 17:11:17 duplicates.py:86 Filtering: data/METS.8a8e1cc5-82ec-491a-8bda-cc7d0223553f.xml
    INFO 2019-07-08 17:11:17 duplicates.py:86 Filtering: data/README.html
    INFO 2019-07-08 17:11:17 duplicates.py:86 Filtering: data/logs/fileFormatIdentification.log
    INFO 2019-07-08 17:11:17 duplicates.py:86 Filtering: data/logs/filenameCleanup.log
    INFO 2019-07-08 17:11:17 duplicates.py:82 Filtering: data/logs/transfers/1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/logs/fileFormatIdentification.log
    INFO 2019-07-08 17:11:17 duplicates.py:82 Filtering: data/logs/transfers/1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/logs/filenameCleanup.log
    INFO 2019-07-08 17:11:17 duplicates.py:82 Filtering: data/objects/metadata/transfers/1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/directory_tree.txt
    INFO 2019-07-08 17:11:17 duplicates.py:82 Filtering: data/objects/submissionDocumentation/transfer-1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/METS.xml
    INFO 2019-07-08 17:11:17 duplicates.py:171 No result for algorithm: sha512
    INFO 2019-07-08 17:11:17 serialize_to_csv.py:41 Number of files in '1' AIPs in the AIP store: 4
    INFO 2019-07-08 17:11:17 serialize_to_csv.py:44 Number of transfers: 3
    INFO 2019-07-08 17:11:17 serialize_to_csv.py:47 Number of items in transfer 1: 4
    INFO 2019-07-08 17:11:17 serialize_to_csv.py:47 Number of items in transfer 2: 5
    INFO 2019-07-08 17:11:17 serialize_to_csv.py:47 Number of items in transfer 3: 2
    {
        "count_of_files_across_aips": 4,
        "files_in_transfer-1": 4,
        "files_in_transfer-2": 5,
        "files_in_transfer-3": 2,
        "number_of_aips": 1,
        "numer_of_transfers": 3
    }
    ERROR 2019-07-08 17:11:17 serialize_to_csv.py:84 Outputting report to: true_duplicates_comparison.csv
    ERROR 2019-07-08 17:11:17 serialize_to_csv.py:118 Outputting report to: near_matches_comparison.csv
    ERROR 2019-07-08 17:11:17 serialize_to_csv.py:141 Outputting report to: non_matches_list.csv

The CSV files output as a result can then be used to compile a list of files specifically selected to be transferred into Archivematica.
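
As a rough illustration of that last step, the sketch below collects one column from a report so the values can be reviewed before a transfer. The column name `path` and the sample rows are assumptions for the example; check the header row of the generated CSV files for the real column names.

```python
import csv
import io


def paths_from_report(report_file, column="path"):
    """Collect one column from a CSV report (the column name is an assumption)."""
    reader = csv.DictReader(report_file)
    # Skip rows where the chosen column is missing or empty.
    return [row[column] for row in reader if row.get(column)]


# Stand-in for an opened report such as non_matches_list.csv; the header
# and rows here are invented for the example.
sample = io.StringIO("path,checksum\nobjects/a.txt,abc\nobjects/b.txt,def\n")
selected = paths_from_report(sample)
print(selected)
```

With a real report, pass a file opened with `open("non_matches_list.csv", newline="")` in place of the in-memory sample.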