artefactual / automation-tools

Tools to aid automation of Archivematica and AtoM.
GNU Affero General Public License v3.0

WIP: Enable duplicate detection via bag manifests #118

Open ross-spencer opened 5 years ago

ross-spencer commented 5 years ago

Compare an accruals location to an AIP store

This commit introduces an accruals->aips comparison capability.

Digital objects in an accruals folder can now be compared to the contents of an AIP store.

Where filepaths, checksums, and dates all match, the object is considered identical (a true duplicate). Where they do not, users can use modulo (%) to identify where the object is not in fact identical.
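The matching rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the `DigitalObject` tuple, the `classify` function, and the rule that a checksum match with a differing path or date counts as a near match are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record type; the PR's own data model may differ.
DigitalObject = namedtuple("DigitalObject", "filepath checksum date")


def classify(accrual, aip_objects):
    """Classify one accrual object against the objects found in the AIP store.

    Returns a (label, differing_fields) tuple. Assumed rule: a checksum
    match is required for any kind of match; filepath and date then decide
    between a true duplicate and a near match.
    """
    candidates = [obj for obj in aip_objects if obj.checksum == accrual.checksum]
    if not candidates:
        return "non_match", []
    for obj in candidates:
        if obj.filepath == accrual.filepath and obj.date == accrual.date:
            return "true_duplicate", []
    # Checksum matches, but the path and/or date differ: report which ones,
    # using the first candidate as the point of comparison.
    first = candidates[0]
    differing = [
        field
        for field in ("filepath", "date")
        if getattr(first, field) != getattr(accrual, field)
    ]
    return "near_match", differing
```

Reporting the differing fields alongside the label is one way to show a user why an object was not treated as a true duplicate.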

Much of the benefit of this work is derived from the nature of the AIP structure imposed on a digital transfer.
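Because an AIP is packaged as a bag, its manifest already lists a checksum and path for every file. A minimal sketch of reading such a manifest and keeping only the original objects might look like the following. The function names and the filter prefixes are assumptions; the prefixes are inferred from the "Filtering:" lines in the sample output further down, not taken from the PR's code.

```python
# Paths under data/objects/ that hold generated material rather than the
# user's original objects. Assumption: inferred from the sample log output.
FILTERED_PREFIXES = (
    "data/objects/metadata/",
    "data/objects/submissionDocumentation/",
)


def parse_manifest(text):
    """Return {filepath: checksum} from the text of a BagIt manifest.

    Each non-empty manifest line is a checksum, whitespace, then a filepath.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, filepath = line.split(None, 1)
        entries[filepath] = checksum
    return entries


def original_objects(entries):
    """Keep only entries that look like original digital objects."""
    return {
        filepath: checksum
        for filepath, checksum in entries.items()
        if filepath.startswith("data/objects/")
        and not filepath.startswith(FILTERED_PREFIXES)
    }
```

Filtering on path prefixes is only possible because the AIP layout is predictable, which is the benefit the paragraph above refers to.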

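Once the comparison is complete, three reports are output in CSV format: true duplicates (true_duplicates_comparison.csv), near matches (near_matches_comparison.csv), and non-matches (non_matches_list.csv).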

Additionally, a summary report is output in JSON.
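As a rough illustration of how such a summary could be assembled, the sketch below builds a JSON document with the same keys that appear in the sample output in this PR (including the `numer_of_transfers` spelling shown there). The `summarize` function and its input shape, lists of file paths per AIP and per transfer, are assumptions for the example and not the PR's implementation.

```python
import json


def summarize(aips, transfers):
    """Return a JSON summary string.

    aips: list of AIPs, each a list of file paths (assumed shape).
    transfers: list of transfers, each a list of file paths (assumed shape).
    """
    summary = {
        "count_of_files_across_aips": sum(len(files) for files in aips),
        "number_of_aips": len(aips),
        # Key spelled as it appears in the PR's sample output.
        "numer_of_transfers": len(transfers),
    }
    # One entry per transfer, numbered from 1.
    for index, files in enumerate(transfers, start=1):
        summary["files_in_transfer-{}".format(index)] = len(files)
    return json.dumps(summary, indent=4, sort_keys=True)
```

Sorting the keys keeps the report stable between runs, which makes it easier to compare summaries over time.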

Connected to archivematica/issues#448

Configuration

The API configuration and the transfer source location are set via this configuration file. Note that the "accruals_transfer_source" parameter describes a transfer source in the Storage Service with the description 'accruals', but it could equally be any other value more appropriate to your institution.

The primary script will also accept a value for this transfer source on the command line, e.g.

With everything configured correctly, successful output on the command line may look as follows:

$ python3 -m duplicates.accruals
INFO      2019-07-08 17:11:16 duplicates.py:171  No result for algorithm: md5
INFO      2019-07-08 17:11:16 duplicates.py:171  No result for algorithm: sha1
INFO      2019-07-08 17:11:17 duplicates.py:86   Filtering: data/METS.8a8e1cc5-82ec-491a-8bda-cc7d0223553f.xml
INFO      2019-07-08 17:11:17 duplicates.py:86   Filtering: data/README.html
INFO      2019-07-08 17:11:17 duplicates.py:86   Filtering: data/logs/fileFormatIdentification.log
INFO      2019-07-08 17:11:17 duplicates.py:86   Filtering: data/logs/filenameCleanup.log
INFO      2019-07-08 17:11:17 duplicates.py:82   Filtering: data/logs/transfers/1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/logs/fileFormatIdentification.log
INFO      2019-07-08 17:11:17 duplicates.py:82   Filtering: data/logs/transfers/1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/logs/filenameCleanup.log
INFO      2019-07-08 17:11:17 duplicates.py:82   Filtering: data/objects/metadata/transfers/1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/directory_tree.txt
INFO      2019-07-08 17:11:17 duplicates.py:82   Filtering: data/objects/submissionDocumentation/transfer-1-465cca06-e0e2-4ee9-b2b6-ecc7feeed73a/METS.xml
INFO      2019-07-08 17:11:17 duplicates.py:171  No result for algorithm: sha512
INFO      2019-07-08 17:11:17 serialize_to_csv.py:41   Number of files in '1' AIPs in the AIP store: 4
INFO      2019-07-08 17:11:17 serialize_to_csv.py:44   Number of transfers: 3
INFO      2019-07-08 17:11:17 serialize_to_csv.py:47   Number of items in transfer 1: 4
INFO      2019-07-08 17:11:17 serialize_to_csv.py:47   Number of items in transfer 2: 5
INFO      2019-07-08 17:11:17 serialize_to_csv.py:47   Number of items in transfer 3: 2
{
    "count_of_files_across_aips": 4,
    "files_in_transfer-1": 4,
    "files_in_transfer-2": 5,
    "files_in_transfer-3": 2,
    "number_of_aips": 1,
    "numer_of_transfers": 3
}
ERROR     2019-07-08 17:11:17 serialize_to_csv.py:84   Outputting report to: true_duplicates_comparison.csv
ERROR     2019-07-08 17:11:17 serialize_to_csv.py:118  Outputting report to: near_matches_comparison.csv
ERROR     2019-07-08 17:11:17 serialize_to_csv.py:141  Outputting report to: non_matches_list.csv

The CSV files output as a result can then be used to compile a list of files specifically selected to be transferred into Archivematica.