crkn-rcdr / Digital-Preservation

Documentation and related schemas for the CRKN digital preservation system

Bad replication not detected by older Archive::BagIt #13

Closed RussellMcOrmond closed 3 years ago

RussellMcOrmond commented 4 years ago

Andreas Romeyke has taken over maintenance of https://metacpan.org/pod/Archive::BagIt

I have been working to integrate his work into our processes https://github.com/RussellMcOrmond/Archive-BagIt

I've detected a problem where older bags that have a manifest-crc32.txt are actually invalid, because more recent versions of Archive::BagIt have never touched the manifest-crc32.txt files. These crc32 manifests list only the older files and none of the newer ones.
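A rough way to see which payload files a stale manifest fails to list is sketched below. This is not part of Archive::BagIt or the tdr tooling; the function name `unlisted_payload_files` is made up for illustration, and it assumes the usual "checksum, whitespace, path" manifest line layout.

```python
from pathlib import Path


def unlisted_payload_files(bag_dir, manifest_name="manifest-crc32.txt"):
    """Return payload paths under data/ that the given manifest does not list."""
    bag = Path(bag_dir)
    listed = set()
    for line in (bag / manifest_name).read_text(encoding="utf-8").splitlines():
        # Assumed layout: "<checksum> <path>"; the path itself may contain spaces.
        parts = line.split(None, 1)
        if len(parts) == 2:
            listed.add(parts[1].strip())
    # Every regular file below data/, relative to the bag root, with "/" separators.
    payload = {
        p.relative_to(bag).as_posix()
        for p in (bag / "data").rglob("*")
        if p.is_file()
    }
    return sorted(payload - listed)
```

A non-empty result for a bag means the manifest predates some of the payload, which is the situation described above.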

We will need to devise a process to fix the bags within the repository.

RussellMcOrmond commented 4 years ago

Not all bags with manifest-crc32.txt files are invalid. It might be that only specific versions of BagIt had this bug, or it may only relate to bags that were updated.

tdr@romano-repomanage:~$ ls -l /cihmz1/repository/aip/oocihm/526/oocihm.N_00022_19021204/
total 27
-rw-rw-r-- 1 tdr tdr  327 Jun 15  2017 bag-info.txt
-rw-rw-r-- 1 tdr tdr   54 Jun 15  2017 bagit.txt
drwxrwxr-x 4 tdr tdr    5 Jun 15  2017 data
-rw-rw-r-- 1 tdr tdr 1279 Jun 15  2017 manifest-crc32.txt
-rw-rw-r-- 1 tdr tdr 1895 Jun 15  2017 manifest-md5.txt
tdr@romano-repomanage:~$ bagit.pl verify /cihmz1/repository/aip/oocihm/526/oocihm.N_00022_19021204/
PASS: /cihmz1/repository/aip/oocihm/526/oocihm.N_00022_19021204/
tdr@romano-repomanage:~$ bagit.pl verify --fast /cihmz1/repository/aip/oocihm/526/oocihm.N_00022_19021204/               
PASS: /cihmz1/repository/aip/oocihm/526/oocihm.N_00022_19021204/
tdr@romano-repomanage:~$

RussellMcOrmond commented 4 years ago

The overnight bag check on Romano found additional invalid bags:

oocihm.78214
oop.proc_SDC_1403_1
oop.proc_SDC_1802_1

Looking these up, all 3 were bagged in January 2020. It is possible that the problem is only with a recent version of BagIt, and not a problem that has existed for very long.

RussellMcOrmond commented 4 years ago

tdr@romano-repomanage:~$ bagit.pl verify --fast `tdr find oocihm.78214`
FAIL: /cihmz/repository/aip/oocihm/488/oocihm.78214 :  file: bag-info.txt invalid, digest (md5) calculated=c14828850ba595cfdf9ac86c4408c299, but expected=fa6ec4f664356bc05870b7f7b5658b8c in file '/cihmz/repository/aip/oocihm/488/oocihm.78214/tagmanifest-md5.txt' at /usr/local/share/perl/5.28.1/Archive/BagIt/Base.pm line 697.

tdr@romano-repomanage:~$ bagit.pl verify --fast `tdr find oop.proc_SDC_1403_1`
FAIL: /cihmz2/repository/aip/oop/642/oop.proc_SDC_1403_1 :  file: bag-info.txt invalid, digest (md5) calculated=dc1b636f17f83c14caf4a11971a0b348, but expected=07c7c37ac4d631e3fedc199b861eac98 in file '/cihmz2/repository/aip/oop/642/oop.proc_SDC_1403_1/tagmanifest-md5.txt' at /usr/local/share/perl/5.28.1/Archive/BagIt/Base.pm line 697.

tdr@romano-repomanage:~$ bagit.pl verify --fast `tdr find oop.proc_SDC_1802_1`
FAIL: /cihmz2/repository/aip/oop/292/oop.proc_SDC_1802_1 :  file: bag-info.txt invalid, digest (md5) calculated=fa6ec4f664356bc05870b7f7b5658b8c, but expected=ba04cd6e6bf36101eec347242cb7965e in file '/cihmz2/repository/aip/oop/292/oop.proc_SDC_1802_1/tagmanifest-md5.txt' at /usr/local/share/perl/5.28.1/Archive/BagIt/Base.pm line 697.

tdr@romano-repomanage:~$

RussellMcOrmond commented 4 years ago

I have looked, but can't figure out what would call https://github.com/RussellMcOrmond/Archive-BagIt/blob/60404de49ec5c74fc7368dcd1acd3b2a9881657b/lib/Archive/BagIt.pm#L172 sub _manifest_crc32 {}

russell@russell-XPS-13-7390:~/git/Archive-BagIt$ find . -type f -exec grep crc32 {} /dev/null \;
./lib/Archive/BagIt.pm:sub _manifest_crc32 {
./lib/Archive/BagIt.pm:    my $manifest_file = "$bagit/manifest-crc32.txt";
./lib/Archive/BagIt.pm:    open(my $fh, ">:encoding(utf8)",$manifest_file) or die("Cannot create manifest-crc32.txt: $!\n");
./lib/Archive/BagIt.pm:                my $digest = sprintf("%010d",crc32($DATA));
russell@russell-XPS-13-7390:~/git/Archive-BagIt$ 

All our calls for creating bags are of the form: Archive::BagIt->make_bag($string);

https://github.com/RussellMcOrmond/Archive-BagIt/blob/60404de49ec5c74fc7368dcd1acd3b2a9881657b/lib/Archive/BagIt.pm#L132 make_bag() doesn't read or write *-crc32.txt files, so any that previously existed won't be updated or cleaned up when a bag is updated.

Looking at the source, I'm now confused about how manifest-crc32.txt exists in /cihmz1/repository/aip/oocihm/526/oocihm.N_00022_19021204/ (created in 2017) while tagmanifest-md5.txt is missing, given that _tagmanifest_md5() was added to Archive::BagIt in April 2013.

As noted above, the verify error seems to relate to when the bag-info.txt file is stored (and which md5 it has), rather than the existence of the unused *-crc32.txt files.

RussellMcOrmond commented 4 years ago

It seems multiple problems are being detected. The new BagIt checks tagmanifest-md5.txt in a way that the previous version didn't, which exposes replication problems that previously went undetected.
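The kind of check involved can be sketched roughly as follows: recompute the MD5 of every file listed in tagmanifest-md5.txt and report any that differ. This is an illustration only, not the Archive::BagIt implementation; `check_tagmanifest` is a hypothetical name and the line layout is assumed to be "digest, whitespace, path".

```python
import hashlib
from pathlib import Path


def check_tagmanifest(bag_dir, tagmanifest_name="tagmanifest-md5.txt"):
    """Return (path, expected, calculated) for every tag file whose MD5 differs."""
    bag = Path(bag_dir)
    mismatches = []
    for line in (bag / tagmanifest_name).read_text(encoding="utf-8").splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected, rel_path = parts[0].lower(), parts[1].strip()
        target = bag / rel_path
        # A listed file that is missing is reported with calculated=None.
        calculated = (
            hashlib.md5(target.read_bytes()).hexdigest() if target.is_file() else None
        )
        if calculated != expected:
            mismatches.append((rel_path, expected, calculated))
    return mismatches
```

An outdated bag-info.txt left behind by a bad replication would show up here as a single mismatch, much like the FAIL lines quoted earlier in this thread.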

I checked this AIP upstream from Romano and didn't see the same outdated version of the bag-info.txt file, so I forced it to replicate again. Then the bag verified without problem with the correctly dated file.

tdr@romano-repomanage:~$ bagit.pl verify --fast `tdr find oop.proc_SDC_1802_1`
PASS: /cihmz1/repository/aip/oop/292/oop.proc_SDC_1802_1
tdr@romano-repomanage:~$ ls -l /cihmz1/repository/aip/oop/292/oop.proc_SDC_1802_1
total 67
-rw-r--r-- 1 tdr tdr   109 Jan 10  2020 bag-info.txt
-rw-r--r-- 1 tdr tdr    54 Jan 10  2020 bagit.txt
drwxr-xr-x 3 tdr tdr     4 Jul 14 10:22 data
-rw-r--r-- 1 tdr tdr 74434 Jan 10  2020 manifest-md5.txt
-rw-r--r-- 1 tdr tdr   142 Jan 10  2020 tagmanifest-md5.txt
tdr@romano-repomanage:~$

RussellMcOrmond commented 4 years ago

Recording this:

tdr@romano-repomanage:~$ bagit.pl verify `tdr find oocihm.N_00693_18921012`
PASS: /cihmz2/repository/aip/oocihm/618/oocihm.N_00693_18921012
tdr@romano-repomanage:~$ bagit.pl verify --fast `tdr find oocihm.N_00693_18921012`
FAIL: /cihmz2/repository/aip/oocihm/618/oocihm.N_00693_18921012 :  file: manifest-md5.txt invalid, digest (md5) calculated=6beca68c48b9d37e18666e06a3bdbad6, but expected=ee6a183926af6023f68f328f4697d06f in file '/cihmz2/repository/aip/oocihm/618/oocihm.N_00693_18921012/tagmanifest-md5.txt' at /usr/local/share/perl/5.28.1/Archive/BagIt/Base.pm line 697.

tdr@romano-repomanage:~$

RussellMcOrmond commented 4 years ago

I've pushed a repomanage Docker image that has the new Archive::BagIt to all the ZFS nodes that use it for validation, so we will see if other nodes have the same problem. The evidence at the moment is that the validation failure doesn't relate to the incorrect manifest-crc32.txt file (which is never checked), but to the old validation not checking tagmanifest-md5.txt. This led to a mismatch in any of the files in the root (such as bag-info.txt or tagmanifest-md5.txt itself) not being noticed during validation.

There were replication errors that weren't detected by validation.

RussellMcOrmond commented 4 years ago

Detected a bad tagmanifest-md5.txt on Swift for oocihm.N_00700_19130416. I updated the tagmanifest-md5.txt to be correct, and successfully updated the AIP on Romano. This problem will eventually be detected on all other servers as well (a full validation takes almost 2 months).

RussellMcOrmond commented 4 years ago

Validation is periodically finding problem bags. I have been forcing a re-copy as follows:

russell@toma:~$ docker exec -it repomanage bash
root@toma-repomanage:/home/tdr# sudo -u tdr -i bash
tdr@toma-repomanage:~$ tdr find oocihm.8_04989
/cihmz/repository/aip/oocihm/830/oocihm.8_04989
tdr@toma-repomanage:~$ ls -l /cihmz/repository/aip/oocihm/830/oocihm.8_04989
total 51
-rw-rw-r-- 1 tdr tdr 109 Feb 12  2018 bag-info.txt
-rw-rw-r-- 1 tdr tdr  54 Dec 19  2019 bagit.txt
drwxrwxr-x 4 tdr tdr   6 Dec 19  2019 data
-rw-rw-r-- 1 tdr tdr 164 Nov  5  2014 manifest-crc32.txt
-rw-rw-r-- 1 tdr tdr 745 Dec 19  2019 manifest-md5.txt
-rw-rw-r-- 1 tdr tdr 195 Dec 19  2019 tagmanifest-md5.txt
tdr@toma-repomanage:~$ echo "a" >>/cihmz/repository/aip/oocihm/830/oocihm.8_04989/manifest-md5.txt 
tdr@toma-repomanage:~$ cat /etc/cron.d/repomanage 
MAILTO=sysadmin@c7a.ca
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/home/tdr/CIHM-TDR/bin:/home/tdr/CIHM-Swift/bin
PERL5LIB=/home/tdr/CIHM-TDR/lib:/home/tdr/CIHM-Swift/lib
# Repository Validation each evening
47 16 * * * tdr /bin/bash -c "date ; tdr verify --timelimit=43200 --maxprocs=8 ; date ; tdr walk ; date"
# Empty the trashcans every 6 hours
34 5,11,17,22 * * * tdr /bin/bash -c "find /cihmz*/repository/trashcan/ -mindepth 1 -maxdepth 1 -mmin +360 -exec rm -rf {} \;"
# Replication check every 10 minutes (find work and put in queue, then run rsync to add to repository)
*/10 * * * * tdr /bin/bash -c "tdr-replicationwork ; tdr-swiftreplicate --fromswift"
tdr@toma-repomanage:~$ tdr-replicationwork ; tdr-swiftreplicate --fromswift
tdr@toma-repomanage:~$ tdr find oocihm.8_04989
/cihmz1/repository/aip/oocihm/830/oocihm.8_04989
tdr@toma-repomanage:~$ ls -l /cihmz1/repository/aip/oocihm/830/oocihm.8_04989
total 35
-rw-r--r-- 1 tdr tdr 109 Dec 19  2019 bag-info.txt
-rw-r--r-- 1 tdr tdr  54 Dec 19  2019 bagit.txt
drwxr-xr-x 4 tdr tdr   6 Aug 11 09:36 data
-rw-r--r-- 1 tdr tdr 164 Nov  5  2014 manifest-crc32.txt
-rw-r--r-- 1 tdr tdr 745 Dec 19  2019 manifest-md5.txt
-rw-r--r-- 1 tdr tdr 195 Dec 19  2019 tagmanifest-md5.txt
tdr@toma-repomanage:~$ 

After the "echo" to change the md5 of the manifest, and before forcing a run of the replication, I went to http://iris.tor.c7a.ca:5984/_utils/document.html?tdrepo/oocihm.8_04989%7Citem_repository.toma and removed the "manifest date" and "manifest md5" fields, and added a "replicate" field set to true (a boolean, not the string "true").
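For the record, the document edit amounts to something like the sketch below. It only transforms a dictionary and does not talk to CouchDB. The field names are taken as quoted above, the function name `mark_for_replication` is made up, and the sample values in the usage lines are placeholders rather than real data.

```python
import json


def mark_for_replication(doc):
    """Return a copy of a repository document flagged for re-replication."""
    # Drop the cached manifest details so the stored copy is no longer trusted.
    updated = {
        key: value
        for key, value in doc.items()
        if key not in ("manifest date", "manifest md5")
    }
    # Must serialize as the JSON boolean true, not the string "true".
    updated["replicate"] = True
    return updated


# Placeholder document; the real one carries additional fields and a _rev.
example = {
    "_id": "oocihm.8_04989|item_repository.toma",
    "manifest date": "placeholder",
    "manifest md5": "placeholder",
}
print(json.dumps(mark_for_replication(example), indent=2))
```

The next run of tdr-replicationwork should then pick the AIP up, as in the transcript above.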

The difference between the new and old BagIt verification is that tagmanifest-md5.txt is now properly checked, so any file in the bag's root directory that doesn't match (in the above case, bag-info.txt) is noticed.

If we were keeping these bags long-term, I would upgrade our scripts to clean out manifest-* files before updating the bag. As it is, an outdated manifest-crc32.txt from a much older version of Archive::BagIt is left in place and not updated, as the new library doesn't support crc32 manifests.
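A cleanup step along those lines might start by listing the manifests that the current library would leave untouched. The sketch below is only an illustration: `stale_manifests` is a hypothetical helper, and the set of supported algorithms is an assumption (md5 only, since that is what the bags in this thread carry). It reports files rather than deleting them.

```python
from pathlib import Path

# Assumption: only md5 manifests are maintained when a bag is updated.
SUPPORTED_ALGORITHMS = frozenset({"md5"})


def stale_manifests(bag_dir, supported=SUPPORTED_ALGORITHMS):
    """List manifest and tagmanifest files whose algorithm is not in `supported`."""
    bag = Path(bag_dir)
    stale = []
    for pattern in ("manifest-*.txt", "tagmanifest-*.txt"):
        for path in bag.glob(pattern):
            # "manifest-crc32.txt" has the stem "manifest-crc32", so this yields "crc32".
            algorithm = path.stem.split("-", 1)[1]
            if algorithm not in supported:
                stale.append(path.name)
    return sorted(stale)
```

Run against a bag like oocihm.8_04989 above, this would be expected to report only manifest-crc32.txt, which could then be removed before the bag is rewritten.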