crkn-rcdr / Digital-Preservation

Documentation and related schemas for the CRKN digital preservation system

CRKN repository analysis #9

Closed RussellMcOrmond closed 4 years ago

RussellMcOrmond commented 5 years ago

A document describing the repository analysis was created. This issue will be used for tracking the steps for answering the questions.

RussellMcOrmond commented 5 years ago

http://workflow.canadiana.ca/demo/METScount.html and http://workflow.canadiana.ca/demo/metsdupmd5.html have been set up to allow staff to view AIPs with specific numbers of METS records, and AIPs which have duplicate MD5s for their METS records.

RussellMcOrmond commented 5 years ago

As of today, there are 6180 AIPs where there are revision files that aren't duplicates of files within the SIP.

Running "walkmd5" found that only 67 of these AIPs have revisions that are in one of the other SIPs of the 11092 AIPs with revision files.
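The "walkmd5" tool itself is not shown in this thread. As a rough illustration of the kind of check described above, here is a minimal Python sketch, assuming the MD5s of each AIP's SIP files and revision files are already available as plain dictionaries. The function names and the input shape are hypothetical and not taken from the CRKN code.

```python
from collections import defaultdict


def index_sip_md5s(sip_files):
    """Map each MD5 to the set of AIP IDs whose SIP contains a file with it.

    sip_files: dict of AIP ID -> iterable of MD5 hex strings (assumed shape).
    """
    index = defaultdict(set)
    for aip_id, md5s in sip_files.items():
        for md5 in md5s:
            index[md5].add(aip_id)
    return index


def revisions_found_in_other_sips(revision_files, index):
    """Return the sorted AIP IDs that have at least one revision file
    whose MD5 also appears in the SIP of a *different* AIP."""
    hits = []
    for aip_id, md5s in revision_files.items():
        # index.get avoids adding empty entries to the defaultdict
        if any(index.get(md5, set()) - {aip_id} for md5 in md5s):
            hits.append(aip_id)
    return sorted(hits)
```

Building the index once keeps each lookup cheap, which matters if the SIP data for several hundred thousand AIPs is loaded into it.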

I'm now loading in the SIP data for the 307227 other AIPs that didn't have any revisions, to determine if the "unique" files are actually in one of these other SIPs.

russell@russell-XPS-13-9370:~$ curl -s -X GET "http://10.200.1.58:5984/repoanalysis/_design/ra/_view/dupinother?reduce=false" | grep '^{"id' | sed -e 's/^.*"id":"\([^"]*\).*$/\1/'
oocihm.8_06548_101
oocihm.8_06548_80
oocihm.8_06548_107
oocihm.8_06548_108
oocihm.8_06548_109
oocihm.8_06548_98
ooe.b4225223_231
ooe.b4225223_244
ooe.b4225223_248
ooe.b4225223_253
ooe.b4225223_257
ooe.b4225223_259
ooe.b4225223_224
ooe.b4225223_227
ooe.b4225223_236
ooe.b4225223_239
ooe.b4225223_247
ooe.b4225223_249
ooe.b4225223_250
ooe.b4225223_251
ooe.b4225223_252
ooe.b4225223_254
ooe.b4225223_255
ooe.b4225223_258
ooe.b4225223_260
ooe.b4225223_261
ooe.b4225223_262
ooe.b4225223_263
ooe.b4225223_267
ooe.b4225223_268
ooe.b4225223_272
ooe.b4225223_223
ooe.b4225223_233
ooe.b4225223_234
ooe.b4225223_235
ooe.b4225223_238
ooe.b4225223_242
ooe.b4225223_246
ooe.b4225223_256
ooe.b4225223_264
ooe.b4225223_266
ooe.b4225223_226
ooe.b4225223_237
ooe.b4225223_243
ooe.b4225223_265
ooe.b4225223_269
ooe.b4225223_270
ooe.b4225223_225
ooe.b4225223_230
ooe.b4225223_241
ooe.b4225223_229
ooe.b4225223_228
ooe.b4225223_245
ooe.b4225223_271
ooe.b4225223_240
ooe.b4225223_232
ooe.b3218570
ooe.b3218594
ooe.b3750656
ooe.b3750668
oop.debates_SOC1901_01
oocihm.lac_reel_t6951
oop.SOC_3402_130_02
oocihm.lac_reel_t6952
oop.debates_HOC3201_20
oop.debates_CDC3201_20
oocihm.lac_reel_t18225
russell@russell-XPS-13-9370:~$ 
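
The grep/sed pipeline above pulls the "id" field out of each row of the CouchDB view response by pattern matching on the raw text. A hedged alternative is to parse the response as JSON. The sketch below assumes the standard CouchDB view response shape (a top-level "rows" array whose entries carry an "id"); the function names are illustrative only.

```python
import json
from urllib.request import urlopen


def view_row_ids(response_text):
    """Extract the document ID from every row of a CouchDB view response."""
    body = json.loads(response_text)
    return [row["id"] for row in body.get("rows", [])]


def fetch_view_ids(url):
    """Fetch a view over HTTP and return its row IDs.

    The URL is passed in by the caller, e.g. the dupinother view URL
    used in the curl command above.
    """
    with urlopen(url) as resp:
        return view_row_ids(resp.read().decode("utf-8"))
```

Parsing the JSON does not depend on how CouchDB happens to lay out each row on a line, which is what the grep on `^{"id` relies on.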
RussellMcOrmond commented 5 years ago

I moved the processing of JHOVE reports out of CIHM::Meta, and their creation out of CIHM::WIP. The intention is to move the report generation and processing to this project.

One of the early steps was to copy the existing JHOVE reports out of CouchDB and put them in Swift. The CouchDB database (documents and the JHOVE XML files as attachments) was 239.7 GB in size. After copying the 61,773,055 JHOVE XML attachments to Swift, they are taking up 1.32 TB of space.
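To put those figures in per-file terms, the arithmetic can be sketched as below. This assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes), which the thread does not state, and it treats the whole CouchDB database size as if it were attachment data, so the CouchDB average is an upper bound.

```python
ATTACHMENTS = 61_773_055   # JHOVE XML attachments copied to Swift
COUCHDB_BYTES = 239.7e9    # 239.7 GB, assuming decimal gigabytes
SWIFT_BYTES = 1.32e12      # 1.32 TB, assuming decimal terabytes

avg_couchdb = COUCHDB_BYTES / ATTACHMENTS   # average bytes per report in CouchDB
avg_swift = SWIFT_BYTES / ATTACHMENTS       # average bytes per report in Swift
growth = SWIFT_BYTES / COUCHDB_BYTES        # how many times larger the Swift copy is

print(f"CouchDB: about {avg_couchdb / 1e3:.1f} kB per report")
print(f"Swift:   about {avg_swift / 1e3:.1f} kB per report")
print(f"Growth:  about {growth:.1f}x")
```

Under these assumptions the same reports take roughly five and a half times as much space in Swift. The numbers alone do not show how much of that is per-object overhead and how much comes from other differences in how the two systems store data.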

Turns out that repository analysis became a learning moment about Swift and the overhead of storing a large number of small files. I am redesigning the tools to make use of Archive::Zip to group these XML reports together (one zip per AIP ID, rather than one object per file).
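The actual tools use Perl's Archive::Zip and are not shown here. As an illustration of the grouping idea only, here is a minimal Python sketch using the standard zipfile module. It assumes, hypothetically, that each report is stored under a name of the form `<AIP ID>/<path to report>`; the real naming scheme may differ.

```python
import io
import zipfile
from collections import defaultdict


def group_by_aip(object_names):
    """Group object names by their leading '<AIP ID>/' prefix (assumed layout)."""
    groups = defaultdict(list)
    for name in object_names:
        aip_id, _, rest = name.partition("/")
        if rest:  # skip names that have no AIP prefix
            groups[aip_id].append(name)
    return dict(groups)


def zip_reports(reports):
    """Pack one AIP's reports into a single in-memory zip.

    reports: dict of path inside the zip -> report content as bytes.
    Returns the zip archive as bytes, ready to store as one object.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(reports):
            zf.writestr(path, reports[path])
    return buf.getvalue()
```

With one archive per AIP ID, the number of stored objects drops from one per report to one per AIP.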

RussellMcOrmond commented 4 years ago

Two tools have been running: one to delete virtual folders when a zip file already exists, and one to store a zip with all reports. The tool storing the zip files has finished, so only the tool removing objects in virtual folders is still running.
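The cleanup rule described above (remove the per-file objects only once the zip for that AIP exists) can be sketched as a pure function over a container listing. This is an assumption-laden illustration: it supposes a zip is named `<AIP ID>.zip` and the individual reports sit under the `<AIP ID>/` prefix, neither of which is confirmed in this thread.

```python
def folder_objects_to_delete(object_names, zip_suffix=".zip"):
    """Return the objects under '<AIP ID>/' for which '<AIP ID>.zip' exists.

    Objects of AIPs without a zip are left alone, so nothing is deleted
    before its replacement archive has been stored.
    """
    names = set(object_names)
    # AIP IDs that already have a top-level zip object
    zipped = {
        name[: -len(zip_suffix)]
        for name in names
        if name.endswith(zip_suffix) and "/" not in name
    }
    return sorted(
        name
        for name in names
        if "/" in name and name.split("/", 1)[0] in zipped
    )
```

Computing the list separately from the delete calls makes it possible to review or count what would be removed before anything is deleted.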

I'm a bit surprised by the current storage: count=56923975 and size=2.61 TB

RussellMcOrmond commented 4 years ago

A decision about what policy to use for migration of deposited images to Archivematica has been made, and will be presented to the PAC in January. Repository analysis is on hold until there are new questions that we need to answer.