archivematica / Issues

Issues repository for the Archivematica project

Problem: newer pipelines may not be able to handle AIPs made with older pipelines #24

Closed jrwdunham closed 5 years ago

jrwdunham commented 6 years ago

Please describe the problem you'd like to be solved.

Sometimes an institution has been creating AIPs for a long time using various versions of Archivematica and the Storage Service and the institution wants to make sure that those AIPs are usable by their modern (currently installed) Archivematica version, e.g., in order to perform additional preservation actions like adding metadata.

Without modifying the content of such older AIPs, they may need to be compressed and given pointer files that document that compression.

Old AIPs may also have METS files that lack namespaces in their XML elements. It is unclear whether a modern Archivematica pipeline knows how to read such AIPs. Archivematica's METS interpretation functionality (in particular mets-reader-writer) may need to be modified in order to handle such AIPs.
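As a rough illustration of the namespace question, here is a minimal sketch that checks whether a METS document's root element is in the METS namespace. It assumes "lacking namespaces" means the elements are not bound to http://www.loc.gov/METS/ at all; the helper names `mets_is_namespaced` and `local_name` are hypothetical and are not part of mets-reader-writer.

```python
import xml.etree.ElementTree as ET

# Standard METS namespace URI.
METS_NS = "http://www.loc.gov/METS/"


def mets_is_namespaced(mets_xml):
    """Return True if the root element is bound to the METS namespace.

    ElementTree spells a namespaced tag as "{namespace-uri}localname",
    so a prefixed or default-namespaced root both count as namespaced.
    """
    root = ET.fromstring(mets_xml)
    return root.tag.startswith("{%s}" % METS_NS)


def local_name(tag):
    """Strip any "{namespace-uri}" part so old and new METS compare equal."""
    return tag.rsplit("}", 1)[-1]
```

A reader that compares `local_name(element.tag)` instead of the fully qualified tag could, in principle, walk both kinds of document with the same code, at the cost of ignoring which namespace an element actually belongs to.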

Old AIPs may have accidentally been deleted from a Storage Service database. Their database tables in the SS db may need to be reconstructed from their METS files.

Similarly, old AIPs may have been created on a pipeline that has since been destroyed. Such AIPs may need their SS and MCP (pipeline) database tables reconstructed.

Describe the solution you'd like to see implemented.

I would like to see a case study of the techniques used (or attempted) to make Archivematica flexible and accommodating to various types of vintage AIP. I expect AIP re-ingest may be useful in compressing and generating pointer files for older AIPs, but API calls, Django management commands, and bespoke import/modernize scripts may all be necessary. I expect to see examples of using the AM and SS GUIs and APIs to interact with older AIPs and to discover whether older AIPs can be re-ingested.

Additional context

Imagine you had this:

| Compressed? | Pointer file? | Namespaced METS? | Typical vintage |
|---|---|---|---|
| 0 | 0 | 1 | AM >= 1.2 (circa 2017 - 2018) |
| 1 | 0 | 0 | AM < 1.0 (2012-2013) / AM 1.0 - 1.1 (ca. 2014) |
| 1 | 0 | 1 | AM >= 1.2 (ca. 2015 - 2016) |

How would you get to this?

| Compressed? | Pointer file? | Namespaced METS? | Typical vintage |
|---|---|---|---|
| 1 | 1 | 1 | AM = 1.7 (2018) |

jrwdunham commented 6 years ago

Progress Update

@jhsimpson @sromkey

Three development branches in three different repos have been created to deal with this issue:

  1. SS dev/issue-24-handle-old-aips
  2. AM dev/issue-24-handle-old-aips
  3. metsrw dev/issue-24-handle-old-aips

The most significant of these is the SS one, which introduces the import_aip Django management command. This command can import an existing (vintage) AIP into a running Storage Service instance, resulting in a compressed AIP with a pointer file. See import_aip.

The general methodology employed here was to use import_aip to import exemplars of the various types of vintage AIP (i.e., types 0.0.1, 1.0.0, and 1.0.1) and then perform various actions and tests—in particular AIP re-ingest—to test whether Archivematica is able to handle these AIPs correctly.

Experiment 1: 0.0.1: uncompressed, no pointer file, METS namespaced

Experiment 2: 1.0.0: compressed, no pointer file, METS not namespaced

Experiment 3: 1.0.1: compressed, no pointer file, METS namespaced

sevein commented 5 years ago

@hakamine, I've updated JD's branches with a number of fixes. The main changes to support old METS:

sevein commented 5 years ago

This is the latest re-ingested METS (using AM68's AIP): METS.xml.

There is a known issue with the validator:

[...]
ERROR ON LINE 2457: Element '{info:lc/xmlns/premis-v2}agent', attribute 'version': [facet 'enumeration'] The value '2.2' is not an element of the set {'2.0', '2.1'}.
ERROR ON LINE 2457: Element '{info:lc/xmlns/premis-v2}agent', attribute 'version': '2.2' is not a valid value of the atomic type '{info:lc/xmlns/premis-v2}versionSimpleType'.
ERROR ON LINE 2471: Element '{info:lc/xmlns/premis-v2}agent', attribute 'version': [facet 'enumeration'] The value '2.2' is not an element of the set {'2.0', '2.1'}.
ERROR ON LINE 2471: Element '{info:lc/xmlns/premis-v2}agent', attribute 'version': '2.2' is not a valid value of the atomic type '{info:lc/xmlns/premis-v2}versionSimpleType'.

I haven't found a way to combine in the same document elements using premis-v2-1.xsd and premis-v2-2.xsd. The validator seems to associate the namespace (info:lc/xmlns/premis-v2) only to the schema location appearing first.

CC @hakamine @evelynPM

hakamine commented 5 years ago

Hi @sevein I have a question re the error output shown in the previous comment. Do the errors show in the output of the import_aip command or the output of a different tool? I tried the import command on the same AIP and got a "successfully imported" message (no error messages). I am using: AM: dev/issue-24-handle-old-aips 20e2977 and SS: dev/issue-24-handle-old-aips f3fd8dd

sevein commented 5 years ago

Hi @hakamine! It's good news that it's working for you. The error mentioned above is only seen during the validation of the METS document generated by reingest of the imported AIP.

The validation issue has not been solved yet, unfortunately. We're still trying to find the best way to go. The root cause seems to be that you can only associate an XML prefix to a namespace once. E.g. if premis is used once and it points to premis-v2-1.xsd, then that schema is used by the validator at all times, regardless of the fact that we are declaring different namespaces later on for the same prefix. So when we combine premis v2.1 + v2.2 (or v2.1 + v3.x as https://github.com/archivematica/Issues/issues/370 suggests), the validator errors out. I've thought maybe we could use different prefixes, e.g. premis and premis3, but that may not be a good idea.
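To spot documents that are likely to hit this, a minimal sketch (not taken from any of the branches) can list namespaces that are given more than one schema location through xsi:schemaLocation hints. The function names and the sample document are hypothetical; only the namespace info:lc/xmlns/premis-v2 and the two schema file names come from the discussion above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XSI_SCHEMA_LOCATION = "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation"

# Hypothetical sample: two PREMIS elements pointing at different schema files.
SAMPLE = """<root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:premis="info:lc/xmlns/premis-v2">
  <premis:object xsi:schemaLocation="info:lc/xmlns/premis-v2 premis-v2-1.xsd"/>
  <premis:agent xsi:schemaLocation="info:lc/xmlns/premis-v2 premis-v2-2.xsd"/>
</root>"""


def schema_locations_by_namespace(xml_text):
    """Map each namespace URI to the set of schema locations declared for it."""
    locations = defaultdict(set)
    for element in ET.fromstring(xml_text).iter():
        hint = element.get(XSI_SCHEMA_LOCATION)
        if not hint:
            continue
        tokens = hint.split()
        # xsi:schemaLocation is a whitespace-separated list of
        # (namespace, location) pairs.
        for namespace, location in zip(tokens[0::2], tokens[1::2]):
            locations[namespace].add(location)
    return locations


def conflicting_namespaces(xml_text):
    """Return namespaces that are associated with more than one schema file."""
    found = schema_locations_by_namespace(xml_text)
    return {ns: sorted(urls) for ns, urls in found.items() if len(urls) > 1}


print(conflicting_namespaces(SAMPLE))
```

This only reports the conflict; it does not validate anything, and it says nothing about which schema a given validator will end up using.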

hakamine commented 5 years ago

When trying to index the AM68 AIP, I get the following error:

...
Rebuilding AIP UUID bdcb560d-7ddd-4c13-8040-1e565b4eddff
Processing AIP bdcb560d-7ddd-4c13-8040-1e565b4eddff
Deleting AIP bdcb560d-7ddd-4c13-8040-1e565b4eddff from aips/aip and aips/aipfile.
Command to run: ['unar', '-force-overwrite', '-o', '/tmp/tmpGzBMkl', '/mnt/siptransfer01-cvan110/aipstore-test-g543/bdcb/560d/7ddd/4c13/8040/1e56/5b4e/ddff/AM68-bdcb560d-7ddd-4c13-8040-1e565b4eddff.7z', 'AM68-bdcb560d-7ddd-4c13-8040-1e565b4eddff/data/METS.bdcb560d-7ddd-4c13-8040-1e565b4eddff.xml']
/mnt/siptransfer01-cvan110/aipstore-test-g543/bdcb/560d/7ddd/4c13/8040/1e56/5b4e/ddff/AM68-bdcb560d-7ddd-4c13-8040-1e565b4eddff.7z: 7-Zip
  AM68-bdcb560d-7ddd-4c13-8040-1e565b4eddff/data/METS.bdcb560d-7ddd-4c13-8040-1e565b4eddff.xml  (71576 B)... OK.
Successfully extracted to "/tmp/tmpGzBMkl/AM68-bdcb560d-7ddd-4c13-8040-1e565b4eddff".
Removed FITS output from METS.
Traceback (most recent call last):
  File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 354, in execute_from_command_line
    utility.execute()
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 346, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/management/base.py", line 394, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/management/base.py", line 445, in execute
    output = self.handle(*args, **options)
  File "/opt/archivematica/archivematica/src/dashboard/src/main/management/commands/rebuild_elasticsearch_aip_index_from_files.py", line 295, in handle
    delete_existing_data=options['delete'],
  File "/opt/archivematica/archivematica/src/dashboard/src/main/management/commands/rebuild_elasticsearch_aip_index_from_files.py", line 169, in processAIPThenDeleteMETSFile
    size=aip_info[0]['size'],
  File "/usr/lib/archivematica/archivematicaCommon/elasticSearchFunctions.py", line 425, in index_aip
    mets_created_attr = mets_hdr.get('CREATEDATE')
AttributeError: 'NoneType' object has no attribute 'get'

It seems that the script expects a metsHdr element (from which it reads the CREATEDATE attribute) that is not present in this METS file.

sevein commented 5 years ago

@hakamine thanks for the report. I've just added a new commit to handle that scenario where metsHdr is missing. See https://github.com/artefactual/archivematica/commit/ce1579f2b4c677b3f25e4e44deefa5512aac00d2.

hakamine commented 5 years ago

Thank you @sevein for the fix. I tested it and the indexing of the AIP "AM68" (using the custom management command rebuild_elasticsearch_aip_index_from_files) finished without errors this time. However, I am getting errors when trying to download individual files from the AIP from the archival storage tab:

[Screenshot: 2019-01-07 at 3.10.24 pm]

When I try to download any of the files, I get an internal server error. The dashboard log shows a message like:

ERROR     2019-01-07 23:11:02  django.request:base:handle_uncaught_exception:256:  Internal Server Error: /archival-storage/download/aip/file/431913ba-4379-4373-8798-cc5f2b9dd769/
Traceback (most recent call last):
 File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 132, in get_response
   response = wrapped_callback(request, *callback_args, **callback_kwargs)
 File "/opt/archivematica/archivematica/src/dashboard/src/components/archival_storage/views.py", line 318, in aip_file_download
   file = models.File.objects.get(uuid=uuid)
 File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/db/models/manager.py", line 127, in manager_method
   return getattr(self.get_queryset(), name)(*args, **kwargs)
 File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/db/models/query.py", line 334, in get
   self.model._meta.object_name
DoesNotExist: File matching query does not exist.

(not sure if this could be caused by the "vintage" METS or by a bug in the reindex command script that affects all AIPs. If you think it could be the latter let me know and I'll open a separate github issue)

sevein commented 5 years ago

Thanks, I'll take a look!

sevein commented 5 years ago

Progress update!

Known issues:

hakamine commented 5 years ago

@sevein would it be ok if I rebase AM dev/issue-24-handle-old-aips on top of current qa/1.x ? I would like to check if the ES upgrade changes fix the problem with the indexing of AIP files.

sevein commented 5 years ago

Please do! Thanks.

hakamine commented 5 years ago

It looks like there is something in AM dev/issue-24-handle-old-aips that breaks indexing of AIP files. I did the following test for both AM qa/1.x (c6396b9) and dev/issue-24-handle-old-aips (d4ac060) branches:

  1. Delete the aips and aipfiles indexes
  2. Run the rebuild_elasticsearch_aip_index_from_files management command for an ingested AIP containing the "Images" sampledata directory

While the management command finishes without errors for both branches, the resulting aipfiles index in branch AM dev/issue-24-handle-old-aips has all the FILEUUID fields set to null. This causes an internal server error in the dashboard's Archival Storage tab when "Show files" is checked. The dashboard log is:

ERROR     2019-02-09 00:20:15  django.request:base:handle_uncaught_exception:256:  Internal Server Error: /archival-storage/search/
Traceback (most recent call last):
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/archivematica/archivematica/src/dashboard/src/components/archival_storage/views.py", line 180, in search
    'page': page_data,
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/shortcuts.py", line 67, in render
    template_name, context, request=request, using=using)
...
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/template/defaulttags.py", line 493, in render
    url = reverse(view_name, args=args, kwargs=kwargs, current_app=current_app)
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/urlresolvers.py", line 578, in reverse
    return force_text(iri_to_uri(resolver._reverse_with_prefix(view, prefix, *args, **kwargs)))
  File "/usr/share/archivematica/virtualenvs/archivematica-dashboard/local/lib/python2.7/site-packages/django/core/urlresolvers.py", line 495, in _reverse_with_prefix
    (lookup_view_s, args, kwargs, len(patterns), patterns))
NoReverseMatch: Reverse for 'components.archival_storage.views.aip_file_download' with arguments '(None,)' and keyword arguments '{}' not found. 1 pattern(s) tried: ['archival-storage/download/aip/file/(?P<uuid>[\\w]{8}(-[\\w]{4}){3}-[\\w]{12})/$']

The aipfiles index is attached (dump produced with, e.g., curl -o search_aipfiles_reindex_devissue24.json 'http://localhost:9200/aipfiles/_search?pretty=true&q=*:*'; the URL is quoted so that the shell does not interpret the & and the * characters). For comparison, the aipfiles index obtained with branch qa/1.x is also attached. search_aipfiles_reindex_devissue24.json.zip search_aipfiles_reindex_qa1x.json.zip
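For checking such a dump, a minimal sketch follows. It assumes the file is a standard Elasticsearch search response (hits under `hits.hits`, each with a `_source` object) and that the field is called FILEUUID as described above; the function name is hypothetical.

```python
import json


def count_null_file_uuids(search_response):
    """Return (number of hits, list of _id values whose FILEUUID is null).

    `search_response` is the parsed JSON body of an Elasticsearch
    _search request against the aipfiles index.
    """
    hits = search_response.get("hits", {}).get("hits", [])
    missing = [
        hit.get("_id")
        for hit in hits
        if hit.get("_source", {}).get("FILEUUID") is None
    ]
    return len(hits), missing


# Example usage against a dump saved with curl (file name from the comment above):
# with open("search_aipfiles_reindex_devissue24.json") as dump:
#     total, missing = count_null_file_uuids(json.load(dump))
#     print("%d of %d hits have no FILEUUID" % (len(missing), total))
```

Note that a _search request returns only the first 10 hits unless a size parameter is given, so a dump made this way shows a sample of the index rather than all of it.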

hakamine commented 5 years ago

@sevein I added some commits to AM branch dev/issue-24-aip-index-premis2-fallback (which is based off dev/issue-24-handle-old-aips), in order to try to fix the indexing issues mentioned in the previous comment. So far it seems to be working (I'll continue testing with other AIPs). Please let me know if the code looks good and I'll merge these commits to branch dev/issue-24-handle-old-aips.

It looks like we are getting closer to the goal! Should we create PRs for the AM/SS/metsrw dev/issue-24-handle-old-aips branches?

Update 1: while the index worked for the "images" sampledata, I am having some problems with AIP AM68 files.

Update 2: the problem with the AM68 AIP is detailed in https://github.com/archivematica/Issues/issues/504

Update 3: added a commit to dev/issue-24-aip-index-premis2-fallback that fixes https://github.com/archivematica/Issues/issues/504

Update 4: tested the AIP import and reindex management commands with a few more AIPs, as shown in the screenshot below. Both commands completed without error messages for all of the AIPs tested. However, there is an issue with the "Date stored" value in some AIPs: instead of the date they were originally ingested, it shows the date of the reindex (red box in the screenshot).

[Screenshot: 2019-02-16 at 11.23.51 am]

This occurs for AIPs in group G5 (ingested between 2011/11 and 2012/09, no namespaces in the METS file; AIPs "AM68", "CVA4-6-7", "AM1553-MI-515" in the screenshot). However, AIPs in group G4 (ingested in 2014, no namespaces in the METS file; AIPs "1970-03-03", "1979-09-13_SPEC" in the screenshot) show the correct value for the date. Looking at the METS files, the metsHdr CREATEDATE is not present in the G5 AIPs (i.e., the AM version used to ingest these AIPs didn't include it in the METS file), so the reindex script just uses the current date (the date the reindex command runs) in the created field of the aips index. In conclusion, this is not a bug in the reindex script; it is a limitation caused by missing information in the METS files.
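The fallback described above can be sketched roughly as follows. This is not the code in elasticSearchFunctions.py; the function name `aip_created_date` and the returned flag are hypothetical, and the sketch assumes non-namespaced METS files use a bare metsHdr element name.

```python
import datetime
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"


def aip_created_date(mets_xml, now=None):
    """Return (date string, found_in_mets) for an AIP's METS document.

    Falls back to the current time when metsHdr or its CREATEDATE
    attribute is missing, which is why old AIPs show the reindex date.
    """
    root = ET.fromstring(mets_xml)
    # Try the namespaced element first, then the bare name used by older METS.
    mets_hdr = root.find("{%s}metsHdr" % METS_NS)
    if mets_hdr is None:  # compare with None: childless elements are falsy
        mets_hdr = root.find("metsHdr")
    created = mets_hdr.get("CREATEDATE") if mets_hdr is not None else None
    if created:
        return created, True
    now = now or datetime.datetime.now()
    return now.strftime("%Y-%m-%dT%H:%M:%S"), False
```

Returning a flag alongside the date would let a caller distinguish a real ingest date from a substituted one, for example to label it differently in the Archival Storage tab.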

sevein commented 5 years ago

@hakamine nice job on dev/issue-24-aip-index-premis2-fallback, thanks for your comment and all the updates. It's all working nicely locally, so I'd suggest updating dev/issue-24-handle-old-aips with all your changes.

sevein commented 5 years ago

@hakamine, FYI Evelyn noticed that after reingest we have metsHdr created as follows:

<mets:metsHdr CREATEDATE="2019-02-20T19:57:40"/>

It may be preferable to use LASTMODDATE instead (see https://github.com/artefactual-labs/mets-reader-writer/blob/1f4b7fdbb37512f2c7c42a750b9f08bc73738d9f/metsrw/mets.py#L183-L192), but we won't be addressing that now since it's not a deal breaker.

hakamine commented 5 years ago

Thank you @sevein, dev/issue-24-handle-old-aips has been updated (and dev/issue-24-aip-index-premis2-fallback has been deleted).

sevein commented 5 years ago

Update:

sallain commented 5 years ago

@hakamine can you take a look at this and change the label to Done if you feel confident that this is resolved?

hakamine commented 5 years ago

Tested on AM qa/1.x 50affa (1.10.0-rc.1) SS qa/0.x fbdf31 (0.15.0-rc.1). Import and reindex of CVA old AIPs is working.