Migration of CERN Digitized Videos

CERNDocumentServer / cds-videos

Access articles, reports and multimedia content in HEP

GNU General Public License v2.0

Useful links

The current data model of CDS Videos is available here.
The collection of digitized videos to be migrated in CDS is here.
The documentation of what was done during digitization is here, with the list of metadata fields.

Data model changes

The first step is to analyze the data model of CDS Videos and understand what changes should be made. Since CDS Videos will be migrated to CDS in the future, the data model changes should be compatible with the InvenioRDM data model (and custom fields).

Extra fields

We should evaluate if these extra fields could go to a JSON blob field, allowing key/values, and the impact of this solution on search capabilities.
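To make the JSON blob idea concrete, here is a minimal sketch that moves every key outside a known set into a single key/value field. The set of known field names is hypothetical, and the blob key is assumed to be `_digitization`, the field name mentioned later in this issue. Whether values inside the blob are searchable depends on how the search index maps that field, which this sketch does not address.

```python
# Hypothetical set of field names that already exist in the data model.
KNOWN_FIELDS = {"title", "description", "date"}


def split_extra_fields(metadata, known_fields=KNOWN_FIELDS, blob_key="_digitization"):
    """Keep known top-level keys and move all other keys into one blob."""
    record = {k: v for k, v in metadata.items() if k in known_fields}
    extras = {k: v for k, v in metadata.items() if k not in known_fields}
    if extras:
        # Only add the blob when there is something to store in it.
        record[blob_key] = extras
    return record
```

For example, `split_extra_fields({"title": "LEP", "tape_format": "U-matic"})` keeps `title` at the top level and places `tape_format` under `_digitization`.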

Owners

It is not yet clear who should be the owner of these records and who can edit their metadata. To be discussed and decided. For curation, we should probably create a group "multimedia curators" and decide who goes in.

Considerations

There are duplicated videos: some videos already in CDS Videos have been re-digitized, and both copies have the same recid. Both videos, old and new, should be kept, so we need to check whether the data model supports this. The same applies to metadata: the metadata of existing videos should be enriched with that of the newly digitized ones.
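One possible reading of "enriched" can be sketched as follows. The merge policy is an assumption, since the conflict rule has not been decided here: values already present on the existing record win, and only missing or empty fields are filled from the newly digitized record. The function name is hypothetical.

```python
def enrich_metadata(existing, digitized):
    """Return a copy of `existing` with gaps filled from `digitized`.

    Assumed policy: a value already present in `existing` is kept;
    a key that is missing or empty (None, "", [] or {}) is filled.
    """
    enriched = dict(existing)
    for key, value in digitized.items():
        if enriched.get(key) in (None, "", [], {}):
            enriched[key] = value
    return enriched
```

With this policy an existing title is never overwritten, while an empty description or a missing duration would be taken from the digitized record.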

Relevant code

We should re-use the cds-dojson module and its field rules for CDS Videos. See the dojson documentation for examples: https://dojson.readthedocs.io/en/latest/usage.html

We will create a branch, e.g. digitization-2023, in cds-dojson, where we can apply the modifications to the CDS Videos schema and add new conversion rules. We should update the README to explain why the new branch exists, with relevant links to the digitization project/process.

cds-dojson usage example:

from cds_dojson.marc21.utils import create_record
from cds_dojson.marc21 import marc21
from cds_dojson.marc21.models.videos.video import model as video_model

# Make sure that the XML does not start with the XML declaration:
# <?xml version="1.0" encoding="UTF-8"?>
# Otherwise you might get the error:
# ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

PATH = '/tmp/2256680.xml'
with open(PATH, 'rb') as fp:
    marcxml = fp.read()
blob = create_record(marcxml)

# ALTERNATIVES
# 1. guess video/project by __query__
record = marc21.do(blob)
# 2. expect a video directly
record = video_model.do(blob)
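
If the MARCXML is handled as a text string rather than bytes, the declaration mentioned in the comments can be removed before parsing. The helper below is a small sketch with a hypothetical name. It only strips a leading `<?xml ...?>` declaration and leaves the rest of the document untouched.

```python
import re

# Matches an XML declaration at the very start of the string,
# including any whitespace around it.
XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>\s*")


def strip_xml_declaration(marcxml):
    """Remove a leading <?xml ...?> declaration from a text string."""
    return XML_DECLARATION.sub("", marcxml, count=1)
```

The result can then be passed on for conversion in the same way as the `marcxml` variable above.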

Current state of the project:

Note: everything was tested using only a local instance/sandbox. Not tested with real CDS Videos.

The migration service fetches records from (real) CDS that belong to the digital memory project, are public, are not yet migrated and are not announcements, and parses them (see the documentation for all the options for single-record or multiple-record processing). Most tags are now identified and handled (see the CDS-Dojson PR with all the fixes and updates) to create JSON compatible with CDS Videos.

The upload function also works and properly triggers the upload process to CDS Videos. The record status check works, and so does the chunked processing, which makes sure that only 10 videos are transferred at a time. A migration_state database is updated automatically to save the state of the migration, so that videos already migrated are not transferred again if the process is restarted. Along with the videos, all their metadata extracted by CDS-Dojson is sent to CDS Videos (see the CDS-Videos PR where I changed the CDS-Videos schema to accept the '_digitization' field as well).
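
The chunking and restart behaviour described above can be illustrated with a small sketch. This is not the actual service code: the SQLite table layout, the function names and the `upload` callable are all assumptions made for illustration. Only the chunk size of 10 and the name migration_state come from the description.

```python
import sqlite3

CHUNK_SIZE = 10  # the description above says 10 videos per transfer


def open_state(path=":memory:"):
    """Open (or create) a state database; the table layout is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS migration_state (recid TEXT PRIMARY KEY)"
    )
    return conn


def pending(conn, recids):
    """Return the recids that are not yet recorded as migrated."""
    done = {row[0] for row in conn.execute("SELECT recid FROM migration_state")}
    return [recid for recid in recids if recid not in done]


def chunks(items, size=CHUNK_SIZE):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def migrate(conn, recids, upload):
    """Upload pending records chunk by chunk, recording each success."""
    for chunk in chunks(pending(conn, recids)):
        for recid in chunk:
            upload(recid)  # hypothetical callable, expected to raise on failure
            conn.execute(
                "INSERT OR IGNORE INTO migration_state (recid) VALUES (?)",
                (recid,),
            )
        conn.commit()  # persist the state after every chunk
```

Calling `migrate` a second time with the same recids uploads nothing, because every recid is already present in the state table.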

Additionally, the script that generates the CDS MARCXML to update the tag is already written, and I have created a CDS-Videos-Transfer PR to make my repository public as part of the CDS group on GitHub.

Things that still need to be done:

Check CDS-Dojson for (yet another) tag update. Since JY is still adding more videos to the digital memory project, there are tags that haven't been updated yet. Check the tag spreadsheet for more information about the added tags and the ignored tags, and to map and process more tags in the future.
Make CDS-Videos download the '_digitization'.additional_files, not only one file per video (most videos come in multiple file formats).
Change the script that generates the MARCXML that updates CDS so that it uses a proper tag and text.
Create a piece of code that checks whether transcoding failed or succeeded after migration.
Expand the project to include record updates, not only migration.
