fcrepo-exts / migration-utils

An in-development framework for managing data migrations from previous versions to 4.x.
Apache License 2.0
13 stars 29 forks source link

Add HEAD only migration option #185

Closed mikejritter closed 2 years ago

mikejritter commented 2 years ago

JIRA Ticket: https://fedora-repository.atlassian.net/browse/FCREPO-3674

What does this Pull Request do?

Adds cli option for specifying HEAD only migrations and an option for passing in a list of datastream ids which should only have their HEAD migrated.

What's new?

How should this be tested?

Depending on the dataset available, run a migration with the head only option and with the `--head-only-ids option.

E.g. with the brown dataset:

java -jar target/migration-utils-6.2.0-SNAPSHOT-driver.jar --debug -t legacy -o /data/migration/brown/repostore/data_2019/objects -d /data/migration/brown/repoarchive/datastreams_2019 -a /data/migration/brown-head-only -i work --head-only

Then navigate to the ocfl-root of the migrated dataset and search for objects with multiple datastream versions. This can be done manually or by checking using a mix of rocfl/cli tools:

for id p in `rocfl ls -p`; do dupes=$(find ${p} -mindepth 3 -maxdepth 3 -type f ! -name "*.nt" | cut -c 88- | sort | uniq -d | wc -l); echo "${id} has ${dupes} non-head datastreams" ; done

As a side note, this works for me using zsh but I haven't tested in other shells. The find command lists all files under the content directory and filters out n-triples. Then cut/sort/uniq to trim out excess paths (e.g. to make $uuid/v1/content/ds-id and $uuid/v2/content/ds-id to both be /ds-id) and search for duplicates.

Additional Notes:

I chose the Brown dataset for testing as it has RELS-INT entries which adds datastreams versions during migration if there are modifications which need to occur. My initial changes were failing here but by adding some basic state tracking, similar to the datastreamStates, it is easy enough to check if a datastream should have said changes or not.

Also since deletions occur at the end of a migration, the --head-only option will still migrate the last existing version of a datastream. This feels ok, but thought I would mention it in case there are any other thoughts on how deleted states should be handled.

Interested parties

@fcrepo/committers

pwinckles commented 2 years ago

Actually, thinking about it some more, won't this code still produce an OCFL version per F3 object version?

mikejritter commented 2 years ago

Actually, thinking about it some more, won't this code still produce an OCFL version per F3 object version?

@pwinckles It can create additional OCFL versions from filename changes and deletions (potentially inactive too?). The version created from RELS-INT seemed ok, as it doesn't involve the datastream itself changing. The deletion was something I was thinking about, but it didn't really feel right to omit the datastream entirely.

I was hoping to get to your other comments and push changes up today but likely won't until tomorrow.

pwinckles commented 2 years ago

@mikejritter But, I thought the goal of this task was to squash all F3 versions into a single OCFL version? This PR doesn't do that because it loops over the F3 object versions and creates a new OCFL version for each. I haven't thought about it in depth, but I think what you'd need to do is something like instead of iterating the versions, grab the ObjectReference, a new method to it that only returns the head versions of its datastreams, and iterate those to create the migrated object. I guess the problem with that approach is that you might have trouble getting the created date?

pwinckles commented 2 years ago

Oh, the other problem is that you only want to do this for a subset of datastreams. That's tricky. Maybe your approach is right and there just needs to be some additional logic around whether or not the session is committed? Or maybe I'm wrong and people don't really care about the additional versions and they just want to cleanup their binary history -- like this PR is currently doing.

mikejritter commented 2 years ago

@pwinckles I've only just been able to get back to this. I hadn't really considered a squash -- I had thought about pruning the extra versions from the session but that was before I understood how the migrator was working.

I think the option of having a separate method handle the head only datastreams is interesting, it's just tricky since the RELS-INT can create versions for other datastreams, so either way there needs to be some way to resolve all the changes. Although I suppose if we know only the RELS-INT creates these changes, we could preprocess it and refer to that when handling head only datastreams (maybe handling the delete as well). I'm just thinking out loud though.

Also wrt to what people are actually expecting from this, maybe we should get some feedback from the people who commented on the ticket to see what their expectations are. It wouldn't make sense to merge this and then have the same issue crop up later because people expected x and got y.

mikejritter commented 2 years ago

After discussion on the call last Thursday, I've updated this to create a single version for Fedora objects being migrated. I've removed the --head-only-ids option for now as there is quite a bit of complexity in having migrations allow certain datastreams to migrate with their versions while flattening others.

I also realized while testing that this wouldn't be compatible with atomic migrations, so I'm having the migrator throw an IllegalArgumentException if both are specified. It likely doesn't need too much logic to be supported, but I wanted to get something out which can be tested.