fcrepo-exts / migration-utils

An in-development framework for managing data migrations from previous versions to 4.x.
Apache License 2.0
13 stars 29 forks source link

Unable to resolve internal ID #181

Closed edsu closed 2 years ago

edsu commented 2 years ago

I tried running with this command on a 1000 of our objects:

java -jar migration-utils-6.1.0-driver.jar \
  --source-type legacy \
  --datastreams-dir /home/lyberadmin/apps/fedora/home/data/datastreamStore \
  --objects-dir /home/lyberadmin/apps/fedora/home/data/objectStore \
  --target-dir ocfl \
  --limit 1000 \
  --continue-on-error \
  --debug

I suspect this is the result of our idosyncratic use of Fedora, but 999/1000 of the objects failed with this error:

ERROR 05:05:41.716 (Migrator) MIGRATION_FAILURE: pid="druid:cf604sc3425", message="Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!"
java.lang.RuntimeException: Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!
        at org.fcrepo.migration.foxml.DirectoryScanningIDResolver.resolveInternalID(DirectoryScanningIDResolver.java:129)
        at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor$Foxml11DatastreamVersion.<init>(FoxmlInputStreamFedoraObjectProcessor.java:401)
        at org.fcrepo.migration.foxml.FoxmlInputStreamFedoraObjectProcessor.processObject(FoxmlInputStreamFedoraObjectProcessor.java:194)
        at org.fcrepo.migration.Migrator.run(Migrator.java:161)
        at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:328)
        at org.fcrepo.migration.PicocliMigrator.call(PicocliMigrator.java:51)
        at picocli.CommandLine.executeUserObject(CommandLine.java:1743)
        at picocli.CommandLine.access$900(CommandLine.java:145)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2101)
        at picocli.CommandLine$RunLast.handle(CommandLine.java:2068)
        at picocli.CommandLine$AbstractParseResultHandler.execute(CommandLine.java:1935)
        at picocli.CommandLine.execute(CommandLine.java:1864)
        at org.fcrepo.migration.PicocliMigrator.main(PicocliMigrator.java:175)

Is there any obvious solution to this?

pwinckles commented 2 years ago

@edsu It looks like what's happening here is that it's looking for a datastream with the id druid:cf604sc3425+rightsMetadata+rightsMetadata.0 but can't find it. Can you confirm that that datastream exists in /home/lyberadmin/apps/fedora/home/data/datastreamStore?

edsu commented 2 years ago

Yes, you're right. That file doesn't exist in datastreamStore. I'm just remembering now that I'm working with a possibly partial copy of our production Fedora instance. So I suspect this is a problem with the copy and not migration-utils.

edsu commented 2 years ago

I just noticed that I might be looking in the wrong directory. I noticed that the object file was located in:

objectStore/6f/info%3Afedora%2Fdruid%3Acf604sc3425

I was assuming it would be in objectStore/cf because that's how the pid started? I thought maybe the druid:cf604sc3425+rightsMetadata+rightsMetadata.0 file would likewise be in:

datastreamStore/6f/druid:cf604sc3425+rightsMetadata+rightsMetadata.0

But I didn't see it there. The filesystem is quite large to search, do you happen to know what directory I should look in based on the PID?

pwinckles commented 2 years ago

@edsu I see that you ran your migration command with --source-type legacy. If your repository is using the legacy format, then I would expect to see date based paths like 2009/0508/20/48 within your datastreams directory. Is this what you see? If not, are you using akubra? The akubra layout is based on the md5 hash of the object PID plus the datastream id.

edsu commented 2 years ago

Nice catch, I see sub-directories named with two character hex values 00-FF. I guess this is akubra? I still get the same Unable to resolve internal ID errors when running with --source-type akubra. I guess I will need to search the filesystem to see if the file is there, and then see if it is being looked up correctly by migration-utils?

pwinckles commented 2 years ago

So, the code does a full scan of your datastreams and creates an index that maps id to location, which is why it would still fail even after passing it the akubra source type.

This is the Fedora 3 code that maps to akubra paths: https://github.com/fcrepo3/fcrepo/blob/37df51b9b857fd12c6ab8269820d406c3c4ad774/fcrepo-server/src/main/java/org/fcrepo/server/storage/lowlevel/akubra/HashPathIdMapper.java

If the internal id druid:cf604sc3425+rightsMetadata+rightsMetadata.0 is correct (I don't know if it is or not), then I would try looking under e0/80

edsu commented 2 years ago

Hmm, I only have one level of sub-directories under datastreamStore. I need to double check with people more knowledgeable than me to see if we have a customized Fedora Store.

pwinckles commented 2 years ago

@edsu The format is configurable. See: https://wiki.lyrasis.org/display/FEDORA38/Configuring+Low+Level+Storage#ConfiguringLowLevelStorage-AkubraLowLevelStorage(New)

Look for a line like this in your config: https://github.com/fcrepo3/fcrepo/blob/37df51b9b857fd12c6ab8269820d406c3c4ad774/fcrepo-server/src/main/resources/fcfg/server/akubra-llstore.xml#L62

I guess the default is just one level, ##. So, perhaps it's under e0/?

edsu commented 2 years ago

Alas it's not under e0. I started a search across the whole tree to see where it might be.

In the meantime I noticed that the datastreamStore files seem to have a info:fedora prefix (this is just a list of the first 5 files in the 00 directory):

$ ls 00 | head -5
info%3Afedora%2Fdruid%3Abb001hh3574%2FprovenanceMetadata%2FprovenanceMetadata.0
info%3Afedora%2Fdruid%3Abb006gm6059%2FrightsMetadata%2FrightsMetadata.0
info%3Afedora%2Fdruid%3Abb006hs1829%2FprovenanceMetadata%2FprovenanceMetadata.0
info%3Afedora%2Fdruid%3Abb009mb1479%2FidentityMetadata%2FidentityMetadata.0
info%3Afedora%2Fdruid%3Abb011vh2231%2FprovenanceMetadata%2FprovenanceMetadata.0

And that 00 are the first two characters of the md5 of the URL decoded filename:

>>> from urllib.parse import unquote
>>> from hashlib import md5
>>> md5(unquote('info%3Afedora%2Fdruid%3Abb001hh3574%2FprovenanceMetadata%2FprovenanceMetadata.0').encode('utf8')).hexdigest()
'00345d15cdddf9a509fd0666599fdd18'

Is it possible that migration-utils is looking up the id without the info:fedora prefix and not finding it because the md5 that was calculated is incorrect? Should I be doing something with --id-prefix?

pwinckles commented 2 years ago

@edsu Yes, the path is based on the md5 hash of the identifier. You are correct that it is supposed to start with info:fedora. Can you post the FOXML for druid:cf604sc3425? It seems like it has a bad REF in it that is pointing to an incorrectly formatted internal id.

edsu commented 2 years ago

Sure, I can share it with you via email (just drop me an email at ehs@pobox.com and I can reply with it).

In other news, my find command finished overnight. It looks like the datastream is here:

datastreamStore/c7/info%3Afedora%2Fdruid%3Acf604sc3425%2FrightsMetadata%2FrightsMetadata.0

But decoding that path yields a (slightly) different id:

info:fedora/druid:cf604sc3425/rightsMetadata/rightsMetadata.0

This uses / where the id in the log message had +?

ERROR 05:05:41.716 (Migrator) MIGRATION_FAILURE: pid="druid:cf604sc3425", message="Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!"
java.lang.RuntimeException: Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!
pwinckles commented 2 years ago

Thanks for the FOXML. It contains numerous content locations likes:

<foxml:contentLocation TYPE="INTERNAL_ID" REF="druid:cf604sc3425+rightsMetadata+rightsMetadata.0"/>

Based on things you've said elsewhere, it seems like your organization may have been manipulating the FOXML out of band? Did Fedora 3 construct those references or did some other process? Fedora internal ids should always be prefixed with info:fedora/, so it's not clear to me why they aren't here.

edsu commented 2 years ago

Yes, hand editing has definitely been possible. But the weird thing is that 999 of the 1000 objects I tested with had the same problem. Maybe I should sample a different range of PIDs, but I suspect it will be consistent. I can double check that Fedora is able to return these datastreams ok.

edsu commented 2 years ago

I just tested with 50,000 objects and 49,995 of them have this same problem. So at least it is (mostly) consistent. Fedora itself is able return the datastream. I'm assuming it uses the Object XML to do that, but I also know there is a MySQL database so perhaps not.

I'll inquire to see if we've modified our Fedora instance in some way. But I'm new and a lot of this work was done a long time ago, so ... I'm assuming you've had others use the Akubra migration successfully?

pwinckles commented 2 years ago

Yes, numerous people have successfully done an akubra migration. This is the first time I've seen this problem. Without doing a code dive, I'm not sure how your Fedora instance is able to retrieve those datastreams. I've asked some people who are more knowledgeable about Fedora 3 to take a look at this ticket.

sprater commented 2 years ago

<foxml:contentLocation TYPE="INTERNAL_ID" REF="druid:cf604sc3425+rightsMetadata+rightsMetadata.0"/> is just the location where the data can be found on export, and is there for reingest when ingesting this exported FOXML. It has no bearing on the actual location of the data within the Fedora filesystem.

I'm suspecting that there may be some kind character encoding gotcha buried here in the migration-utils code.

Can you create a sample set of 5 or so PIDs to migrate, run the migrator on the sample set of PIDs with the debug flag turned on, and post the complete log file somewhere?

pwinckles commented 2 years ago

Okay, so it looks like this is the code that's being run when it constructs the index:

https://github.com/fcrepo-exts/migration-utils/blob/a14d64c4d35afa134c1466b595fe0f2531136465/src/main/java/org/fcrepo/migration/foxml/AkubraFSIDResolver.java#L42-L56

Which produces druid:cf604sc3425+rightsMetadata+rightsMetadata.0 when given the input info%3Afedora%2Fdruid%3Acf604sc3425%2FrightsMetadata%2FrightsMetadata.0. So, now I'm really confused, because that matches the id in your REF, so I would expect it to be able to find the datastream in the index.

pwinckles commented 2 years ago

If you enabled trace logging you could see what it was putting in the index:

https://github.com/fcrepo-exts/migration-utils/blob/a14d64c4d35afa134c1466b595fe0f2531136465/src/main/java/org/fcrepo/migration/foxml/DirectoryScanningIDResolver.java#L149

edsu commented 2 years ago

Attached is the debug log when running against 5 objects with the following command:

java -jar migration-utils-6.1.0-driver.jar \
  --source-type akubra \
  --datastreams-dir /home/lyberadmin/apps/fedora/home/data/datastreamStore \
  --objects-dir /home/lyberadmin/apps/fedora/home/data/objectStore \
  --target-dir ocfl \
  --limit 5 \
  --continue-on-error \
  --debug \
  > convert.log

Is there an easy way to enable trace logging?

convert.log

edsu commented 2 years ago

Oh, I bet what is happening is the first time I ran it with --source-type legacy and it created the index. Then I ran it again with --source-type akubra and it skipped index creation since it already existed (it took hours to build). So even though it is looking for akubra the index is built with legacy?

sprater commented 2 years ago

Yes, I was about to ask if you cleared the index. Try that, then run with akubra.

pwinckles commented 2 years ago

Ah! That makes sense. Yes, I bet that's what the problem is.

edsu commented 2 years ago

It was the problem! Thanks for helping debug and sanity check this @pwinckles & @sparter -- it is greatly appreciated!