Closed edsu closed 2 years ago
@edsu It looks like what's happening here is that it's looking for a datastream with the id druid:cf604sc3425+rightsMetadata+rightsMetadata.0
but can't find it. Can you confirm that that datastream exists in /home/lyberadmin/apps/fedora/home/data/datastreamStore
?
Yes, you're right. That file doesn't exist in datastreamStore
. I'm just remembering now that I'm working with a possibly partial copy of our production Fedora instance. So I suspect this is a problem with the copy and not migration-utils.
I just noticed that I might be looking in the wrong directory. I noticed that the object file was located in:
objectStore/6f/info%3Afedora%2Fdruid%3Acf604sc3425
I was assuming it would be in objectStore/cf
because that's how the pid started? I thought maybe the druid:cf604sc3425+rightsMetadata+rightsMetadata.0
file would likewise be in:
datastreamStore/6f/druid:cf604sc3425+rightsMetadata+rightsMetadata.0
But I didn't see it there. The filesystem is quite large to search, do you happen to know what directory I should look in based on the PID?
@edsu I see that you ran your migration command with --source-type legacy
. If your repository is using the legacy format, then I would expect to see date based paths like 2009/0508/20/48
within your datastreams directory. Is this what you see? If not, are you using akubra? The akubra layout is based on the md5 hash of the object PID plus the datastream id.
Nice catch, I see sub-directories named with two character hex values 00-FF. I guess this is akubra? I still get the same Unable to resolve internal ID
errors when running with --source-type akubra
. I guess I will need to search the filesystem to see if the file is there, and then see if it is being looked up correctly by migration-utils?
So, the code does a full scan of your datastreams and creates an index that maps id to location, which is why it would still fail even after passing it the akubra
source type.
This is the Fedora 3 code that maps to akubra paths: https://github.com/fcrepo3/fcrepo/blob/37df51b9b857fd12c6ab8269820d406c3c4ad774/fcrepo-server/src/main/java/org/fcrepo/server/storage/lowlevel/akubra/HashPathIdMapper.java
If the internal id druid:cf604sc3425+rightsMetadata+rightsMetadata.0
is correct (I don't know if it is or not), then I would try looking under e0/80
Hmm, I only have one level of sub-directories under datastreamStore
. I need to double check with people more knowledgeable than me to see if we have a customized Fedora Store.
@edsu The format is configurable. See: https://wiki.lyrasis.org/display/FEDORA38/Configuring+Low+Level+Storage#ConfiguringLowLevelStorage-AkubraLowLevelStorage(New)
Look for a line like this in your config: https://github.com/fcrepo3/fcrepo/blob/37df51b9b857fd12c6ab8269820d406c3c4ad774/fcrepo-server/src/main/resources/fcfg/server/akubra-llstore.xml#L62
I guess the default is just one level, ##
. So, perhaps it's under e0/
?
Alas it's not under e0
. I started a search across the whole tree to see where it might be.
In the meantime I noticed that the datastreamStore
files seem to have a info:fedora prefix (this is just a list of the first 5 files in the 00
directory):
$ ls 00 | head -5
info%3Afedora%2Fdruid%3Abb001hh3574%2FprovenanceMetadata%2FprovenanceMetadata.0
info%3Afedora%2Fdruid%3Abb006gm6059%2FrightsMetadata%2FrightsMetadata.0
info%3Afedora%2Fdruid%3Abb006hs1829%2FprovenanceMetadata%2FprovenanceMetadata.0
info%3Afedora%2Fdruid%3Abb009mb1479%2FidentityMetadata%2FidentityMetadata.0
info%3Afedora%2Fdruid%3Abb011vh2231%2FprovenanceMetadata%2FprovenanceMetadata.0
And that 00
are the first two characters of the md5 of the URL decoded filename:
>>> from urllib.parse import unquote
>>> from hashlib import md5
>>> md5(unquote('info%3Afedora%2Fdruid%3Abb001hh3574%2FprovenanceMetadata%2FprovenanceMetadata.0').encode('utf8')).hexdigest()
'00345d15cdddf9a509fd0666599fdd18'
Is it possible that migration-utils is looking up the id without the info:fedora
prefix and not finding it because the md5 that was calculated is incorrect? Should I be doing something with --id-prefix
?
@edsu Yes, the path is based on the md5 hash of the identifier. You are correct that it is supposed to start with info:fedora
. Can you post the FOXML for druid:cf604sc3425
? It seems like it has a bad REF
in it that is pointing to an incorrectly formatted internal id.
Sure, I can share it with you via email (just drop me an email at ehs@pobox.com and I can reply with it).
In other news, my find command finished overnight. It looks like the datastream is here:
datastreamStore/c7/info%3Afedora%2Fdruid%3Acf604sc3425%2FrightsMetadata%2FrightsMetadata.0
But decoding that path yields a (slightly) different id:
info:fedora/druid:cf604sc3425/rightsMetadata/rightsMetadata.0
This uses /
where the id in the log message had +
?
ERROR 05:05:41.716 (Migrator) MIGRATION_FAILURE: pid="druid:cf604sc3425", message="Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!"
java.lang.RuntimeException: Unable to resolve internal ID "druid:cf604sc3425+rightsMetadata+rightsMetadata.0"!
Thanks for the FOXML. It contains numerous content locations likes:
<foxml:contentLocation TYPE="INTERNAL_ID" REF="druid:cf604sc3425+rightsMetadata+rightsMetadata.0"/>
Based on things you've said elsewhere, it seems like your organization may have been manipulating the FOXML out of band? Did Fedora 3 construct those references or did some other process? Fedora internal ids should always be prefixed with info:fedora/
, so it's not clear to me why they aren't here.
Yes, hand editing has definitely been possible. But the weird thing is that 999 of the 1000 objects I tested with had the same problem. Maybe I should sample a different range of PIDs, but I suspect it will be consistent. I can double check that Fedora is able to return these datastreams ok.
I just tested with 50,000 objects and 49,995 of them have this same problem. So at least it is (mostly) consistent. Fedora itself is able return the datastream. I'm assuming it uses the Object XML to do that, but I also know there is a MySQL database so perhaps not.
I'll inquire to see if we've modified our Fedora instance in some way. But I'm new and a lot of this work was done a long time ago, so ... I'm assuming you've had others use the Akubra migration successfully?
Yes, numerous people have successfully done an akubra migration. This is the first time I've seen this problem. Without doing a code dive, I'm not sure how your Fedora instance is able to retrieve those datastreams. I've asked some people who are more knowledgeable about Fedora 3 to take a look at this ticket.
<foxml:contentLocation TYPE="INTERNAL_ID" REF="druid:cf604sc3425+rightsMetadata+rightsMetadata.0"/>
is just the location where the data can be found on export, and is there for reingest when ingesting this exported FOXML. It has no bearing on the actual location of the data within the Fedora filesystem.
I'm suspecting that there may be some kind character encoding gotcha buried here in the migration-utils code.
Can you create a sample set of 5 or so PIDs to migrate, run the migrator on the sample set of PIDs with the debug flag turned on, and post the complete log file somewhere?
Okay, so it looks like this is the code that's being run when it constructs the index:
Which produces druid:cf604sc3425+rightsMetadata+rightsMetadata.0
when given the input info%3Afedora%2Fdruid%3Acf604sc3425%2FrightsMetadata%2FrightsMetadata.0
. So, now I'm really confused, because that matches the id in your REF
, so I would expect it to be able to find the datastream in the index.
If you enabled trace logging you could see what it was putting in the index:
Attached is the debug log when running against 5 objects with the following command:
java -jar migration-utils-6.1.0-driver.jar \
--source-type akubra \
--datastreams-dir /home/lyberadmin/apps/fedora/home/data/datastreamStore \
--objects-dir /home/lyberadmin/apps/fedora/home/data/objectStore \
--target-dir ocfl \
--limit 5 \
--continue-on-error \
--debug \
> convert.log
Is there an easy way to enable trace logging?
Oh, I bet what is happening is the first time I ran it with --source-type legacy
and it created the index. Then I ran it again with --source-type akubra
and it skipped index creation since it already existed (it took hours to build). So even though it is looking for akubra the index is built with legacy?
Yes, I was about to ask if you cleared the index. Try that, then run with akubra.
Ah! That makes sense. Yes, I bet that's what the problem is.
It was the problem! Thanks for helping debug and sanity check this @pwinckles & @sparter -- it is greatly appreciated!
I tried running with this command on a 1000 of our objects:
I suspect this is the result of our idosyncratic use of Fedora, but 999/1000 of the objects failed with this error:
Is there any obvious solution to this?