OCFL / spec

The Oxford Common File Layout (OCFL) specifications and website
https://ocfl.io

Make object ID accessible from a small file without having to parse inventory #579

Closed by ptsefton 1 year ago

ptsefton commented 2 years ago

I thought this had been discussed before but I can’t find an issue.

It would be good to be able to find an OCFL Object’s ID without having to load the inventory, which could be an expensive operation. For example, when writing a library you might want to return a list of Object IDs so they can be consumed, e.g. by an indexer. At present, to get the ID you have to parse the inventory and pass the ID to another process, which then also has to parse the inventory to use the object.

Could we just have an id file with the ID string in it (or maybe other useful metadata, without the potentially large parts)?

pwinckles commented 2 years ago

I agree that it would be nice to have a less expensive way to get an object's id. I currently use a regex to get this in rocfl so I don't have to parse the entire inventory when I just want the id.
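
A minimal sketch of the regex idea, in Python. This is not rocfl's actual implementation; the function name `quick_object_id` and the pattern are assumptions for illustration. It relies on the heuristic that the first `"id"` key found in the inventory text is the object's top-level `id` member.

```python
import json
import re

# Hypothetical pattern: an "id" key followed by a JSON string value.
# The value group tolerates backslash escapes inside the string.
_ID_RE = re.compile(r'"id"\s*:\s*("(?:[^"\\]|\\.)*")')


def quick_object_id(inventory_text):
    """Return the object id found in inventory JSON text, or None.

    Heuristic only: assumes the first "id" key in the text is the
    top-level object id. It does not validate the inventory.
    """
    match = _ID_RE.search(inventory_text)
    if match is None:
        return None
    # Decode the matched JSON string literal so escapes are handled.
    return json.loads(match.group(1))
```

Because the pattern is searched rather than parsed, it can also be run over just the first chunk of a large inventory.json, falling back to a full parse when no match is found.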

zimeon commented 2 years ago

If we were to follow on from the use of NAMASTE to specify the type of an object (Conformance Declaration, 0=ocfl_object_1.0) then we could use that again to allow the identifier in a 4=identifier_here NAMASTE file. See https://confluence.ucop.edu/download/attachments/14254149/NamasteSpec.pdf . I think I'd lean toward it being optional because, depending on the storage approach, an extra file for easy access to the id may or may not be considered a worthwhile optimization.

ptsefton commented 2 years ago

@zimeon if you use a NAMASTE file then the identifier-here part would be problematic as it would need to be encoded, and might run into filename limits etc. I think it would be more practical to break out the fixed-size metadata in the inventory from the manifest and version stuff which is potentially quite large. Something like metadata.json.

We have discussed a short-term solution to this, storing an id.json or metadata.json file in the logs directory pending a decision for the OCFL object spec itself.

pwinckles commented 2 years ago

Yes, given that object ids should be URIs, they would need to be encoded if you wanted to use namaste.
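
To illustrate the encoding and length concern, here is a rough Python sketch. Percent-encoding and the 255-byte limit are assumptions chosen for illustration (the NAMASTE spec may prescribe a different transformation, and filename limits vary by filesystem); the function names are hypothetical.

```python
from urllib.parse import quote, unquote

# Assumed limit; many filesystems cap a single filename at 255 bytes.
MAX_FILENAME_BYTES = 255


def namaste_id_filename(object_id):
    """Build a hypothetical '4=<encoded id>' filename.

    Percent-encodes every reserved character (including '/' and ':')
    so that a URI can appear in a single path segment.
    """
    return "4=" + quote(object_id, safe="")


def fits_filename_limit(name):
    """Check the encoded name against the assumed byte limit."""
    return len(name.encode("utf-8")) <= MAX_FILENAME_BYTES


def decode_namaste_id(filename):
    """Recover the object id from a '4=' filename built above."""
    return unquote(filename[len("4="):])
```

Each reserved character grows to three bytes when encoded, so a long URI can exceed the limit even when the raw identifier would fit.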

@ptsefton It might be more fitting to use an extension in the short-term.

ptsefton commented 2 years ago

@pwinckles Can an extension do things like add an extra file to the object root?

Spec says "The OCFL Object Root must not contain files or directories other than those specified in the following sections."

pwinckles commented 2 years ago

No, you'd just do like you described with the logs dir. So, you'd write the file to extensions/NNNN-object-meta/metadata.json, or whatever you want to call it. The advantage is that it would be formalized and generally usable in 1.0. Whereas the logs dir solution can't really be used by anyone else.

ptsefton commented 2 years ago

Just to clarify, in an object the path to the new metadata file would be ./logs/extensions/NNNN-object-meta/metadata.json relative to the object root, and we would document the extension in the repository extensions directory?

pwinckles commented 2 years ago

No, I was suggesting putting it in the object's extensions directory, so it would be something like ./extensions/NNNN-object-meta/metadata.json.
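
A minimal Python sketch of how a client might use such a file, under stated assumptions: no such extension is defined in this thread, `NNNN-object-meta` is the placeholder name used above, and the `{"id": ...}` content of metadata.json is a hypothetical format. The reader falls back to the inventory, which remains the authoritative source.

```python
import json
from pathlib import Path

# Placeholder directory name taken from the discussion; not a registered extension.
EXTENSION_DIR = "NNNN-object-meta"


def write_object_meta(object_root, object_id):
    """Write a small metadata.json holding only the object id (assumed format)."""
    meta_dir = Path(object_root) / "extensions" / EXTENSION_DIR
    meta_dir.mkdir(parents=True, exist_ok=True)
    meta_path = meta_dir / "metadata.json"
    meta_path.write_text(json.dumps({"id": object_id}), encoding="utf-8")
    return meta_path


def read_object_id(object_root):
    """Read the id from the small file if present, else parse inventory.json."""
    root = Path(object_root)
    meta_path = root / "extensions" / EXTENSION_DIR / "metadata.json"
    if meta_path.is_file():
        return json.loads(meta_path.read_text(encoding="utf-8"))["id"]
    # Fallback: the full inventory parse that the small file is meant to avoid.
    inventory = json.loads((root / "inventory.json").read_text(encoding="utf-8"))
    return inventory["id"]
```

Since the small file duplicates a value from the inventory, a real extension would also need to say what happens when the two disagree.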

neilsjefferies commented 1 year ago

There are a variety of ways of approaching this problem. A lot involve some form of caching with no intrinsic OCFL spec changes. Therefore we think this is best kept as an optional extension.

zimeon commented 1 year ago

Editors' discussion 2023-09-22: Per https://github.com/OCFL/spec/issues/579#issuecomment-1414006113 we think this is best addressed by either application caching or an extension.