duraspace / pcdm

Portland Common Data Model
http://pcdm.org/models
Apache License 2.0

File use vocabulary using RDF types #7

Closed escowles closed 9 years ago

escowles commented 9 years ago

Alternative to #5

azaroth42 commented 9 years ago

Looks good to me.

escowles commented 9 years ago

@acoburn , @kestlund: this alternate approach came out of discussions today in the Sufia/PCDM sprint. What do you think of making the different use values subclasses of pcdm:File?
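As a rough illustration of the subclass idea, the sketch below models each use value as a class that specializes pcdm:File and then infers the superclass for a typed file. It uses plain Python tuples rather than an RDF library, the `use:` prefix is a placeholder, and the class names are simply the ones mentioned in this thread, so treat all of them as assumptions rather than the final vocabulary.

```python
# Sketch only: triples are plain tuples and "use:" is a placeholder prefix.
SUBCLASS_OF = "rdfs:subClassOf"

# Each use value is modeled as a class that specializes pcdm:File.
USE_CLASSES = [
    "use:OriginalFile",
    "use:ServiceFile",
    "use:Thumbnail",
    "use:ExtractedText",
    "use:Transcript",
]
SCHEMA = {(cls, SUBCLASS_OF, "pcdm:File") for cls in USE_CLASSES}


def inferred_types(asserted, schema):
    """Return the asserted types plus every superclass reachable via rdfs:subClassOf."""
    result = set(asserted)
    frontier = list(asserted)
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in schema:
            if subj == current and pred == SUBCLASS_OF and obj not in result:
                result.add(obj)
                frontier.append(obj)
    return result


# A file typed only as a thumbnail is still recognizable as a pcdm:File.
thumb_types = inferred_types({"use:Thumbnail"}, SCHEMA)
print(sorted(thumb_types))
```

With classes, a client that only understands pcdm:File can still treat a typed file generically, which is harder to express when the use values are named individuals.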

ruebot commented 9 years ago

:+1:

#7 > #5

...that aside. Have we decided on a procedure for merging (noticing that there are 5 open pull requests)? Tyranny of thumbs?

kestlund commented 9 years ago

Oh, oh, I like! +1

escowles commented 9 years ago

We don't have a procedure, which is why nothing's merged! I'm happy to merge when there are a couple of :+1:'s and no :-1:'s -- does that sound like a good rule of, well, thumb?

acoburn commented 9 years ago

@escowles using classes here makes a lot more sense to me than named individuals.

:+1: Looks great!

scossu commented 9 years ago

What is the difference between ExtractedText and Transcript? The examples given seem to describe the same or very similar use cases to me.

Also, may I suggest the FADGI [1] nomenclature:

- PreservationFile -> ArchivalMasterFile or PreservationMasterFile
- ServiceFile -> ProductionMasterFile
- ??? -> DerivativeFile (should be superclass of Thumbnail and have a property referring to the derivation source)

[1] http://www.digitizationguidelines.gov/

escowles commented 9 years ago

@scossu, there is some relevant discussion in the comments of the wiki page:

https://wiki.duraspace.org/display/hydra/File+Use+Vocabulary

Basically, extractedText is used for retrieval (to populate fulltext indexes), and is typically automatically generated. Transcript is used for display, and typically involves human effort (transcripts, subtitles, closed captions).

I took a quick look at the digitizationguidelines.gov terms, and they seem like a good fit. However, the terms are written from a digitization perspective, so they don't really include the notion of a born-digital original file, and they state that derivative files are generated from the preservation master file rather than directly from the original file.

I would also expect our Thumbnail to map to ThumbnailImage, and ServiceFile would map more closely to DerivativeFile (service files generally have lower quality to reduce file size). I avoided the term "derivative" here, though, because the service file is sometimes the original file. For example, if a user deposits a JPEG, it would be both the service file and the original file.

So "PreservationMasterFile" and "ThumbnailImage" seem like good terms to use. But DerivativeFile seems like a narrower sense than ServiceFile to me.
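To make the deposited-JPEG case concrete, here is a minimal sketch in which one file carries both the original and service types. The file names, the `use:` prefix, and the dictionary layout are hypothetical and only meant to show that typing by class allows several uses on one file.

```python
# Hypothetical file descriptions: a set of type names per file identifier.
files = {
    "files/deposit.jpg": {"pcdm:File", "use:OriginalFile", "use:ServiceFile"},
    "files/thumb.jpg": {"pcdm:File", "use:Thumbnail"},
}


def files_with_use(files, use_type):
    """Return the sorted names of all files that carry the given type."""
    return sorted(name for name, types in files.items() if use_type in types)


# The deposited JPEG is found under both uses without being stored twice.
print(files_with_use(files, "use:OriginalFile"))
print(files_with_use(files, "use:ServiceFile"))
```

This is also why "derivative" would be a misleading name for the service role here: the same file can play that role without having been derived from anything.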

jpstroop commented 9 years ago

:+1:

scossu commented 9 years ago

@escowles, it seems like we resolved the ExtractedText and Transcript issue in the wiki page.

Regarding nomenclature, I was suggesting mapping ServiceFile to what FADGI defines as the Production Master file [1]. This can either coincide with the original (in which case a resource would have both types and the Object would have both hasProductionMasterFile and hasOriginalFile pointing to it) or be an optimized version for creating derivatives (in which case it would be a special case of a derivative of the original file). Does that make sense?

[1] http://www.digitizationguidelines.gov/term.php?term=productionmasterfile

escowles commented 9 years ago

I don't think Production Master File is a good fit for Service File. The Production Master File seems to differ from the Archival Master File in small ways (cropping, color correcting, reducing noise, combining segments, etc.) but is still a full-quality file. Service Files, on the other hand, are lower-quality derivatives in web-appropriate formats. So for images, our Archival Master File would be a 4000px TIFF, and our Service File would be a 1024px JPEG.

scossu commented 9 years ago

I guess I was interpreting the FADGI guidelines a bit freely - if I take the sentence

> a new file or files with levels of quality that rival those of the archival master

literally, then I agree that a ServiceFile as you describe it does not correspond to a Production Master file.

Are you planning to use the service file for further derivative generation or for direct use, or both?

The issue is that I have a >1 GB original image and a similarly large preservation copy, and I need one or more intermediate-size files to generate small and large derivatives and thumbnails, or to feed an IIPImage server.

If you have the same use case in mind, I agree that Service File is a good candidate.

escowles commented 9 years ago

We use Service Files for serving to the public. I've also seen them called "access files" or "web files".

We do have a small number of large original files where we have generated intermediate files (e.g., very large map TIFFs), but we have only a handful of these, so that's been a manual process and we didn't add them to the repository. The FADGI definition of archival master file does mention intermediate files, but there is no term for them.

I'm not sure about the kind of hierarchical modeling of original file constituent parts. The complexity there sounds more like a pcdm:Object than the files, which we've been treating as a flat list. Of course, you could attach all of the constituent parts, and link between them to record the hierarchy. At UCSD, we've also packaged up original files into a TAR file in cases like this, so we could treat them as a unit even though they were originally a directory tree.
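For the TAR approach, a minimal sketch using Python's standard `tarfile` module might look like the following. The directory layout and file names are invented for the example, and this is not the actual UCSD workflow, only one way to treat a directory tree as a single unit.

```python
import tarfile
import tempfile
from pathlib import Path


def package_tree(source_dir, archive_path):
    """Write source_dir and everything under it into an uncompressed TAR file."""
    source_dir = Path(source_dir)
    with tarfile.open(str(archive_path), "w") as tar:
        # arcname keeps member paths relative to the top-level directory name.
        tar.add(str(source_dir), arcname=source_dir.name)
    return archive_path


def list_members(archive_path):
    """Return the file paths stored in the archive (directories excluded)."""
    with tarfile.open(str(archive_path), "r") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())


# Build a tiny, made-up directory tree and package it as one file.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "images").mkdir(parents=True)
    (project / "main.doc").write_text("placeholder")
    (project / "images" / "figure1.tif").write_text("placeholder")
    archive = package_tree(project, Path(tmp) / "project.tar")
    members = list_members(archive)
    print(members)
```

The archive preserves the relative paths, so the original directory structure can be restored on extraction even though the repository only sees one file.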

scossu commented 9 years ago

Re. Service files: at AIC we have one or more intermediate files for each asset, which are used to automatically generate derivatives, so we need to formally define them.

That is why I was wondering if we can take some freedom here and use the Production Master definition which, aside from the definition of "rivaling archival master quality", seems to match the functional characteristics of the intermediate files. Or just make up our own definition that is independent from the FADGI guidelines.

scossu commented 9 years ago

Re. original file: it still seems more appropriate to me to have an actual File as OriginalFile rather than a container. The "aggregator" would still be a file after all (the Word/InDesign/Premiere file).

The archive solution is probably the best for preservation and simplicity purposes. The only downside is that you would have to download the whole archive if you want to access one of the source files. It sounds like a minor disadvantage compared to the convenience of having the whole folder and file structure in its original form though.

escowles commented 9 years ago

I think the term "Production Master File" would be confusing, but I'm open to adding a term for intermediate files. Would creating a broad term like "DerivativeFile" work? You could use it by itself when the derivative file is made as an intermediate file. But it could also be used in combination with Service or Thumbnail files when they are also derivatives.

There are other LDP implementations that do allow creating child resources under files (I think Marmotta allows this). And there has been some discussion of doing this in Fedora 4 (to allow Fixity reports to be contained in the file's description). So that may be possible at some point.

The other option would be to model the original files as a separate pcdm:Object, and then link the access object to the source object. This is more overhead than most implementers would want, but if you need to access and download individual files from the original fileset, then it could be a good pattern to use.

scossu commented 9 years ago

What do you find confusing about that name? I am not sure how this conflicts with other contexts, but if it does, I think it is OK to diverge from the FADGI guidelines.

Technically the intermediate file is a derivative, but it has a special use so it would be nice to subclass it from DerivativeFile.

Creating children of pcdm:File seems to be more useful to recreate a structure such as a TIFF file and its embedded metadata and icons. In this case, if you delete the parent, these would be deleted too. But in a linked project file that might not be the case - actually, a source file might be used by more than one project file, so a hierarchy would not work.

As you mentioned before, the (compressed) archive option sounds the simplest to implement in the short term until we figure out all the use cases and a solution for multiple sources.

escowles commented 9 years ago

I think "production" and "master" are both problematic. "Production" is the standard name for a software lifecycle/environment stage (as opposed to development/test). "Master" is a problem because the file is a derivative, and the "preservation master file" also has "master" in the name, so it becomes less clear which is the "real" master file.

I think "IntermediateFile" is much clearer, and a better representation of the use case we both have: needing to generate a high quality file as a step in processing workflows. For the definition, we could use: "High quality representation of the Object appropriate for generating derivatives or other additional processing".
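A small sketch of how a processing workflow might use such a type: pick the file to generate derivatives from, preferring an IntermediateFile when one exists. The fallback order, the `use:` prefix, and the file names are assumptions made for illustration and are not part of the proposed definition.

```python
# Assumed preference order for choosing a derivative source; not part of the proposal.
SOURCE_PREFERENCE = ["use:IntermediateFile", "use:PreservationFile", "use:OriginalFile"]


def pick_derivative_source(files):
    """Return the name of the first file matching the preference order, or None."""
    for wanted in SOURCE_PREFERENCE:
        for name in sorted(files):
            if wanted in files[name]:
                return name
    return None


# Hypothetical object with a very large original and a smaller intermediate file.
large_map = {
    "map_original.tif": {"pcdm:File", "use:OriginalFile"},
    "map_intermediate.tif": {"pcdm:File", "use:IntermediateFile"},
    "map_web.jpg": {"pcdm:File", "use:ServiceFile"},
}
# Hypothetical object where the deposited JPEG is the only file.
simple_deposit = {
    "photo.jpg": {"pcdm:File", "use:OriginalFile", "use:ServiceFile"},
}

print(pick_derivative_source(large_map))
print(pick_derivative_source(simple_deposit))
```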

ruebot commented 9 years ago

"IntermediateFile" :+1:

scossu commented 9 years ago

@escowles : agreed. +1

escowles commented 9 years ago

OK, I've added IntermediateFile, so I think this is ready to merge now!

scossu commented 9 years ago

Are we still considering Derivative and IntermediateFile as its subclass?

azaroth42 commented 9 years ago

Hasty: I suggest that this is a different issue to the PR, and should be moved to an appropriate, more visible location.
Less Hasty: I suggest that future discussions of this nature be expressed first as issues that can be referred to later, and the requirements clarified. Comments on PRs are much less visible and transparency is important. If there's nothing wrong with the PR, further additions can be added later and not hold up the initial process.

awoods commented 9 years ago

I will go ahead and squash/merge unless you want any of the interim commits to remain separate.

escowles commented 9 years ago

No, go ahead and squash away!

awoods commented 9 years ago

Resolved with: https://github.com/duraspace/pcdm/commit/658aa82ea7fc234b7710681af2a4bbb5e9c75b79