duraspace / pcdm

Portland Common Data Model
http://pcdm.org/models
Apache License 2.0

Clarify a FileSet's membership #57

Closed awead closed 7 years ago

awead commented 7 years ago

A FileSet is a member of only one Object. So, Object:FileSet is 1:M and not M:M. I think this needs to be explicitly stated.
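
A minimal sketch of how that constraint might be written down in OWL. It assumes the proposed pcdm:FileSet class ends up in the core namespace and that membership is asserted with pcdm:memberOf; neither is settled in this thread.

```turtle
@prefix pcdm: <http://pcdm.org/models#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Assumption: pcdm:FileSet is the class proposed for PCDM 2.0.
# A FileSet is a member of at most one pcdm:Object.
pcdm:FileSet rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty pcdm:memberOf ;
    owl:maxQualifiedCardinality "1"^^xsd:nonNegativeInteger ;
    owl:onClass pcdm:Object
] .
```

Under OWL's open-world semantics a reasoner would treat two parent Objects as the same resource rather than report an error, so a statement like this documents the intent more than it validates data.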

dannylamb commented 7 years ago

I don't want a FileSet to have to live as a member of an Object. I have many use cases where a free-standing FileSet with descriptive metadata is enough. It will not have child objects nor alternate filesets. A FileSet is a very useful concept, but I do not want to enforce the works structure.

dannylamb commented 7 years ago

^^ Just a clarification, this only pertains to FileSet getting brought in with '2.0'

scossu commented 7 years ago

As per discussion in #59, in 2.0 a FileSet should be an independent entity with M:M relationships with Objects or Collections.

mjgiarlo commented 7 years ago

Hey, @dannylamb, would you mind sharing some of those use cases?

dannylamb commented 7 years ago

We have to provide off the shelf implementations for smaller, less complex objects like simple images or an audio stream. Essentially just a pres master, one or more derivatives, and some descriptive metadata.

<> a pcdm:FileSet
<> pcdm:hasFile <preservationMasterUri>
<> pcdm:hasFile <thumbnailUri>
<> pcdm:hasFile <serviceFileUri>
<> dc:title "Some title" 
<> dc:description "blah blah" 
... more descriptive metadata ...

<preservationMasterUri> a pcdmuse:PreservationMasterFile
<preservationMasterUri> ebucore:hasFileName "preservationMaster.jpeg"
... more technical md ...

<thumbnailUri> a pcdmuse:ThumbnailImage
<thumbnailUri> ebucore:hasFileName "thumbnail.jpeg"
... more technical md ...

<serviceFileUri> a pcdmuse:ServiceFile
<serviceFileUri> ebucore:hasFileName "serviceFile.jpeg"
... more technical md ...

Lather, rinse, repeat for audio, video, generic binaries, etc...

I only want to use a pcdm:Object when there's a need to aggregate other objects or filesets. This comes into play for books, newspapers, serials, etc...

mjgiarlo commented 7 years ago

@dannylamb OK, thanks for sharing that use case! Let me poke at this a little more, as someone who, like you, also started out somewhat skeptical of the proposed PCDM structure.

In your use case, do you see the FileSet as a mere grouping of files or also as a surrogate for an "intellectual object," so to speak? The fact that you might be asserting a bunch of descriptions about the FileSet makes me wonder whether you're coercing the two different concepts together.

Back when we collectively started down the road of implementing PCDM, which was important and timely for me as a developer of Sufia and of Penn State's ScholarSphere repository, I was initially resistant to the idea of needing the two-level (Object/Work & FileSet) hierarchy despite agreeing that ultimately we're after developing a common data model that allows interoperability across a wide variety of content types, use cases, and repository systems.

Viewing this through the cultural heritage-oriented lens of long-term stewardship, though, I wonder if conflating the concepts of a file bundle and an intellectual entity sets us up for headaches down the road. Assume if you will that your content will be aggregated and re-used by other folks, and assume that your content will be built upon and added to fifteen or fifty years from now. If someone wants to add a file to the intellectual entity which is hidden inside the FileSet, the task is more complex than adding a FileSet to an existing Object -- someone would need to know the FileSet is more than just a file bundle, create a new object (with a brand new URI that did not previously exist on the linked open data web), copy the object's descriptive metadata from the FileSet to the object, etc.
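
The difference in effort can be sketched in Turtle. The ex: URIs are hypothetical, the dc: prefix is assumed to be Dublin Core Elements 1.1, and pcdm:FileSet is the class proposed for 2.0 rather than an existing core term.

```turtle
@prefix pcdm: <http://pcdm.org/models#> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix ex:   <http://example.org/> .

# Two-level model: the intellectual entity already has its own URI,
# so adding files later only means attaching one more FileSet.
ex:photo1 a pcdm:Object ;
    dc:title "Some title" ;
    pcdm:hasMember ex:photo1-scan ,
                   ex:photo1-rescan .   # added years later

ex:photo1-scan   a pcdm:FileSet .
ex:photo1-rescan a pcdm:FileSet .

# Conflated model: ex:photo1-scan alone would carry dc:title, and the
# same addition would first require minting ex:photo1 and moving the
# descriptive metadata onto it.
```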

I don't believe the above is science fiction. Rather, I believe this is the world we are all hoping to build together. What is the overriding reason for encouraging divergence here? Are there specific performance, modeling, or technical reasons for not wanting the extra layer?

dannylamb commented 7 years ago

@mjgiarlo Thanks for the insight into your community's reasoning. And I want to make clear that if FileSet were to remain in the Works extension, then I have no issue with your interpretation of the aggregation. But if you are to import it into the core ontology, I must flatly reject any interpretation beyond restricting the range of ore:aggregates to files. I want the core ontology to be open to extension and refinement, just as you are doing now. And that means with as few rules as possible. By merging the full restrictions imposed by the Works extension into the core ontology, PCDM will be converted into Hydra's application profile. And this will absolutely discourage the use of PCDM outside of the Hydra community.

Moreover, interoperability through a shared application profile will be highly problematic. Members of this community are not all working on the same piece of software. And deciding for us all to fuse at the hip in the 'Fedora as a database' layer will only grow more painful and costly as our applications inevitably diverge over time. Why would each of our communities set themselves up for unexpected code updates and data migrations because the other requires changes for new features? That raises serious questions of sovereignty for me. I must be given the freedom to support PCDM without allowing a community other than my own to disrupt scheduling or development.

Like you, I know interoperability is a realizable goal. But not by forcing a shared application profile. It's a recipe for disaster. We develop APIs specifically to avoid the pitfalls of that approach. Is it not more in line with the goals of both of our applications to inter-operate through publishing and consuming linked data? Shouldn't we be seeking integration through the semantic web? And then when changes eventually occur, and PCDM 3.0 comes down the pipe, every single Islandora user won't be forced to migrate all of their content in order to remain compatible.

escowles commented 7 years ago

@dannylamb The intention here is not to impose any notions from Hydra or Fedora, but to share an insight we had in the Hydra PCDM implementation (that groups of Files really aren't a pcdm:Object, but their own thing with their own semantics), and try to find common ground with everyone else involved in PCDM to see if it's a useful thing to include in the core ontology. When I wrote up the original PCDM 2.0 document, I was trying to codify the discussions we had at LDCX, where I thought we had a consensus on this approach.

For my part, I was very skeptical of requiring FileSets, and in particular, requiring they be separate from the Page/Component/Part/Whatever they are representing. I agree that most use cases won't have multiple FileSets, and that in practice, you can usually just subsume the group-of-files part into the parent object and be done with it.

But I've been convinced that a FileSet shouldn't be a pcdm:Object — it represents a different thing, with its own creation metadata, and from a completely abstract point-of-view should be a different thing. And there are use cases (including some in my own org's collections that we haven't gotten to yet) where having multiple FileSets representing a single Object will be great. So from a practical standpoint, I'd like a community-supported way of handling that.

And I think the way Files are grouped and linked is too essential to PCDM to have variation. There are a lot of things I'd be happy to have in an extension (e.g., #60, or File Use). If we can't agree on how to support the use cases we have in front of us in a consistent way, then I have a hard time seeing how we go forward with PCDM.

tpendragon commented 7 years ago

Moreover, interoperability through a shared application profile will be highly problematic.

In your opinion, how should we attain interoperability then? What's Islandora's goal for "level of interoperability?" I can't speak for Hydra, but I would want "can write a PCDM ingester, and it gets an Islandora object in Hydra." Is the intended level of interop rather "when we look at this structure, it makes sense"?

And deciding for us all to fuse at the hip in the 'Fedora as a database' layer will only grow more painful and costly as our applications inevitably diverge over time.

I don't think any of us intend for that to be the case.

And then when changes eventually occur, and PCDM 3.0 comes down the pipe, every single Islandora user won't be forced to migrate all of their content in order to remain compatible.

I agree this is a problem, and we're all trying to find the right balance for this too. It seems like Islandora's opinion is that PCDM is done, yes? Objects have Objects, and some of those are collections. If that's the case, then I think we should just gather around that. We lose the ability to automatically crosswalk to other structural schemes (you can only crosswalk without local assumptions if both describe the same levels of structure), but at least PCDM sticks around. So I think my proposal is this:

  1. Add FileSet as a subclass of pcdm:Object. It's not exactly what we want, but it will do. Add the appropriate OWL terms to restrict hasMember in FileSets to pcdm:Files. Close #59
  2. Document intended structures. "PDF", "Postcard", "Book with Pages", and "Book with Pages where both the book as a whole and each page has a File."
  3. Close #60? It seems from things Diego has posted that Islandora is more interested in having two structures for IIIF support - so you'd rather build up an extra set of sequences/structures rather than crosswalk.
  4. Close #63.
  5. Do whatever we need to do in #61 to have machine-readable restrictions on our ontology.

cmharlow commented 7 years ago

thanks for the list of possible action items, @tpendragon.

a few questions:

1. Add FileSet as a subclass of pcdm:Object. It's not exactly what we want, but it will do. Add the appropriate OWL terms to restrict hasMember in FileSets to pcdm:Files. Close #59

So does this idea by @escowles become just a Hydra recommended implementation? And what you propose here, is that what this group is agreeing re:Filesets, i.e.:

2. Document intended structures. "PDF", "Postcard", "Book with Pages", and "Book with Pages where both the book as a whole and each page has a File."

This ties in to the proposed profiles work, IMHO. Happy to email the group to start gathering steam on this with PCDM as it stands now. Might help crystallize stuff being discussed.

3. Close #60? It seems from things Diego has posted that Islandora is more interested in having two structures for IIIF support - so you'd rather build up an extra set of sequences/structures rather than crosswalk.

Are there links to these things posted, or are they in the discussion of other issues? #60 has suspiciously little discussion on the issue thread, so this seems like perhaps it got hammered out via side channels. Just for transparency's sake, it'd be good to link to those on the specific issue thread.

5. Do whatever we need to do in #61 to have machine-readable restrictions on our ontology.

Looking at this in the context of @dannylamb 's comment here:

I want the core ontology to be open to extension and refinement, just as you are doing now. And that means with as few rules as possible. By merging the full restrictions imposed by the Works extension into the core ontology, PCDM will be converted into Hydra's application profile.

I agree with the point that we do not want to add everything in the HydraWorks application profile to PCDM as restrictions, especially not as a group. But a review of various app profiles is a good place to find + single out proposals, implicit understandings, etc. for modeling discussion - IMHO.

For #61, I think we are now clarifying what restrictions everyone does want in PCDM. By discussing the specific cases we do want to (or need to) add, we can get to the how. And this leads to me continuing to ask boring, state of discussion questions - like with point 1.

tpendragon commented 7 years ago

So does this idea by @escowles become just a Hydra recommended implementation? And what you propose here, is that what this group is agreeing re:Filesets....

Yes.

do not use pcdm:hasFile

I'm not proposing that - is there tension around hasFile too?

the range of pcdm:hasMember could be another pcdm:Object, pcdm:Fileset or pcdm:File ... ?

Necessarily, since pcdm:FileSets would remain pcdm:Objects. pcdm:Files aren't subclasses of Object, and have their own predicate - hasFile.
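
Roughly, and only as a sketch of item 1 of the proposal with hypothetical ex: URIs, that would look like this:

```turtle
@prefix pcdm: <http://pcdm.org/models#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# Assumed term from the proposal: FileSet as a kind of Object.
pcdm:FileSet rdfs:subClassOf pcdm:Object .

# Objects and FileSets are linked with hasMember ...
ex:book a pcdm:Object ;
    pcdm:hasMember ex:page1 .

# ... while Files are attached with their own predicate, hasFile.
ex:page1 a pcdm:FileSet ;
    pcdm:hasFile ex:page1-tiff .

ex:page1-tiff a pcdm:File .
```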

Are there links to these things posted, or are they in the discussion of other issues? #60 has suspiciously little discussion on the issue thread, so this seems like perhaps it got hammered out via side channels. Just for transparency's sake, it'd be good to link to those on the specific issue thread.

Good point, I retract my statement. Discussion still needs to happen there.

mjgiarlo commented 7 years ago

💬 @dannylamb

@mjgiarlo Thanks for the insight into your community's reasoning. And I want to make clear that if FileSet were to remain in the Works extension, then I have no issue with your interpretation of the aggregation. But if you are to import it into the core ontology, I must flatly reject any interpretation beyond restricting the range of ore:aggregates to files. I want the core ontology to be open to extension and refinement, just as you are doing now. And that means with as few rules as possible. By merging the full restrictions imposed by the Works extension into the core ontology, PCDM will be converted into Hydra's application profile. And this will absolutely discourage the use of PCDM outside of the Hydra community.

I'm puzzled by your response, Danny. I offered (what I thought were) non-Hydra-specific thoughts about why our communities might not want to conflate notions of file bundles with intellectual entities with an eye towards long-term stewardship and interoperability, asking questions about your use cases. What I am reading in your response is:

  1. An ontology with three entities is a domain model, but an ontology with four entities is a Hydra application profile;
  2. PCDM 2.0 sets a dangerous precedent whereby Hydra folks can force changes on all PCDM implementers;
  3. If PCDM 2.0 goes forward with FileSets as required, we're out.

All are valid concerns worth discussing -- I'm mostly puzzled that the non-Hydra-specific questions I asked elicited them, when what I wanted was greater understanding of where you're coming from re: your use cases, and where we're going as a community of folks adopting a shared data model.

So, in response to the three good issues raised above:

  1. This bears more discussion. This is why we're here discussing it! I'd love to hear more about why y'all are comfortable having file bundles and intellectual entities collapsed together. I honestly don't know very much about the content types ye Islandora folks are building use cases around, and I'm happy to do homework on this if you can point me in the right direction.
  2. That's why we're having these discussions together.
  3. That's why we're here talking about our use cases and rationale.

whikloj commented 7 years ago

@mjgiarlo

This bears more discussion. This is why we're here discussing it! I'd love to hear more about why y'all are comfortable having file bundles and intellectual entities collapsed together.

I think maybe I don't understand the specific metadata that would be stored on the FileSets object if there is only one FileSets object and how this would differ from the metadata that would be stored on either the Object or the actual File.

How do you see that container being of use or maybe what would you store on the FileSet in a single FileSet scenario?

mjgiarlo commented 7 years ago

@whikloj Howdy, Jared. Good questions.

I think maybe I don't understand the specific metadata that would be stored on the FileSets object if there is only one FileSets object and how this would differ from the metadata that would be stored on either the Object or the actual File.

The only use I have around descriptive metadata on a FileSet is to assert a label describing what the FileSet bundles together. You can see this in the three use cases expressed here: https://github.com/hybox/models/tree/master/notes This differs from the descriptive metadata on the Object of which the FileSet is a member, because the Object's descriptive metadata describes the intellectual entity, so the book or the photograph or the monograph that has the FileSet as a member.
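
As a rough sketch of that split (hypothetical ex: URIs and literal values; rdfs:label is just one assumed way to carry the label, and pcdm:FileSet is the proposed class):

```turtle
@prefix pcdm: <http://pcdm.org/models#> .
@prefix dc:   <http://purl.org/dc/elements/1.1/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/> .

# Descriptive metadata about the intellectual entity lives on the Object.
ex:photograph a pcdm:Object ;
    dc:title "Some title" ;
    dc:description "blah blah" ;
    pcdm:hasMember ex:photograph-files .

# The FileSet only carries a label saying what it bundles together.
ex:photograph-files a pcdm:FileSet ;
    rdfs:label "Scan of the photograph" ;
    pcdm:hasFile ex:master , ex:thumbnail .
```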

How do you see that container being of use or maybe what would you store on the FileSet in a single FileSet scenario?

I assume "that container" means the FileSet? As to why I find it useful to separate the FileSet from the Object, see my comment above.

whikloj commented 7 years ago

@mjgiarlo Correct I did mean the FileSet, sorry for the ambiguity there.

Based on your answer and the notes (I think this one was the most relevant)

So (if I am understanding the argument) the benefit is that of the separation of physical object from virtual representation of that object.

So for the question

... do you see the FileSet as a mere grouping of files or also as a surrogate for an "intellectual object," ...

I see it as a mere grouping of files, which is why I'm not super excited about plugging it in all my resources. We are perhaps behind in our handling of linked data and digitization.

This distinction between Real World and Digital is not a concern of anyone I deal with here, not that it is not a valid concern and/or consideration, just that no one handing out projects seems as worried about it as they are with getting the digitized content up and accessible.

But if a pcdm:FileSet is a pcdm:Object, and we later want to improve our resources by applying them as "digital representations" of a Real World Object, could we not attach our pcdm:Objects (with the associated pcdm:Files) to this new pcdm:Object (let's call it pcdm:RealObject)?

Then our old pcdm:Object becomes a stand-in for a pcdm:FileSet (heck we can patch it and change it later), but with the benefit of delaying that decision until we are sure what the RWO will be? 'Cause as I said, we aren't having those discussions right now.
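
A rough sketch of that deferral, with hypothetical ex: URIs and the "pcdm:RealObject" placeholder written as a made-up ex:RealObject class, since no such term exists in PCDM:

```turtle
@prefix pcdm: <http://pcdm.org/models#> .
@prefix ex:   <http://example.org/> .

# Today: a plain Object holding its Files directly.
ex:page1 a pcdm:Object ;
    pcdm:hasFile ex:page1-master , ex:page1-thumbnail .

# Later: a new parent is added above it. ex:page1 keeps its URI and
# its Files, and from then on plays the role a FileSet would play.
ex:page1-rwo a ex:RealObject ;
    pcdm:hasMember ex:page1 .
```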

Thoughts?

mjgiarlo commented 7 years ago

Hi, @whikloj.

So (if I am understanding the argument) the benefit is that of the separation of physical object from virtual representation of that object.

Ha, no, I consider that (the distinction between RWOs and digital representation) a separate issue though it's related.

The benefit is in keeping file bundlings separate from the intellectual objects that contain them. A while back, I seem to recall you sharing that your primary use case is newspapers. Do I have that right? The distinction here would be disentangling any metadata about a page image from metadata about the page (the digital page object, not the RWO). Why?

Viewing this through the cultural heritage-oriented lens of long-term stewardship, though, I wonder if conflating the concepts of a file bundle and an intellectual entity sets us up for headaches down the road. Assume if you will that your content will be aggregated and re-used by other folks, and assume that your content will be built upon and added to fifteen or fifty years from now. If someone wants to add a file to the intellectual entity which is hidden inside the FileSet, the task is more complex than adding a FileSet to an existing Object -- someone would need to know the FileSet is more than just a file bundle, create a new object (with a brand new URI that did not previously exist on the linked open data web), copy the object's descriptive metadata from the FileSet to the object, etc. -- @mjgiarlo

I suspect examples will help us more than any additional yada yada I may have to offer. If you have an object that has three distinct files, each of which has a label and 1-2 derivatives, how would you model that in PCDM?

But if a pcdm:FileSet is a pcdm:Object, and we later want to improve our resources by applying them as "digital representations" of a Real World Object, could we not attach our pcdm:Objects (with the associated pcdm:Files) to this new pcdm:Object (let's call it pcdm:RealObject)?

I imagine you can but... oof, so many "objects." We need better language. ;) Would you mind sketching this out a bit (either in prose/snippets or visually, or whatever)?

whikloj commented 7 years ago

Ok @mjgiarlo, I think I understand your reasoning (or I'm getting a better grasp of it).

I think my concerns are essentially sad, pragmatic ones when I compare them with your "preserving the world's knowledge for future generations." What I mean is, stop making me look bad 😉

But long story short, I can see a use case for FileSets in newspapers, and I think @dannylamb had already seen one for compound objects.

So if our current (Fedora 3) structure is:

[image: old_newspaper]

Then I can see a new structure using pcdm:FileSets like:

[image: new_newspaper1]

But what I want to avoid is:

[image: new_newspaper2]

The extra FileSets on the newspaper and issue, there only for the sake of a thumbnail, are just extra objects that add up and eat into our storage... and unfortunately this is a concern for me.

scossu commented 7 years ago

@whikloj Why not reuse thumbnail resources?

[image: pcdm2_thumbnail]

The hasThumbnail predicate would be out of the scope of PCDM of course, but something that Islandora can allow the end user to establish.
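
In Turtle, the reuse might look roughly like this. The ex: URIs and the ex:hasThumbnail predicate are hypothetical, and pcdm:FileSet is the proposed class.

```turtle
@prefix pcdm:    <http://pcdm.org/models#> .
@prefix pcdmuse: <http://pcdm.org/use#> .
@prefix ex:      <http://example.org/> .

# The issue points at a thumbnail that already exists on its first page,
# so no FileSet is created on the issue just to hold a thumbnail.
ex:issue1 a pcdm:Object ;
    pcdm:hasMember ex:issue1-page1 ;
    ex:hasThumbnail ex:page1-thumbnail .   # not a PCDM predicate

ex:issue1-page1 a pcdm:Object ;
    pcdm:hasMember ex:page1-files .

ex:page1-files a pcdm:FileSet ;
    pcdm:hasFile ex:page1-thumbnail .

ex:page1-thumbnail a pcdm:File , pcdmuse:ThumbnailImage .
```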

whikloj commented 7 years ago

@scossu that is definitely an option, and I was thinking that same thing once I started drawing it out.

But I left the question up because for something like a Collection, you may not want to use an existing thumbnail in that case.

scossu commented 7 years ago

@whikloj Unless I am reading your #2 and #3 graphs as having dedicated thumbnails. In that case I would find it convenient to have a FileSet for the issue that contains the whole publication as a PDF and a thumbnail derivative that is used for the issue object.

whikloj commented 7 years ago

@scossu no you are correct that in the case of a newspaper, I should shift to using a pointer to the existing thumbnail. Issue derives from 1st page, and Newspaper... gets one from some issue (not always the first one).

But I'm not sure that I am getting any benefit from the FileSet on the newspaper level object.

Even if I want to use a custom thumbnail I can just store the thumbnail with a hasFile, as there is almost no chance of another person adding a different version.

But if it does happen, then I incur the additional costs of creating a FileSet and attaching it to the newspaper level object and moving the Files (read: thumbnail) into the FileSet.

Because I don't see that happening too often, I save that storage and complexity until it is needed. That's what I am wondering, about delaying that complexity until I actually need it.

mjgiarlo commented 7 years ago

OK, thanks for sketching that out, @whikloj! One discrepancy I noted between your original diagram and what @scossu drew is that he turned Pages into objects and you had them as FileSets. (FWIW, I'd probably do what @scossu diagrammed so that you can assert descriptive metadata about the Page independent of metadata about the files bundled together as representations of the Page.) Agree too that it'd make sense for the Newspaper and Issue objects to link to thumbnails contained within Page-level FileSets if that'd work for your use cases.

But I'm not sure that I am getting any benefit from the FileSet on the newspaper level object.

That's the central question of this issue, I suppose, eh? :) Whether you support FileSets as a required layer between Objects and Files probably depends on how you feel about:

  1. assertions about future flexibility afforded by the FileSet
  2. assertions about optimizations gained from the predictability of that layer being present
  3. assertions about PCDM supporting a broad set of use cases beyond the scope of any one institution

I find those compelling and thus I'm down with adding FileSets as a new, required layer (though what I'm not saying, and I'm not hearing, is that Hydra requires this to be so to make use of PCDM). I'm guessing you find them less compelling, thus you'd find the proposition behind this issue less compelling. Case in points?

ruebot commented 7 years ago

I'm guessing you find them less compelling, thus you'd find the proposition behind this issue less compelling.

I think that is really it from our -- Islandora -- perspective. I think I can honestly say that we do see the usefulness of FileSets, and they can come in handy when they are needed. But, requiring them, or making them mandatory, is not useful from our perspective. It is an extra layer of complexity, and overhead that is not needed 100% of the time.

mjgiarlo commented 7 years ago

Thanks, @ruebot! This has been 👀-opening, in a good way. :)

whikloj commented 7 years ago

@mjgiarlo yeah I saw the difference between my drawing and @scossu's. I can see places I might use FileSets, but in a single FileSet scenario I'm not sold on the additional layer yet. I think @ruebot has captured my feelings on this.

escowles commented 7 years ago

Discussion of FileSets has moved on — closing this issue. There is still work going on in the Hydra community about how FileSets should work, and what they represent, and making that compatible with the core ontology.