DataConservancy / dcs-packaging-tool

The Data Conservancy Packaging Tool
http://dataconservancy.github.io/dcs-packaging-tool

Parts of the graph serialization are interspersed with the original payload data under /data directory #5

Open htpvu opened 8 years ago

htpvu commented 8 years ago

This has caused some usage issues for the DMS team. A feature that allows the user to specify a different location (outside of the payload /data directory) would be useful.

tdilauro commented 8 years ago

There are several reasons to support at least configuration:

rduerr commented 8 years ago

Ruth backs Tim

birkland commented 8 years ago

So neither of you considers domain objects, or metadata entered by the tool, to be custodial content of the package. In that case, the package tool would need something in the UI that indicates this. Would it be granular enough (or intuitive enough) to have a "do not package the metadata entered by this tool as custodial content" check box?

birkland commented 8 years ago

Also, the choice the tool makes now is the conservative choice: the tool doesn't necessarily know a priori how the client views the domain objects created by the tool, so it assumes that everything is custodial content (packaged data) by default (i.e. the tool user may be creating new intellectual assets that are intended to be conveyed as part of a total, packaged work). So that's the perspective on why we made the tool put everything under /data by default.

tdilauro commented 8 years ago

We just need the ability to place domain objects outside of the payload. There are multiple ways to achieve this. I would add that the only use case I have at present that might require support for the current format (data/bin/, data/obj/) is the need to transform packages that we've already created into the desired format (data/).

Certainly, we care about the metadata and domain objects, but both are distinct from the data produced by researchers, instruments, etc. While the assertions (in metadata, domain objects, graph) might change over time, the data is much less likely to do so in the use cases of the DMS (and probably NSIDC, as well, given their current use cases for DC packages/DCS pkg ingest). And when it does change, it usually means a new version (new ID, etc).

Assuming that payload (and thus "custodial content") connotes the same concept as "payload" from the BagIt specification, then I agree with the definition of a Package in Section 2.2 (Terminology) of the DC Packaging Spec (http://dataconservancy.github.io/dc-packaging-spec/dc-packaging-spec-1.0.html), which implies this same distinction:

Package: A logical unit of digital content conforming to this specification. It contains a payload, Domain Objects describing the payload, a manifest of Domain Objects, and additional package level metadata.

rduerr commented 8 years ago

Thanks for being so eloquent, Tim… Much better than how I would have put it! Ruth

birkland commented 8 years ago

So would a check box indicating your preference be sufficient?

tdilauro commented 8 years ago

Minimally, a checkbox and a canonical location for the domain object serializations (what is currently under data/obj/). In addition, it would be useful to (1) add properties in bag-info.txt indicating both the payload and domain object locations and (2) modify the spec docs to capture these changes. Additionally, a configuration option and command-line parameter would be needed for the automated tool.

birkland commented 8 years ago

Hi Tim,

Can you be a little more specific about what is being proposed for inclusion in bagit.txt? Is it the user preference entered into the tool indicating the directory into which domain objects generated by the tool shall go? If so, it may not deserve mention in the spec, as it's just part of the internal function of one particular tool. There is another place in the bag used for storing PTG configuration, but I don't exactly remember where that is at the moment.

The ReM manifest specifies the location for all resources considered to be domain objects, and is completely agnostic of any convention or policy of locating them (i.e. it could be in the payload section, outside the payload section, whatever. It only cares that they have URIs that can be resolved).

tdilauro commented 8 years ago

Hi @birkland,

My suggestion was for properties in bag-info.txt, rather than bagit.txt, but yes, you're right: There is probably no need to call out the default base location for the object serializations, since the locations of individual serializations can be extracted from META-INF/org.dataconservancy.packaging/PKG-INFO/ORE-REM/ORE-REM.ttl or whatever is pointed to by the already-specified Resource-Manifest bag-info.txt property.

If we plan to support payload content in more than one location, then it would probably be a good idea to add a property in bag-info.txt that specifies where the payload can be found. For my current use cases, that property would always point to "data/", the canonical payload location in the existing BagIt specs.
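
To make the suggestion concrete, here is a minimal sketch of how a consumer might read such a property. It assumes the usual BagIt tag-file layout ("Label: value" lines, with indented continuation lines). `Resource-Manifest` is the property already mentioned above; `Payload-Directory` is a hypothetical label invented for this illustration and is not defined in any spec.

```python
def read_bag_info(text: str) -> dict[str, list[str]]:
    """Parse bag-info.txt content into {label: [values]}.

    Lines look like "Label: value"; a line that starts with whitespace
    continues the previous value. Labels may repeat, so values are lists.
    """
    fields: dict[str, list[str]] = {}
    last_label = None
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line[0] in " \t" and last_label is not None:
            # Continuation line: fold it into the most recent value.
            fields[last_label][-1] += " " + line.strip()
            continue
        label, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Label: value" line; skip it
        last_label = label.strip()
        fields.setdefault(last_label, []).append(value.strip())
    return fields


def payload_directory(fields, default="data/"):
    # "Payload-Directory" is a HYPOTHETICAL label used only in this sketch.
    # Without it, fall back to the canonical BagIt payload location.
    values = fields.get("Payload-Directory")
    return values[0] if values else default
```

With a property like this, a bag that omits it would still resolve to "data/", so existing consumers that only know BagIt would not need to change.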

birkland commented 8 years ago

The assumption the spec makes is that /data is the one and only payload directory - but the spec is written in a way that domain object resources don't necessarily have to be payload. Whether they are payload in the BagIt sense depends on if they are located in /data or not.

If the domain objects are not in /data, then they are not payload, and not intended to be conveyed. They're just a specialized kind of metadata a client may safely choose to ignore.

If I'm understanding correctly, you and Ruth want a UI option in the PTG to allow the user to control where domain objects created by the PTG are put. Your viewpoint on whether domain objects are payload or not is implicit in your choice of location; anything not under /data is not payload. Does that sound about right?

tdilauro commented 8 years ago

While that is true, the PTG 1.0.x currently places the content payload in /data/bin. But I want that content at the top level of /data when the domain object serializations are not in the bag payload. I'm suggesting a property in bag-info.txt because, while I WOULD be able to get the locations of the domain object serializations by following the aforementioned Resource-Manifest bag-info.txt property, I would NOT be able to tell whether I should look in /data or /data/bin for the binary content.

The property wouldn't have to have as a value the directory location of the binary content, necessarily; it could be a flag that indicates which mode we're in. But I think the former would be a lot clearer and a lot more flexible in the long term.
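
A rough sketch of the problem, assuming only what is described in this thread (PTG 1.0.x puts binary content under data/bin/ and domain object serializations under data/obj/). The function name and the `declared` parameter are hypothetical; `declared` stands in for the value of the proposed bag-info.txt property.

```python
def payload_root(bag_paths, declared=None):
    """Guess which directory holds the binary (non-graph) content.

    bag_paths: bag-relative file paths using "/" separators.
    declared:  value of a HYPOTHETICAL bag-info.txt property naming the
               payload directory, or None if the bag does not carry one.
    """
    if declared:
        # An explicit declaration removes the need to guess.
        return declared.strip("/")
    # Without a declaration, fall back to a layout heuristic: treat the
    # presence of both data/bin/ and data/obj/ as the PTG 1.0.x format.
    has_bin = any(p.startswith("data/bin/") for p in bag_paths)
    has_obj = any(p.startswith("data/obj/") for p in bag_paths)
    return "data/bin" if has_bin and has_obj else "data"
```

The heuristic branch is where it goes wrong: a researcher's own payload could contain directories named bin/ and obj/ at the top of data/, and the guess would then strip a level that belongs to the data. The declared value has no such ambiguity.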

emetsger commented 8 years ago

@tdilauro by virtue of the fact that you know you're dealing with a DC package, and you have a location of the ORE-ReM, you can parse the ReM to find the data, no matter where they are located (/data vs /data/bin), so do we really need to add a property? I would also add that you know what "mode" you are in by examining the package and determining whether or not the ORE-ReM is in the payload or not.

I'm wondering if the spec needs to support this idea of a "mode" or is this just an implementation detail of the PTG.

tdilauro commented 8 years ago

@emetsger In the simple case, I think it should be possible to extract the payload from the package without having to know how to parse and understand the graph serializations.

emetsger commented 8 years ago

@tdilauro so in that instance you would examine the package, and determine that the ReM is not payload, so then you would expect to find payload under /data, or treat the payload as if everything were rooted under /data, right?

tdilauro commented 8 years ago

@emetsger The location of the ReM is given by the Resource-Manifest bag-info.txt property and seems to default to META-INF/org.dataconservancy.packaging/PKG-INFO/ORE-REM/ORE-REM.ttl, so the ReM is not currently in the payload, at least not for bags produced by the PTG. It is only the domain object serializations that were in the payload, as far as I can tell. So using its location to make such a determination would be problematic.

Would adding such a property be difficult or problematic, assuming that we proceed with this work, for other reasons that I'm possibly not grokking?

emetsger commented 8 years ago

> @emetsger The location of the ReM is given by the Resource-Manifest bag-info.txt property and seems to default to META-INF/org.dataconservancy.packaging/PKG-INFO/ORE-REM/ORE-REM.ttl, so the ReM is not currently in the payload, at least not for bags produced by the PTG. It is only the domain object serializations that were in the payload, as far as I can tell. So using its location to make such a determination would be problematic.

Hah right!

Well, again, I think we have to decide if this is an implementation detail or does the spec need to be updated? Certainly the PTG could add any property it needs to a bag tag file, but it's a question of whether or not this is an issue that needs to be enumerated in the package spec.

tdilauro commented 8 years ago

My personal opinion is that we make the simple case (e.g., "I just want to pull my bytes out") easy and consistent, no matter what tool creates the package. That last part seems like the role of the spec, which should be about interoperability. If it's not in the spec, then there will be PTG packages and other tool packages that are incompatible.

birkland commented 8 years ago

Hm, I don't think I understand the problem? @tdilauro Is not the simple case "I just want to pull my bytes out" merely "grab all files out of /data" whilst ignoring everything else? @emetsger what are we trying to decide is an implementation detail vs in the spec? I don't think I understand what we're referring to any more.

tdilauro commented 8 years ago

@birkland It's not, because we already have packages that have been created in the current version 1.0.x format. In the future, I hope to have data in a different location. If I'm processing a package, I might need to know where my data (vs. my graph) is.

emetsger commented 8 years ago

@birkland I'm suggesting that our spec should not say anything about how the custodial content of the package is arranged or whether or not the graph is in the custodial portion of the bag.

birkland commented 8 years ago

I see. For any package (created at any time, by any tool), we know that:

  1. The resources in /data are the custodial content (payload) of the bag
  2. The resources aggregated (ore:aggregates) by the manifest are your graph

So selecting the 'non-graph' payload is a matter of selecting everything out of /data, and removing anything aggregated by the manifest ReM. This is the general solution that will guarantee a correct answer without any a priori knowledge about the bag or its structure.
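
As a minimal sketch, the general solution is a set difference. The function name is hypothetical, and it assumes the aggregated resources have already been resolved to bag-relative paths; parsing the ReM to obtain them is not shown here.

```python
def non_graph_payload(bag_paths, aggregated):
    """Return payload files that are not part of the graph.

    bag_paths:  bag-relative paths of every file in the bag.
    aggregated: bag-relative paths of the resources aggregated by the
                manifest ReM (assumed to be extracted elsewhere).
    """
    # Step 1: everything under data/ is custodial content (payload).
    payload = {p for p in bag_paths if p.startswith("data/")}
    # Step 2: drop anything the manifest aggregates, wherever it lives.
    return sorted(payload - set(aggregated))
```

This gives the same answer whether the graph resources sit inside /data or outside it, which is why it needs no knowledge of the bag's layout.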

The process of extracting non-graph data can be simplified (i.e. no parsing of manifest) if you know that a certain file path in the bag contains exactly all the non-graph data. For that reason, @tdilauro suggested a property to indicate the directory that exclusively contains all the non-graph data. @emetsger suggests that the spec should remain agnostic of such issues, and that defining such a property complicates the spec and implementation of tools. @birkland thinks the problem can be worked around by configuring the tool to place graph resources outside of /data, and tools that ignore the spec completely should be happy.