DataONEorg / slinky

Slinky, the DataONE Graph Store
Apache License 2.0

Represent individual files #7

Open ThomasThelen opened 3 years ago

ThomasThelen commented 3 years ago

We need a way to represent individual files. Right now, there isn't a great way to do this with SOSO.

One option is to represent them as individual nodes in the graph and connect the related dataset(s) to them. This would allow us to further annotate the nodes with additional information, such as which variables they describe.

Unknowns:

  1. The rdf:type of the node
  2. The predicate connecting the dataset to the file
mbjones commented 3 years ago

Earlier we discussed that individual files are in fact listed in SOSO guidelines using the schema:DataDownload class in the distribution property, which can be repeated for each data entity in a Dataset. The representation is like this:

...
  "distribution": {
    "@type": "DataDownload",
    "contentUrl": "https://www.sample-data-repository.org/dataset/472032.tsv",
    "encodingFormat": "text/tab-separated-values"
  }
...

To add additional properties, we could attach additional classes. In GeoLink, this was geolink:DigitalObject, and in DCAT it would be dcat:Distribution. We could also use prov:Entity or provone:Data, which we use in our provenance graphs as well, so that would provide some level of alignment with other RDF that we would be serializing. What other candidates might there be for it?
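One way to do this cross-typing is to list multiple classes in the node's `@type`. A minimal sketch in Python, building the JSON-LD as a plain dictionary (the `@id`, URL, and context prefixes here are illustrative assumptions, not real DataONE content):

```python
import json

# Hypothetical sketch: a data file as a schema:DataDownload node
# cross-typed with PROV and DCAT classes. Identifiers are made up.
file_node = {
    "@context": {
        "schema": "https://schema.org/",
        "prov": "http://www.w3.org/ns/prov#",
        "dcat": "http://www.w3.org/ns/dcat#",
    },
    "@id": "https://example.org/object/472032.tsv",
    # Multiple @type values assert several classes on the same node.
    "@type": ["schema:DataDownload", "prov:Entity", "dcat:Distribution"],
    "schema:contentUrl": "https://www.sample-data-repository.org/dataset/472032.tsv",
    "schema:encodingFormat": "text/tab-separated-values",
}

print(json.dumps(file_node, indent=2))
```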

ThomasThelen commented 3 years ago

Ah, I ran across the following and blanked on the other section where DataDownload is used for general files:

> We use the schema:DataDownload class for Metadata files so that we can use the schema:MediaObject properties for describing bytesize, encoding, etc.

Our best bet is just sticking with schema:DataDownload. It has dcat:Distribution as an equivalent class, so we should be able to use DCAT terms where appropriate.

ThomasThelen commented 3 years ago

Do we want to talk about resource maps, system metadata, and science metadata as individual nodes in the graph? We certainly want to talk about EML documents, or at least link their download location as outlined here.

When it comes to science metadata: imagine that we have a node in the graph describing a particular file. Do we want to link its system metadata to it in some way?

[Screenshot, 2021-04-07: two sketched diagrams of the options described below]

The diagram on top is a sketch of the file linked to its corresponding system metadata node via a to-be-determined term (hasSystemMetadata). This node in turn has properties about it (the things we typically find in system metadata documents, like size, hash, name, etc.).

The second diagram is a more abstract view where, instead of representing the properties of the system metadata document, we just encode where to download it (a questionable use of distribution).
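The two sketches can be written out as the triples each would produce, here as plain Python tuples (the prefixes, the hasSystemMetadata predicate, and all identifiers are illustrative assumptions, not established terms):

```python
# Hypothetical triples for the two options above. All names are made up.
FILE = "d1obj:urn:uuid:file-1"
SYSMETA = "d1obj:urn:uuid:file-1-sysmeta"

# Option 1: system metadata as its own node with its own properties.
option_1 = [
    (FILE, "d1:hasSystemMetadata", SYSMETA),
    (SYSMETA, "schema:contentSize", "2048"),
    (SYSMETA, "schema:name", "sysmeta.xml"),
]

# Option 2: only record where to download the system metadata document
# (the questionable use of schema:distribution mentioned above).
option_2 = [
    (FILE, "schema:distribution", "https://example.org/meta/urn:uuid:file-1"),
]

print(len(option_1), len(option_2))  # → 3 1
```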

amoeba commented 3 years ago

That a file (schema:DataDownload) has a System Metadata record isn't really scientifically interesting, nor is the fact that it's got a Resource Map Object describing/aggregating it. The approach we have in place now pulls properties from various DataONE types but aggregates them as properties on either the schema:Dataset or schema:DataDownload. The ore:aggregates triples in the ORE are represented as schema:distribution triples. I'm swapping schema terms in here for geolink terms since the structure is the same.

This structure allows us to answer questions like "give me download URLs for all the NetCDF files under 2MiB in size funded by NSF that contain measurements of carbon dioxide flux and were collected before the year 2000 north of 45° latitude". A structure that more closely follows DataONE's data model doesn't prohibit this but the queries are more fun to write.
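To illustrate why the flattened structure keeps such queries simple, here is a toy in-memory version of that filter (purely illustrative data; in practice this would be a SPARQL query against the triplestore):

```python
# Toy stand-in for the graph: each dataset carries its distributions
# directly, so filtering needs no joins across resource maps or
# system metadata nodes. All data below is made up.
datasets = [
    {
        "name": "Arctic CO2 flux 1998",
        "distribution": [
            {"contentUrl": "https://example.org/a.nc",
             "encodingFormat": "application/x-netcdf",
             "contentSize": 1_500_000},
            {"contentUrl": "https://example.org/a.csv",
             "encodingFormat": "text/csv",
             "contentSize": 10_000},
        ],
    },
]

MIB = 1024 * 1024

# "Give me download URLs for all the NetCDF files under 2 MiB":
urls = [
    dist["contentUrl"]
    for ds in datasets
    for dist in ds["distribution"]
    if dist["encodingFormat"] == "application/x-netcdf"
    and dist["contentSize"] < 2 * MIB
]
print(urls)  # → ['https://example.org/a.nc']
```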

Do we have a use case where having a more DataONE-centric model in the triplestore would be useful? I think for the SASP work, no. But longer term, maybe yes.

ThomasThelen commented 3 years ago

What had me thinking about this was this issue, which could mean a number of things though... I think my statement was probably a pondering of how much we're modeling the actual structure of DataONE (an ORE aggregating files, each with an associated system metadata document, etc.) versus the SOSO representation. But it makes sense to fold a number of the terms from the system metadata into the node representing the file.

So files are represented as first-class nodes? I.e., for every file in DataONE, there's a node of type schema:DataDownload, and one or more nodes of type schema:Dataset reference it via schema:distribution?

[Screenshot, 2021-04-08: diagram of schema:Dataset nodes referencing schema:DataDownload nodes via schema:distribution]

I don't see an immediate use case for representing the system metadata as a node if there's a mapping between the interesting metadata fields in the system metadata and schema.org terms with domain schema:DataDownload.

While thinking about how to close this issue it seems that the work needed is...

  1. Document which terms we want to save from the system metadata document (probably mostly already done in the current implementation)
  2. Come up with and document a mapping between the terms from step 1 and the schema.org vocabulary
  3. Implement it in the parser
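A starting point for step 2 might be a simple term-mapping table. The pairings below are assumptions to illustrate the shape, not the agreed-upon mapping:

```python
# Hypothetical mapping from DataONE system metadata fields to
# schema.org terms with domain schema:DataDownload. These pairings
# are illustrative guesses, not a finalized design.
SYSMETA_TO_SCHEMA = {
    "size": "schema:contentSize",
    "fileName": "schema:name",
    "formatId": "schema:encodingFormat",
    "dateUploaded": "schema:dateCreated",
}

def map_sysmeta(sysmeta: dict) -> dict:
    """Translate parsed system metadata fields into schema.org properties,
    dropping fields we haven't chosen to keep."""
    return {
        SYSMETA_TO_SCHEMA[k]: v
        for k, v in sysmeta.items()
        if k in SYSMETA_TO_SCHEMA
    }

props = map_sysmeta({"size": 2048, "fileName": "data.csv", "obsolete": True})
print(props)  # → {'schema:contentSize': 2048, 'schema:name': 'data.csv'}
```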
amoeba commented 3 years ago

Yep, that's what I'm working on over on https://github.com/DataONEorg/slinky/pull/21 at the moment. Once that's done, we can do step 3.

mbjones commented 3 years ago

This approach sounds good to me. I've never been very happy with the semantics of schema:DataDownload, but agree we should use it to represent the files. However, we could also benefit by cross-typing it with one other common vocabulary that has the concept of a data entity, namely prov:Entity (or maybe provone:Data?). Are there other candidates besides PROV for this common concept to replace geolink:DigitalObject?

amoeba commented 3 years ago

> However, we could also benefit it by cross typing it with one other common vocabulary that has the concept of a data entity....

I agree. Cross-typing with prov:Entity, provone:Data, and probably even dcat:Distribution seems both useful and logically correct. Considering that we're inserting into a graph, does it make more sense to (1) explicitly assert each type on each dataset or (2) just toss a few owl:equivalentClass statements in when the graph is constructed?

Option (1) is nice because simple SPARQL queries will surface the equivalence but has the downside of massively increasing the number of triples in the triplestore. (2) feels more "use the force, Luke". We could cross-type when serializing out to JSON-LD even with option (2).
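The trade-off can be sketched with rough triple counts (a toy model; the graph size is an assumption for illustration):

```python
# Toy comparison of the two approaches. Numbers are illustrative.
N_FILES = 1_000_000  # assumption: number of file nodes in the graph
EXTRA_TYPES = ["prov:Entity", "provone:Data", "dcat:Distribution"]

# Option 1: assert every extra rdf:type on every file node.
option_1_triples = N_FILES * len(EXTRA_TYPES)

# Option 2: a handful of schema-level statements, e.g.
#   schema:DataDownload owl:equivalentClass prov:Entity .
# and let a reasoner (or an equivalence-aware SPARQL query)
# surface the extra types at query time.
option_2_triples = len(EXTRA_TYPES)

print(option_1_triples, option_2_triples)  # → 3000000 3
```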