Proposal: How do we link publications to datasets?

ashepherd commented 6 years ago

Relevant schema.org/Dataset Properties

citation | CreativeWork or Text | A citation or reference to another creative work, such as another publication, web page, scholarly article, etc.
hasPart | CreativeWork | Indicates a CreativeWork that is (in some sense) a part of this CreativeWork. Inverse property: isPartOf.
isPartOf | CreativeWork | Indicates a CreativeWork that this CreativeWork is (in some sense) part of. Inverse property: hasPart.
subjectOf | CreativeWork or Event | A CreativeWork or Event about this Thing. Inverse property: about.
about | Thing | The subject matter of the content. Inverse property: subjectOf.

Query Patterns

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX schema: <http://schema.org/>

  SELECT ?creative_work ?related_type ?related_citation ?related_id_scheme ?related_id_value
  WHERE {
    # Dataset IRI(s) to start from; left empty in the original post
    VALUES ?dataset {  }
    # Properties that can link a dataset to another creative work
    VALUES ?relation { schema:citation schema:hasPart schema:isPartOf schema:subjectOf schema:about }
    ?dataset ?relation ?creative_work .
    ?creative_work rdf:type ?related_type .
    ?creative_work schema:citation ?related_citation .
    ?creative_work schema:identifier ?identifier .
    ?identifier schema:value ?related_id_value .
    ?identifier schema:propertyID ?related_id_scheme .
  }

?related_type is unbound, but we might want to constrain it to certain types such as Dataset, ScholarlyArticle, etc.
?related_citation might be another schema:CreativeWork, which can be iterated on to follow the network of links.
?identifier is looking for a legit, proper ID with scheme and value (e.g. scheme: "datacite:doi", value: "10.1234/56789").
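
For illustration, the same pattern can be walked over a single JSON-LD document in plain Python. This is only a sketch, not part of the proposal: the sample record and the helper names `as_list` and `related_works` are hypothetical, and it assumes compacted JSON-LD in which each identifier is a PropertyValue with `propertyID` and `value`.

```python
# Relations the query pattern above inspects on a schema:Dataset.
RELATIONS = ("citation", "hasPart", "isPartOf", "subjectOf", "about")


def as_list(value):
    """JSON-LD allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def related_works(dataset):
    """Yield (relation, type, id_scheme, id_value) for each linked work.

    Plain-text citations carry no structured identifier, so they are skipped.
    """
    for relation in RELATIONS:
        for work in as_list(dataset.get(relation)):
            if not isinstance(work, dict):
                continue
            for identifier in as_list(work.get("identifier")):
                if isinstance(identifier, dict):
                    yield (
                        relation,
                        work.get("@type"),
                        identifier.get("propertyID"),
                        identifier.get("value"),
                    )


# Hypothetical sample record; the DOI is the made-up one from the note above.
sample_dataset = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    "citation": [
        "Some free-text citation",
        {
            "@type": "ScholarlyArticle",
            "identifier": {
                "@type": "PropertyValue",
                "propertyID": "datacite:doi",
                "value": "10.1234/56789",
            },
        },
    ],
}

links = list(related_works(sample_dataset))
```

Here `links` holds one tuple for the ScholarlyArticle; the free-text citation is dropped, which is the same trade-off the query makes by requiring a structured identifier.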

ashepherd commented 6 years ago

There might be some work for the registry to do in re-interpreting the harvested triples into a cleaned-up set, to make SPARQL querying easier: resolving what property was used to name the thing, whether it has a URL or we need to build one from the DOI, stuff like that.
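
As a rough sketch of the URL question: prefer an explicit `url`, and otherwise build one from a DOI identifier. The function name `resolve_url` is hypothetical, and two assumptions are baked in: that a scheme label ending in "doi" (such as "datacite:doi") marks a DOI, and that https://doi.org/ is used as the resolver.

```python
DOI_RESOLVER = "https://doi.org/"  # assumption: resolve DOIs through doi.org


def resolve_url(work):
    """Return a URL for a linked work, or None if none can be determined."""
    url = work.get("url")
    if url:
        return url

    identifiers = work.get("identifier")
    if not isinstance(identifiers, list):
        identifiers = [identifiers]

    for identifier in identifiers:
        # Only structured identifiers (scheme + value) can be interpreted here.
        if not isinstance(identifier, dict):
            continue
        scheme = str(identifier.get("propertyID", "")).lower()
        value = identifier.get("value")
        # Assumption: a scheme label ending in "doi" identifies a DOI.
        if value and scheme.endswith("doi"):
            return DOI_RESOLVER + value
    return None
```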

mike-iris commented 6 years ago

As a brand new observer, with my software engineering hat on and past projects that did not always go exactly as planned ;) the phrase "to re-interpret the harvested triples" is a little bit of an orange flag.

If looking at the data and creating data structures can be thought of as bottom-up, then defining the minimum, required queries (i.e. the API) that must be delivered by the system could be considered a top-down view of the system.

With this in mind, it can be more efficient for a project to reach agreement on an initial API, then build only enough middleware (i.e. harvesting and storage) and data structures to deliver that API. Getting the data structures right is very important, but looking only at the data structures can lead to adding complexity before its time, and it also makes it difficult to understand any performance problems that show up in a running system.

Additionally, this framework can then be used to establish a functioning platform that serves as a basis for an iterative development process and for demonstrating the capabilities of the system. In my experience, driving middleware change by both desired data content and desired API will help focus effort where it is needed most and help get the right amount of complexity and code for each iteration.

ashepherd commented 6 years ago

Maybe that wasn't the right phrase, @mike-iris. What I meant is simply to structure the data for efficient querying, similar to what you would do with a Lucene index, JSON-LD framing, or ontology mapping. Doing so would disambiguate the data type and improve performance.

Thanks for your architecture thoughts. On the APIs, you can find that work here: https://github.com/earthcubearchitecture-project418/services

How is the publishing going for IRIS, @mike-iris? Do you need help?

mike-iris commented 6 years ago

@ashepherd, we plan to have an internal meeting this week to get things going. Based on our conversation in the other issue thread, I will not worry about this page: http://www.iris.edu/hq/sitemap. I will suggest we use this page, http://ds.iris.edu/ds/, to contain an "@type": ["Service", "Organization"] structure, and locate a sitemap.xml there. The sitemap.xml will refer to one or a few pages containing Data type structure.

ashepherd commented 6 years ago

@mike-iris, that's great! If there's anything I can do to help strategize on how to publish, don't hesitate to reach out. It'll be really cool to see IRIS data endpoints showing up.

Your plan for the sitemap is great. We made a way for you to describe this for cases similar to yours. You can find that here: http://geodex.org/voc/documentation#repository-services and in this code snippet:

{
  "@type": "ServiceChannel",
  "serviceUrl": "https://www.sample-data-repository.org/sitemap.xml",
  "providesService": {
    "@type": "Service",
    "additionalType": "gdx:SyndicationService",
    "name": "Sitemap XML",
    "description": "A Sitemap XML providing access to all of the resources for harvesting",
    "potentialAction": {
      "@type": "ConsumeAction",
      "target": {
        "@type": "EntryPoint",
        "additionalType": "gdx:SitemapXML",
        "urlTemplate": "https://www.sample-data-repository.org/sitemap.xml?page={page}"
      },
      "object": {
        "@type": "DigitalDocument",
        "url": "https://www.sample-data-repository.org/sitemap.xml",
        "fileFormat": "application/xml"
      }
    }
  }
}

If you don't have a sitemap with multiple pages, you can make the value of schema:target a URL instead of a schema:EntryPoint:

...
"target": "https://www.sample-data-repository.org/sitemap.xml"
...
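
A harvester reading this would need to accept both shapes of schema:target. The following is a minimal sketch of that branching, not the actual harvester code: the function name `sitemap_urls` is hypothetical, and it assumes the caller already knows which page numbers to request.

```python
def sitemap_urls(target, pages=()):
    """Return the sitemap URLs to fetch for a schema:target value.

    `target` is either a plain URL string or an EntryPoint dict with a
    `urlTemplate` that may contain a `{page}` placeholder.
    """
    if isinstance(target, str):
        return [target]

    template = target.get("urlTemplate", "")
    if "{page}" not in template:
        # An EntryPoint without paging behaves like a single URL.
        return [template] if template else []

    # Assumption: page numbers are supplied by the caller.
    return [template.replace("{page}", str(page)) for page in pages]


single = sitemap_urls("https://www.sample-data-repository.org/sitemap.xml")
paged = sitemap_urls(
    {
        "@type": "EntryPoint",
        "additionalType": "gdx:SitemapXML",
        "urlTemplate": "https://www.sample-data-repository.org/sitemap.xml?page={page}",
    },
    pages=(1, 2),
)
```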
