Add schema.org markup on datasets page

pzwsk commented 3 years ago

More info on how-to here https://ai.googleblog.com/2018/09/building-google-dataset-search-and.html

matamadio commented 3 years ago

NOTE: Schema.org's Dataset vocabulary was originally based on DCAT, which in turn used Dublin Core and FOAF terms. JKAN is based on DCAT schema.

ConnectedSystems commented 3 years ago

@cgiovando

Would you know if search engines are able to understand embedded DCAT vocabularies (as they are implemented in JKAN in particular)?

It seems there are mappings between DCAT and Schema.org already (or at least subsets of DCAT, see here)

Would embedding Schema.org metadata alongside DCAT bring about any further enhancements?

pzwsk commented 3 years ago

Hi @ConnectedSystems, if you are talking about web search engines, you might be interested in the links below:

https://www.blog.google/products/search/making-it-easier-discover-datasets/
https://developers.google.com/search/docs/data-types/dataset#approach

ConnectedSystems commented 3 years ago

Hi @pzwsk

Yes, thank you. The first link had the information I was after.

Here (https://developers.google.com/search/docs/data-types/dataset#approach) it says:

"We can understand structured data in Web pages about datasets, using either schema.org Dataset markup, or equivalent structures represented in W3C's Data Catalog Vocabulary (DCAT) format"

Given JKAN already embeds DCAT markup, I'm hesitant to add Schema.org markup on top of it (it will be a lot of time and effort to do so), hence why I ask about the advantages/enhancements adding Schema.org would bring.

That said, when I say "embeds DCAT", this is true only for the built-in JKAN fields. For instance, I have not embedded DCAT markup alongside the information in custom tables (e.g., fields under "RDL Hazard Info" and "Additional Info" on this page: http://jkan.riskdatalibrary.org/datasets/hzd-afg-dr/).

I could add DCAT markup or Schema.org markup to these fields, but again, hesitant to do both.

pzwsk commented 3 years ago

Thanks, Taku. I'm not sure either that there is a clear need to go further at this stage. The next step is to contact potential platforms that would harvest us (WB Data Hub, Google Dataset Search, etc.).

At least we can and should put in the documentation of our JKAN instance that core metadata are exposed in DCAT format.


ldodds commented 3 years ago

Google dataset search does support both DCAT and Schema.org although they recommend the latter.

I had a look at the DCAT embedded in the JKAN pages. Using this RDFa extractor, it seems to parse fine, although the OGP properties are mixed in with the Dataset.

So it should be possible for Google, at least, to index the pages in JKAN. I can't see the site in Google search, but they might not have harvested yet. If that covers the core requirement, then perhaps we don't need the Schema.org markup as well?

Looking at the World Bank Data Hub, they are embedding both sets of metadata. They're using RDFa to provide the DCAT metadata (as we are with JKAN). They're using a variety of extra schemas too (see example output).

To provide the Schema.org metadata they're using an embedded JSON-LD block:

<script type="application/ld+json">...</script>

This probably simplifies things, as it avoids having to add both sets of properties across the HTML page. But it still requires converting the metadata to the other format.
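A minimal sketch of such a conversion, assuming the DCAT metadata is already available as a plain dictionary. The input keys, the function names, and the example URL are hypothetical and not part of JKAN. The output uses the Schema.org `Dataset` and `DataDownload` terms (`name`, `description`, `license`, `distribution`, `contentUrl`, `encodingFormat`).

```python
import json


def dcat_to_schema_org(dataset):
    """Map a DCAT-style dict (hypothetical input shape) to a Schema.org Dataset."""
    doc = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": dataset["title"],
        "description": dataset["description"],
    }
    if "license" in dataset:
        doc["license"] = dataset["license"]
    # Each DCAT distribution becomes a Schema.org DataDownload.
    doc["distribution"] = [
        {
            "@type": "DataDownload",
            "name": dist["title"],
            "contentUrl": dist["downloadURL"],
            "encodingFormat": dist["mediaType"],
        }
        for dist in dataset.get("distributions", [])
    ]
    return doc


def to_script_tag(doc):
    """Wrap the JSON-LD document in a script element for embedding in a page."""
    # Escape "</" so the JSON text cannot close the script element early.
    body = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Illustrative values only; the description and URL are placeholders.
example = {
    "title": "Afghanistan agriculture",
    "description": "Placeholder description of the dataset.",
    "distributions": [
        {
            "title": "Afghanistan agriculture",
            "downloadURL": "https://example.org/data/afg-agriculture.csv",
            "mediaType": "text/csv",
        }
    ],
}

print(to_script_tag(dcat_to_schema_org(example)))
```

Keeping the mapping in one function means the page only needs a single extra block, rather than a second set of attributes spread across the HTML.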

ldodds commented 3 years ago

In reviewing this I've noticed some bugs in the DCAT metadata, as parsed by the RDFa extractor linked above:

I don't think the metadata download should be marked up as a distribution, as it doesn't contain the data.
the title property seems to be picking up the titles of the distributions, not the dataset title. That might be related to the above
the OGP description property should have same value as the dataset description?

ConnectedSystems commented 3 years ago

Thanks for the review @ldodds

Would this be resolved by having a dedicated endpoint (#14) ?

Otherwise:

I don't think the metadata download should be marked up as a distribution, as it doesn't contain the data.

For clarity, this was the default behavior for files exposed via JKAN which I copied when modifying for RDL. But I guess this goes back to how "data" is defined/framed. From my perspective this metadata is data describing the dataset, and is made available as a distributed resource.

But semantics aside, would dcat:CatalogRecord be more acceptable? (I suspect not but trying to find a suitable alternative).

the title property seems to be picking up the titles of the distributions, not the dataset title. That might be related to the above

Sorry, I am missing something here.

If we take this entry as an example, the title of the dataset is "Afghanistan agriculture", and the given resource name (the distribution) matches.

Are you suggesting that the distribution title should match the resource filename, or otherwise be made different from the resource name?

the OGP description property should have same value as the dataset description?

Assuming this is the Open Graph Protocol, I suggest we disable the OGP feature. The base JKAN implementation is set up to include OGP at the page level (hence why you're seeing different values), and the only configuration provided is "on or off".

Modifying this to be more configurable is a much larger body of work.

matamadio commented 3 years ago

Currently we have a general title and description for the whole dataset page, and specific titles and descriptions for each of the resources/distributions (shown under "details"). Related to what was said in https://github.com/GFDRR/rdl-standard/issues/7, would it be better to split resources and have more unambiguous, precise metadata? I.e., one dataset page > one distribution.

ldodds commented 3 years ago

A Distribution is "A specific representation of a dataset. A dataset might be available in multiple serializations that may differ in various ways, including natural language, media-type or format, schematic organization, temporal and spatial resolution, level of detail or profiles".

The metadata is a description of the dataset, rather than a distribution of it. So it doesn't really fit with serving it as a Distribution in my opinion. A Catalog Record is closer perhaps but also doesn't look quite right.

Portals sometimes have a link to download the dataset metadata (which is displayed and embedded into the page) in different formats, e.g. on WB DH: "The information on this page (the dataset metadata) is also available in these formats...". But that's different to the resources associated with the dataset.

re: the title property. I just meant that there's something wrong with the embedded RDFa markup. When extracting the metadata from the page, the dcterms:title property ends up with two values: "Afghanistan agriculture" and "Metadata". There's also only a single Distribution, with the title "Afghanistan agriculture". So something is not quite right in there, but I've not identified the source of the problem.
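One way to check for this symptom, sketched under the assumption that the extracted metadata is available as (subject, predicate, object) tuples. The triples below are a hand-written stand-in for the extractor output described above, and the `page` subject name is hypothetical.

```python
from collections import defaultdict


def subjects_with_multiple_titles(triples):
    """Return subjects that carry more than one distinct dcterms:title value."""
    titles = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "dcterms:title":
            titles[subject].add(obj)
    return {s: sorted(t) for s, t in titles.items() if len(t) > 1}


# Hypothetical triples mirroring the reported problem: both titles
# are attached to the same page-level subject.
triples = [
    ("page", "rdf:type", "dcat:Dataset"),
    ("page", "dcterms:title", "Afghanistan agriculture"),
    ("page", "dcterms:title", "Metadata"),
]

print(subjects_with_multiple_titles(triples))
# {'page': ['Afghanistan agriculture', 'Metadata']}
```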

I've included a screenshot. I used the Structured Data Sniffer extension to try and show it.

[Image: Screenshot from 2021-04-21 13-57-22]

Hope that helps.

matamadio commented 3 years ago

Thanks, if I understand correctly:

different distributions should always refer to the same dataset (e.g. a different version or language), and should not be used for different subsets (e.g. one distribution for residential exposure and another for industrial exposure);
such subsets should be distinct datasets instead.

ldodds commented 3 years ago

@matamadio broadly yes. Sometimes you might split a large dataset over multiple files in different ways. I think it's legitimate to have those as separate distributions associated with the same dataset. The most common case is usually a single distribution per dataset.

My rule of thumb is that if there are any differences in the provenance or governance of the data (e.g. it's produced by a different process, or by a different organisation, or has different licensing) then it's a different dataset and will have its own distribution(s).
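The rule of thumb can be sketched as a grouping step: resources that share the same process, organisation and licence end up as distributions of one dataset, and anything that differs becomes its own dataset. The field names and example records are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict


def group_into_datasets(resources):
    """Group resources that share provenance and governance into one dataset."""
    datasets = defaultdict(list)
    for res in resources:
        # Any difference in these fields means a different dataset.
        key = (res["process"], res["organisation"], res["license"])
        datasets[key].append(res["name"])
    return dict(datasets)


# Hypothetical resources: the first two are one large dataset split over
# two files; the third has a different producer and licence.
resources = [
    {"name": "exposure_part1.csv", "process": "survey-2020",
     "organisation": "Org A", "license": "CC-BY-4.0"},
    {"name": "exposure_part2.csv", "process": "survey-2020",
     "organisation": "Org A", "license": "CC-BY-4.0"},
    {"name": "exposure_modelled.csv", "process": "model-v1",
     "organisation": "Org B", "license": "ODbL-1.0"},
]

for key, names in group_into_datasets(resources).items():
    print(key, "->", names)
```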

matamadio commented 3 years ago

Alright, then I will need to split several sets after the schema update.

ConnectedSystems commented 3 years ago

Hi @ldodds

re: "the title property. I just meant that there's something wrong with the embedded RDFa markup. When extracting the metadata from the page, the dcterms:title property ends up with two values: "Afghanistan agriculture" and "Metadata".

I think the reason for this is that the base JKAN template assigns a dcterms:title property to each associated resource entry (which I subsequently used as a basis for the RDL metadata file). At the same time, JKAN assigns a dcat:Dataset property to each file resource, but does not assign a distribution tag at all, so it seems all properties get lumped together with the parent, page-level specifications (hence why all the dcterms:title tags end up together).

The implemented approach may not align with DCAT completely either, as the DCAT v2 documentation appears to suggest that Datasets can represent collections of Distributions (as per your statement above re definition of "Distribution").

I've tentatively adjusted the JKAN template (only on local dev) such that the dcat:Dataset property is provided on a once-per-page basis, with all resources listed therein marked as dcat:Distribution.

In this way, each dcterms:title gets associated with a Distribution.
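A rough illustration of that structure, written as a small Python renderer rather than the actual JKAN template (which is not shown in this thread). The function name, element choices and example URL are assumptions. The RDFa attributes (`prefix`, `typeof`, `property`, `href`) are standard RDFa 1.1.

```python
from html import escape


def render_dataset_rdfa(title, resources):
    """Render one dcat:Dataset per page with nested dcat:Distribution entries."""
    lines = [
        '<div prefix="dcat: http://www.w3.org/ns/dcat# '
        'dcterms: http://purl.org/dc/terms/" typeof="dcat:Dataset">',
        f'  <h1 property="dcterms:title">{escape(title)}</h1>',
    ]
    for res in resources:
        lines += [
            # property + typeof starts a new typed node linked to the Dataset,
            # so the title below belongs to the Distribution, not the page.
            '  <div property="dcat:distribution" typeof="dcat:Distribution">',
            f'    <span property="dcterms:title">{escape(res["title"])}</span>',
            f'    <a property="dcat:downloadURL" href="{escape(res["url"])}">Download</a>',
            "  </div>",
        ]
    lines.append("</div>")
    return "\n".join(lines)


# Placeholder resource; the URL is illustrative only.
print(render_dataset_rdfa(
    "Afghanistan agriculture",
    [{"title": "Afghanistan agriculture",
      "url": "https://example.org/data/afg-agriculture.csv"}],
))
```

Because each resource opens its own typed node, an extractor should report one title on the Dataset and one per Distribution, instead of several titles on the page-level subject.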

The metadata is a description of the dataset, rather than a distribution of it. So it doesn't really fit with serving it as a Distribution in my opinion. A Catalog Record is closer perhaps but also doesn't look quite right.

If I've interpreted the DCAT v2 documentation example correctly (and there's a good chance I haven't), then the Distribution type can also be used for accompanying metadata, as in the example shown and linked below:

dcat:distribution [
      rdf:type dcat:Distribution ;
      dct:title "RDF/XML representation of the ontology used for the data"@en ;
      dcat:downloadURL <http://resource.geosciml.org/ontology/timescale/gts.rdf> ;
      dcat:mediaType <https://www.iana.org/assignments/media-types/application/rdf+xml> ;
]

https://www.w3.org/TR/vocab-dcat-2/#ex-elaborated-bag

I've also updated the dcat:accessURL for Distributions and rdl-metadata files to dcat:downloadURL. Again these changes are only made locally until we are in agreement.

GFDRR / rdl-jkan

Add schema.org markup on datasets page #12