As a researcher, I want more dataset metadata in schema.org exports so that my data is more discoverable

jggautier commented 6 years ago

In issue #2243, some metadata fields important for dataset discovery were excluded from mapping to Schema.org. We said we'd include them in a later issue. This is that issue, and these are those fields (dun dun):

Creator types (person or organization) for dataset authors (a separate ticket has been opened for this, https://github.com/IQSS/dataverse/issues/5029, since the missing creator types are making Google's Dataset Search engine not display creator names and there are UI implications)
Dataset identifier (DOI or HDL as a URL) using the @id property
Dataset identifier (DOI or HDL as a URL) using the url property
Name of funding source (that's all schema.org supports for now; how to include other funding source details (like grant numbers) are discussed in this github issue in schema.org's repo)
Author identifiers
Geographic coverage
Multiple dataset descriptions
File metadata (see schema.org from Zenodo and from ICPSR for an example of how these fields are used)
- File PID
- File download URL (when there is one - excludes restricted files and files in datasets with guestbooks)
- File name
- File description
- File format

We'll also need to fix:

The property used for dataset authors. Dataverse is using author, but I think Google Dataset Search is ignoring author and prefers creator. (See this comment on #5029)
Keywords and Topic Classifications (Dataverse concatenates each keyword into one value, and each topic classification into one value; see this dataset's schema.org metadata)
Provider: Change the value to use the installation name (instead of hardcoding "Dataverse")
- For the provider property we hardcode "Dataverse" and put the installation name in the DataCatalog name property, but Dataset Search is displaying a "Data provided by" field and is using what's in the provider property.

Which fields are added to the Schema.org metadata template (draft) and how they're mapped will probably be adjusted after community discussion (within Dataverse community and hopefully with a proposed RDA group focused on ways to make data more discoverable by search engines).

@scolapasta asked me to add to the definition of done that we should make sure that the methods used to pull metadata values from different fields into different exports (DDI, DC, DataCite, Schema.org, native JSON (?)) are consistent.

jggautier commented 6 years ago

After a discussion in today's Community Call about sending DataCite file-level metadata that includes the file checksum, @mercecrosas added in this Google Groups comment a table recommending ways to map to schema.org more dataset.

The table includes more metadata fields that have been added.

An open question, that might deserve its own github issue, is if Dataverse should produce schema.org metadata at the file level.

djbrooke commented 5 years ago

Let's exclude File Download URL for now. It can follow on in a separate issue.

pameyer commented 5 years ago

Thanks to @jggautier I was able to track down some tools for validating schema.org JSON-LD. https://github.com/jessedc/ajv-cli can be used from the command line to validate the JSON against a schema ; after using https://github.com/scrapinghub/extruct to retrieve the JSON-LD from the generated html/xhtml.

pdurbin commented 5 years ago

@jggautier and I just compared notes for 90 minutes and gave ourselves some todo items:

schemaimg_20181015_163814258

jggautier commented 5 years ago

The gist of the discussion was about how to add more metadata in schema.org while both making changes that are as backwards compatible as possible and making sure that consumers (people in charge of other systems) of the metadata are aware that because it's an evolving/unstable schema, there's a good chance that changes will affect how they consume and use this metadata.

The list kept in this Github issue's description is a wishlist of metadata we'd like to expose in schema.org, and getting into the weeds about how schema.org can be used to make more metadata discoverable is helping us understand issues with current recommendations and implementations, including Dataverse's implementation. When the changes we want to make are different from or extend Google's suggestions and other repositories' implementations, we need to weigh:

the urgency of making those changes (e.g. we need Google to index dataset authors)
the influence Dataverse can have over recommendations that everyone can get behind (e.g. by starting to publish thousands of datasets using schema.org properties and json structures that no one else is using, we can influence the direction of those recommendations)
the probability that, as stronger recommendations require us to change the way we use schema.org and json, other systems that use that metadata will need to change

I'm proposing that we weigh those three things as we continue to add more metadata. We won't add it all for this issue, we'll take what we learn to the groups that are building consensus (schema.org community, RDA, Bioschemas, etc.), and continue to make additions/changes when there's a strong need to and/or when there are well-supported recommendations.

My next steps are to:

[x] consider those three things and propose a (probably smaller) template for this issue
[x] help @pdurbin figure out a bug where some dataset citations aren't showing up in schema.org for published datasets (e.g. citation shows up in the schema.org metadata of this dataset but not this dataset)
- The schema.org property "citation" is (correctly) mapped to Dataverse's "Related Publication" citation field, not the dataset citation...

pdurbin commented 5 years ago

In pull request #5169 I'm ready for code review and discussion as of c27f10b

Here are the questions I have:

Do people like all the changes I made to the guides? Please note that I'm documenting backward incompatible changes.
Does the long Javadoc comment in SchemaDotOrgExporter make sense?
Should we remove the schema version? https://schema.org/version/3.4/ says, "Note that schema.org release numbers are not generally included when you use schema.org."
Do we want "url" in there? It was removed from Julian's v4 but I'm not sure if that was intentional or not.
We're adding "publisher" to be identical to "provider". Is adding "publisher" an experiment similar to how we're adding "creator"?
Are we sure that we want "description" to be multivalued. No other implementation seems to do this. What is a consumer of multi values going to do with the extra values beyond the first one?

I'd also like to note a few things:

I spoke with @pameyer a bit about the concerns he voiced during sprint planning about Glassfish falling over due to too many downloads which lead to the note by @djbrooke about excluding file download URLs. Dataverse is a data sharing platform and despite the concerns that Harvard Dataverse has over stability or Amazon bills or whatever, groups that want to integrate with Dataverse are interested in a way to download files based on DOIs (or Handles) and seemed excited to learn that Zenodo provides URLs to each of their files in Schema.org JSON-LD ( https://github.com/whole-tale/whole-tale/issues/35#issuecomment-427411937 ) and I think Dataverse should provide this ability as well. Without a standard like this, the alternative is to explain to each group how the Dataverse APIs work (i.e. https://github.com/whole-tale/dashboard/issues/269#issuecomment-421413820 ). So I went ahead and added file download URLs but I pass them through our isPubliclyDownloadable filter and I provide a way to turn them off completely with a new JVM option called dataverse.files.hide-schema-dot-org-download-urls. (Another potential standard way to retrieve files is via SWORD, "section 6.4. - Retrieving the content" of the spec, but we never implemented it in #183 and I'm not sure we eer will.)
I implemented @id and identifier in "citation" to get the value from publicationURL rather than trying to figure it out from a combination of publicationIDType and publicationIDNumber. This made more sense to me because the value needs to be a URL anyway and this way we can support more than DOI.
It's a little weird that the English word "Funder" is required in the Java code to get the right contributor type but I don't see another way around this. See 043080d.
Google provides a JSON-LD example for a dataset at https://developers.google.com/search/docs/data-types/dataset
Here is Julian's v4 I was working off of: julian-v4.json.txt
Here's how the Schema.org JSON-LD output looks (CC0 example). I can add this to the guides if it would be helpful: pdurbin-c27f10b.json.txt

Moving to code review.

pameyer commented 5 years ago

One thing to add regarding the discussions w\ @pdurbin about the download URLs and glassfish stability is that I did a little investigating; and by themselves large numbers of calls to the download API do not appear likely to cause glassfish instability (distinct from any latency to other dataverse pages/endpoints).

djbrooke commented 5 years ago

re: Download URLs putting it as a configurable option sounds like a good solution, thanks!

sekmiller commented 5 years ago

With respect to "Funder" question, depending on how the internationalization of Controlled Vocabulary is handled, we might be able to take advantage of the "identifier" field on the ControlledVocabularyValue table. The intended use for "identifier is to allow for the changing of the display value via the api.

sekmiller commented 5 years ago

Also, as we discussed, consider moving the creation of the json to the Exporter service bean (and making the internal methods such as getFunders() private methods)

pdurbin commented 5 years ago

As I mentioned at standup there are still many unanswered questions in my long comment above at https://github.com/IQSS/dataverse/issues/4371#issuecomment-433525903 so I'm moving this back to code review.

Meanwhile, I implement the solution suggested above by @sekmiller to remove the English word "Funder" from the backend logic in 5c2ed685f. I updated the citation.tsv to add identifiers for the contributorType controlled vocabulary values. I'll note that @qqmyers express some concern about making the reloading of the citation.tsv mandatory at https://github.com/IQSS/dataverse/commit/5c2ed685f66629989fc93c3bbf68dd22453fbaa1#r31124792

In 5c2ed685f I also implemented a doc change to explain an experiment I'd like to conduct with regard to release notes. Here are the goals (also in the commit):

As a developer, I want to express in my pull request when an addition to the release notes will be necessary.
As a developer, I want to be aware of changes that should be made to my dev environment after a pull request has been merged. I already know to look in scripts/database/upgrades if I pull the latest code from the "develop" branch for updates as described in doc/sphinx-guides/source/developers/sql-upgrade-scripts.rst but I want a place to look for non-SQL updates that are required. These could be Solr schema changes or curl commands to reload metadata blocks, for example.

I ran this idea by @sekmiller and @qqmyers seems to like it as well: https://github.com/IQSS/dataverse/commit/5c2ed685f66629989fc93c3bbf68dd22453fbaa1#commitcomment-31124844

scolapasta commented 5 years ago

Going through the review, there will be a few things, but wanted to first start with Funder issue. I understand not wanting to use the value of 'Funder" to get this, but as @pdurbin eluded this is changing in the internationalization branch. Meaning what will be stored in the db will be the key (in English) and not the display value, which will be in bundles and translated. So it should be fine to use the value from the db to check this.

(additionally, identifier is meant to be as an external identifier for those controlled vocabulary values that we get from other sources; i.e. I'd opt for leaving them blank when not from external, considering we'll have the key above to use)

So I suggest removing these new identifiers and using the key (which is the English text currently, until internationalization issues is checked in). This also means we can remove the "experimental" changes for release notes to a separate branch / issue, where we can review as a group if this is the process we want going forward.

scolapasta commented 5 years ago

I also am wary of including so much specific logic in the code about specific identifier types (validation and url generation). It makes it impossible to add a new identifier type without changing the code. What we should want to consider is adding something to the definition of these controlled vocabulary values (similar to what we have for dataset field types) to store these format schemes and validations in the db.

We may decide that we're fine taking this as technical debt and create a new issue to scope this out, but ideally, this would have been discussed before this issue got to code review.

scolapasta commented 5 years ago

As DatasetVersion gets more and more helper methods to get specific dataset fields, it may make sense to consider extracting out these "helper" methods to something else. For example a DatasetFieldHelper class. This would keep the code more maintainable in the future. (and likely could have made sense to do earlier, but originally it was only a few fields for which we had "helper" methods; with schema.org we've added quite a few more it seems).

scolapasta commented 5 years ago

Re: File download urls - can we not have an option? Either decide to include them now or decide not to, but not add the jvm option? I think it's good to use the isPubliclyDownloadable filter (which when we get around to decoupling how we store guestbook responses / downloads won't be as needed).

pdurbin commented 5 years ago

I just made the three changes @djbrooke @scolapasta and I agreed on:

switch logic back to using the English word "Funder"
back out of the release notes
flag the JVM option as experimental

We're down to just the questions above I have for @jggautier so I assigned this to him.

scolapasta commented 5 years ago

Added #5271 and #5272.

Since we decided to defer #5271 and leave the logic in the code, it would be cleaner for the validate methods to be one generic method rather than multiple.

So you can pass the reg expression to the method as a parameter. In this way we the validate method won't have to change for #5271.

pdurbin commented 5 years ago

I just made the changes @jggautier @scolapasta @djbrooke @kcondon and I agreed to after standup:

3a95edf07 remove "schemaVersion" from output #4371
50bc8caf5 remove "url" from output #4371
882cbfb38 clarify publisher vs. provider comment #4371
7cd56221a single "isValidAuthorIdentifier" method, pass in regex #4371

kcondon commented 5 years ago

So, aside from internal code restructuring, this pr: -adds new fields to schema.org (create/export dataset, verify against list julian provides) -changes the structure of some fields in schema.org (multiple, object type) (same as above, add multiple where appropriate, paste into Google validation tool) -adds optional hide files jvm option to block download urls in export (verify on/off behavior and pubic/restricted file behavior) -publisher and provider will be the instance name (root dv) (verify against export)

kcondon commented 5 years ago

Issues/questions:

files/distribution section contains additional, unspecified info: file name, pid.
author id is missing if value is entered in a nonconforming format but not indication exists what the conforming format is.

Discussed above with Julian and he will complete review. Will discuss with Julian and Phil to see what needs to be addressed.

jggautier commented 5 years ago

Another issue:

The URLs for the related publications show up wrapped in html:

"citation": [
{
  "@type": "CreativeWork",
  "text": "Related pub citation 1",
  "@id": "<a href=\"https://doi.org/10.7910/DVN/P7EVGF\" target=\"_blank\">https://doi.org/10.7910/DVN/P7EVGF</a>",
  "identifier": "<a href=\"https://doi.org/10.7910/DVN/P7EVGF\" target=\"_blank\">https://doi.org/10.7910/DVN/P7EVGF</a>"

jggautier commented 5 years ago

@id is missing from the files/distribution section @kcondon mentioned in his comment (@pdurbin and I agreed to keep the extra file info.) We're always using @id whenever identifier is used. I discussed with @pdurbin and he'll update.

"distribution": [
    {
      "@type": "DataDownload",
      "name": "file1.txt",
      "fileFormat": "text/plain",
      "contentSize": 26,
      "description": "File description 1",
      "identifier": "https://hdl.handle.net/20.500.12050/FK2/TWFVRE/222222",
      "@id": "https://hdl.handle.net/20.500.12050/FK2/TWFVRE/222222",
      "contentUrl": "https://demo.dataverse.org/api/access/datafile/:persistentId?persistentId=doi:10.5072/FK2/CFWNSH/ZEHFD0"

(contentUrl should appear only when the installation indicates that they want download URLs appearing in their schema.org exports.)

pdurbin commented 5 years ago

Yep, I got rid of the "href" stuff in fcae94e and added @id at the file level in 0e0b55d.

jggautier commented 5 years ago

I looked at the schema.org export and all four issues are resolved!

@kcondon noticed that contentUrl isn't showing up in the schema.org export of a test dataset, although we expect it to. (It's the dataset titled "Test Schema Org Julian 5 Schema" on the "internal" test instance.)

pdurbin commented 5 years ago

For the record, as discussed with @kcondon and @jggautier , the FileUtil.isPubliclyDownloadable logic is used to contentUrl wasn't being shown because the dataset had terms of use. It also checks for guestbooks. Both of these require a popup to agree to or fill out in the UI.

IQSS / dataverse

As a researcher, I want more dataset metadata in schema.org exports so that my data is more discoverable #4371