IQSS / dataverse

Open source research data repository software
http://dataverse.org
Other
873 stars 484 forks source link

As a researcher, I want more dataset metadata in schema.org exports so that my data is more discoverable #4371

Closed jggautier closed 5 years ago

jggautier commented 6 years ago

In issue #2243, some metadata fields important for dataset discovery were excluded from mapping to Schema.org. We said we'd include them in a later issue. This is that issue, and these are those fields (dun dun):

We'll also need to fix:

Which fields are added to the Schema.org metadata template (draft) and how they're mapped will probably be adjusted after community discussion (within Dataverse community and hopefully with a proposed RDA group focused on ways to make data more discoverable by search engines).

@scolapasta asked me to add to the definition of done that we should make sure that the methods used to pull metadata values from different fields into different exports (DDI, DC, DataCite, Schema.org, native JSON (?)) are consistent.

jggautier commented 6 years ago

After a discussion in today's Community Call about sending DataCite file-level metadata that includes the file checksum, @mercecrosas added in this Google Groups comment a table recommending ways to map to schema.org more dataset.

The table includes more metadata fields that have been added.

An open question, that might deserve its own github issue, is if Dataverse should produce schema.org metadata at the file level.

djbrooke commented 5 years ago

Let's exclude File Download URL for now. It can follow on in a separate issue.

pameyer commented 5 years ago

Thanks to @jggautier I was able to track down some tools for validating schema.org JSON-LD. https://github.com/jessedc/ajv-cli can be used from the command line to validate the JSON against a schema ; after using https://github.com/scrapinghub/extruct to retrieve the JSON-LD from the generated html/xhtml.

pdurbin commented 5 years ago

@jggautier and I just compared notes for 90 minutes and gave ourselves some todo items:

schemaimg_20181015_163814258

jggautier commented 5 years ago

The gist of the discussion was about how to add more metadata in schema.org while both making changes that are as backwards compatible as possible and making sure that consumers (people in charge of other systems) of the metadata are aware that because it's an evolving/unstable schema, there's a good chance that changes will affect how they consume and use this metadata.

The list kept in this Github issue's description is a wishlist of metadata we'd like to expose in schema.org, and getting into the weeds about how schema.org can be used to make more metadata discoverable is helping us understand issues with current recommendations and implementations, including Dataverse's implementation. When the changes we want to make are different from or extend Google's suggestions and other repositories' implementations, we need to weigh:

I'm proposing that we weigh those three things as we continue to add more metadata. We won't add it all for this issue, we'll take what we learn to the groups that are building consensus (schema.org community, RDA, Bioschemas, etc.), and continue to make additions/changes when there's a strong need to and/or when there are well-supported recommendations.

My next steps are to:

pdurbin commented 5 years ago

In pull request #5169 I'm ready for code review and discussion as of c27f10b

Here are the questions I have:

I'd also like to note a few things:

Moving to code review.

pameyer commented 5 years ago

One thing to add regarding the discussions w\ @pdurbin about the download URLs and glassfish stability is that I did a little investigating; and by themselves large numbers of calls to the download API do not appear likely to cause glassfish instability (distinct from any latency to other dataverse pages/endpoints).

djbrooke commented 5 years ago

re: Download URLs putting it as a configurable option sounds like a good solution, thanks!

sekmiller commented 5 years ago

With respect to "Funder" question, depending on how the internationalization of Controlled Vocabulary is handled, we might be able to take advantage of the "identifier" field on the ControlledVocabularyValue table. The intended use for "identifier is to allow for the changing of the display value via the api.

sekmiller commented 5 years ago

Also, as we discussed, consider moving the creation of the json to the Exporter service bean (and making the internal methods such as getFunders() private methods)

pdurbin commented 5 years ago

As I mentioned at standup there are still many unanswered questions in my long comment above at https://github.com/IQSS/dataverse/issues/4371#issuecomment-433525903 so I'm moving this back to code review.

Meanwhile, I implement the solution suggested above by @sekmiller to remove the English word "Funder" from the backend logic in 5c2ed685f. I updated the citation.tsv to add identifiers for the contributorType controlled vocabulary values. I'll note that @qqmyers express some concern about making the reloading of the citation.tsv mandatory at https://github.com/IQSS/dataverse/commit/5c2ed685f66629989fc93c3bbf68dd22453fbaa1#r31124792

In 5c2ed685f I also implemented a doc change to explain an experiment I'd like to conduct with regard to release notes. Here are the goals (also in the commit):

I ran this idea by @sekmiller and @qqmyers seems to like it as well: https://github.com/IQSS/dataverse/commit/5c2ed685f66629989fc93c3bbf68dd22453fbaa1#commitcomment-31124844

scolapasta commented 5 years ago

Going through the review, there will be a few things, but wanted to first start with Funder issue. I understand not wanting to use the value of 'Funder" to get this, but as @pdurbin eluded this is changing in the internationalization branch. Meaning what will be stored in the db will be the key (in English) and not the display value, which will be in bundles and translated. So it should be fine to use the value from the db to check this.

(additionally, identifier is meant to be as an external identifier for those controlled vocabulary values that we get from other sources; i.e. I'd opt for leaving them blank when not from external, considering we'll have the key above to use)

So I suggest removing these new identifiers and using the key (which is the English text currently, until internationalization issues is checked in). This also means we can remove the "experimental" changes for release notes to a separate branch / issue, where we can review as a group if this is the process we want going forward.

scolapasta commented 5 years ago

I also am wary of including so much specific logic in the code about specific identifier types (validation and url generation). It makes it impossible to add a new identifier type without changing the code. What we should want to consider is adding something to the definition of these controlled vocabulary values (similar to what we have for dataset field types) to store these format schemes and validations in the db.

We may decide that we're fine taking this as technical debt and create a new issue to scope this out, but ideally, this would have been discussed before this issue got to code review.

scolapasta commented 5 years ago

As DatasetVersion gets more and more helper methods to get specific dataset fields, it may make sense to consider extracting out these "helper" methods to something else. For example a DatasetFieldHelper class. This would keep the code more maintainable in the future. (and likely could have made sense to do earlier, but originally it was only a few fields for which we had "helper" methods; with schema.org we've added quite a few more it seems).

scolapasta commented 5 years ago

Re: File download urls - can we not have an option? Either decide to include them now or decide not to, but not add the jvm option? I think it's good to use the isPubliclyDownloadable filter (which when we get around to decoupling how we store guestbook responses / downloads won't be as needed).

pdurbin commented 5 years ago

I just made the three changes @djbrooke @scolapasta and I agreed on:

We're down to just the questions above I have for @jggautier so I assigned this to him.

scolapasta commented 5 years ago

Added #5271 and #5272.

Since we decided to defer #5271 and leave the logic in the code, it would be cleaner for the validate methods to be one generic method rather than multiple.

So you can pass the reg expression to the method as a parameter. In this way we the validate method won't have to change for #5271.

pdurbin commented 5 years ago

I just made the changes @jggautier @scolapasta @djbrooke @kcondon and I agreed to after standup:

kcondon commented 5 years ago

So, aside from internal code restructuring, this pr: -adds new fields to schema.org (create/export dataset, verify against list julian provides) -changes the structure of some fields in schema.org (multiple, object type) (same as above, add multiple where appropriate, paste into Google validation tool) -adds optional hide files jvm option to block download urls in export (verify on/off behavior and pubic/restricted file behavior) -publisher and provider will be the instance name (root dv) (verify against export)

kcondon commented 5 years ago

Issues/questions:

  1. files/distribution section contains additional, unspecified info: file name, pid.
  2. author id is missing if value is entered in a nonconforming format but not indication exists what the conforming format is.

Discussed above with Julian and he will complete review. Will discuss with Julian and Phil to see what needs to be addressed.

jggautier commented 5 years ago

Another issue:

  1. The URLs for the related publications show up wrapped in html:
    "citation": [
    {
      "@type": "CreativeWork",
      "text": "Related pub citation 1",
      "@id": "<a href=\"https://doi.org/10.7910/DVN/P7EVGF\" target=\"_blank\">https://doi.org/10.7910/DVN/P7EVGF</a>",
      "identifier": "<a href=\"https://doi.org/10.7910/DVN/P7EVGF\" target=\"_blank\">https://doi.org/10.7910/DVN/P7EVGF</a>"
jggautier commented 5 years ago
  1. @id is missing from the files/distribution section @kcondon mentioned in his comment (@pdurbin and I agreed to keep the extra file info.) We're always using @id whenever identifier is used. I discussed with @pdurbin and he'll update.
"distribution": [
    {
      "@type": "DataDownload",
      "name": "file1.txt",
      "fileFormat": "text/plain",
      "contentSize": 26,
      "description": "File description 1",
      "identifier": "https://hdl.handle.net/20.500.12050/FK2/TWFVRE/222222",
      "@id": "https://hdl.handle.net/20.500.12050/FK2/TWFVRE/222222",
      "contentUrl": "https://demo.dataverse.org/api/access/datafile/:persistentId?persistentId=doi:10.5072/FK2/CFWNSH/ZEHFD0"

(contentUrl should appear only when the installation indicates that they want download URLs appearing in their schema.org exports.)

pdurbin commented 5 years ago

Yep, I got rid of the "href" stuff in fcae94e and added @id at the file level in 0e0b55d.

jggautier commented 5 years ago

I looked at the schema.org export and all four issues are resolved!

@kcondon noticed that contentUrl isn't showing up in the schema.org export of a test dataset, although we expect it to. (It's the dataset titled "Test Schema Org Julian 5 Schema" on the "internal" test instance.)

pdurbin commented 5 years ago

For the record, as discussed with @kcondon and @jggautier , the FileUtil.isPubliclyDownloadable logic is used to contentUrl wasn't being shown because the dataset had terms of use. It also checks for guestbooks. Both of these require a popup to agree to or fill out in the UI.