IQSS / dataverse

Open source research data repository software
http://dataverse.org

Align or merge DataCite metadata exports #5889

Open jggautier opened 5 years ago

jggautier commented 5 years ago

This issue is meant to record the differences between Dataverse's two newest metadata exports as of v4.14, "DataCite"/"Datacite" and "OpenAIRE"/"oai_datacite", and discussion about how to align (or possibly merge) the very similar exports.

As part of v4.10 (released in Dec. 2018), Dataverse makes available through the UI, API and over OAI-PMH dataset metadata in the DataCite schema (https://github.com/IQSS/dataverse/issues/5043). This lets Dataverse export dataset metadata in a widely-used, discipline-agnostic schema that's more standardized than Schema.org and has more metadata than Dublin Core.

As part of v4.14 (released in May 2019), Dataverse makes available through the UI, API and over OAI-PMH DataCite metadata that complies with OpenAIRE requirements (https://github.com/IQSS/dataverse/issues/4257). Repositories need to follow these requirements in order for their dataset metadata to be made discoverable (harvested) by OpenAIRE (OpenAIRE EXPLORE). The OpenAIRE metadata requirements follow the DataCite schema, with some differences between OpenAIRE and DataCite listed in their documentation.

What both exports are called depending on the export method:

[Image: openaire-datacite-export-graphic]

Both metadata exports are based on DataCite 4 and are meant to be valid against the DataCite 4 schema (although the xml records available over OAI-PMH in "Datacite" format reference DataCite's 3.1 schema). But Dataverse exports them as separate formats for several reasons:

Ideally, Dataverse would export only one metadata record, made available through the UI, API and over OAI-PMH, that follows the DataCite schema and is also OpenAIRE compliant. The way things are now, Dataverse exports two metadata records that are both based on DataCite but differ from each other, and people have been confused about the differences between the two exports, which are called "DataCite" and "OpenAIRE" in the UI and "Datacite" and "oai_datacite" in the API endpoints and over OAI-PMH.

But we may want to maintain two metadata exports because:

We should decide if:

mheppler commented 3 years ago

Related? Silent publishing failure when not all fields required by Datacite are present #7551

jggautier commented 3 years ago

Good point. It could be related if/when Dataverse repositories start sending more metadata to DataCite and the dependencies among the child fields of any of that metadata are the same as the dependencies among the child fields in the Producer compound field (which right now is the only field causing those silent failures).

adam3smith commented 3 years ago

@qqmyers and I are also looking at this given that what we're currently sending to DataCite is indeed rather inadequate. Looking at the Crosswalk Julian put together, it seems to me that the current OpenAIRE export is strictly better. The only field I'm seeing where DataCite has something and OpenAIRE doesn't is Name Identifier schemeURI, and that's either just not documented or an oversight that should be fixed.

I'm not at all concerned about the naming algorithm. If anything, I think it's a good idea to try to guess organizational names. I think the closed vs. restricted data categorization is something that should get addressed, but I don't see it as a blocker.

Given this, I think a single export format makes sense.

In terms of items missing from both exports, the citation metadata looks complete, but the individual subject blocks seem to have some stuff missing. From @philippconzett 's list at #7072 that's most notably the geography data, which we'd also like to capture.

We're viewing this as pretty high priority given how widely DataCite data are used (e.g. the fact that we're not linking up our funding information to the PID graph isn't great) -- is there anything we can do to help move this along?

djbrooke commented 3 years ago

Thanks @adam3smith.

@jggautier if we bring this into the sprint starting tomorrow, would you have some time over the next two weeks to get into this with a developer that picks it up? If not, the sprint starting in two weeks? I know you're spread a bit thin right now with the UI/UX work starting up, so if it makes sense to wait two weeks I think that's fine. What do you think?

poikilotherm commented 3 years ago

Is my #7077 related here, too? (Going to work on that, you folks know... Funding...)

djbrooke commented 3 years ago

@poikilotherm May be related, but I think we'd want to move these forward independently IMHO. I think much of the discussion around #7077 will happen as part of the Software Metadata WG.

adam3smith commented 3 years ago

Awesome! @jggautier -- I think you have this covered, but if there's anything you'd like another set of eyes on or a 2nd opinion just tag and/or email me.

jggautier commented 3 years ago

Thanks @adam3smith. Great to hear there's more interest in prioritizing this! I'm all on board with saving the closed vs. restricted data categorization problem (https://github.com/IQSS/dataverse/issues/5920) for another day if it moves this issue forward. I think there are a few other things we should consider:

adam3smith commented 3 years ago

Thanks Julian.

mfenner commented 3 years ago

Users can set Personal or Organizational authors via nameType. Otherwise DataCite is doing the following:

adam3smith commented 3 years ago

Thanks! Dataverse currently doesn't have a nameType option, which is why we need some sort of algorithmic solution to determine this.

Since the name list also sounds like it works less well for non-Western names, I'd actually now be somewhat nervous about this. Do you have contacts at some of the Chinese DV installations we could ask or are there Dataverse Collections at Harvard more likely to contain non-Western creator names so we could check?

If this is indeed fairly common, labeling a significant number of people with non-Western names as institutions seems a lot more problematic than the reverse and I'd go back on my opinion above...

mfenner commented 3 years ago

The presence of a comma is unfortunately not a good heuristic for DataCite, as many repositories use "givenName familyName", instead of "familyName, givenName".

The best solution is really using givenName and familyName. The reason we use a name dictionary is mainly that adoption of givenName/familyName is too low.
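To illustrate why a comma alone is a weak signal, here is a minimal sketch of such a heuristic. The function name `guess_name_type` and the sample names are hypothetical; this is not DataCite's or Dataverse's actual code, only an illustration of the trade-off described above.

```python
from typing import Optional


def guess_name_type(name: str,
                    given_name: Optional[str] = None,
                    family_name: Optional[str] = None) -> Optional[str]:
    """Guess a DataCite nameType; return None when there is no usable signal."""
    # Explicit givenName/familyName values are the most reliable signal.
    if given_name or family_name:
        return "Personal"
    # "familyName, givenName" form: a comma suggests a person, but
    # organization names can contain commas too.
    if "," in name:
        return "Personal"
    # No comma tells us nothing: the name could be "givenName familyName"
    # or an organization, so this sketch declines to guess.
    return None
```

With this sketch, "Smith, Jane" is labeled Personal and "Jane Smith" gets no label at all, while "Department of Sociology, Example University" is wrongly labeled Personal. Passing `given_name` and `family_name` removes the ambiguity.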

adam3smith commented 3 years ago

Just to be clear -- what we're after here is not to change what Datacite does but what Dataverse does in creating metadata submitted to Datacite -- Datacite just comes in because the Dataverse algorithm for handling names is derived from your code.

I think adding separate name fields would be quite challenging at this point, though I agree that it'd be much preferable.

mfenner commented 3 years ago

I understand. One important reason for "guessing" personal names is citation styles and formatted citations (as you of course know). DataCite introduced givenName and familyName a few years ago and it is still optional as it is indeed challenging to implement.

jggautier commented 3 years ago

Thanks @mfenner as always!

@adam3smith, there was a lot of discussion in #4257 about figuring out the nameType and adapting DataCite's algorithm to address failure cases discovered during QA, but I think the summary at https://github.com/IQSS/dataverse/issues/4257#issuecomment-483325169 still holds. It includes looking for an ORCID, but I don't think we looked at how well it works for non-Western names. I think we could contact folks from installations where non-Western names are common, possibly ones running 4.14+ Dataverse repositories, and could look at Dataverse Collections at Harvard that are more likely to contain non-Western creator names.

Maybe the outcome of this investigation would be to figure out whether or not we need to make it possible/easier for installations to turn off the nameType algorithm for the DataCite export. @adam3smith, @qqmyers, @djbrooke. How does that sound? And work to figure out how Dataverse repositories can better determine nameType can be done as part of another issue?

@adam3smith wrote:

There are a number of the fields that only make sense as conditionals (e.g. all the scheme/identifier fields). The solution described in #7606 looks good to me and would appear to solve this and seems to be scheduled to land in 5.4?

I agree and spoke with @scolapasta about the use cases and limits of https://github.com/IQSS/dataverse/issues/7606. My understanding is that it wouldn't address cases like the Related Publication field. @scolapasta could confirm, but from what I understand 7606 wouldn't let repositories say that if the ID Type is filled, the ID Number must also be filled (or vice versa), because that compound field also has two other fields, "Citation" and "URL". For the purposes of exporting metadata in the DataCite schema, I think those two fields should remain optional.

The code for the OpenAIRE export already handles Related Publication in a different way, only including that metadata if both ID Type and ID Number are filled (instead of taking the approach of #7606 to prompt depositors to enter the metadata the way the software/installation admins expect). I'm not sure if there are other fields to consider, but I don't think looking out for these cases will make this issue take any longer to work on.
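The "only include it when both are filled" rule can be sketched as follows. This is an illustration, not the exporter's actual code: the dictionary keys (`idType`, `idNumber`, `citation`, `url`) and the `relationType` value are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def related_identifier(pub: dict) -> Optional[ET.Element]:
    """Build a relatedIdentifier element, or None if ID Type or ID Number is missing."""
    id_type = (pub.get("idType") or "").strip()
    id_number = (pub.get("idNumber") or "").strip()
    # Citation and URL stay optional; only the type/number pair is required.
    if not (id_type and id_number):
        return None
    element = ET.Element("relatedIdentifier", {
        "relatedIdentifierType": id_type,
        "relationType": "IsSupplementTo",  # placeholder value, an assumption
    })
    element.text = id_number
    return element


complete = {"idType": "DOI", "idNumber": "10.1234/example", "citation": "Example citation"}
partial = {"idType": "DOI", "idNumber": "", "url": "https://example.org/paper"}
```

Here `related_identifier(complete)` yields an element, while `related_identifier(partial)` returns `None`, so the incomplete entry is left out of the export rather than producing an invalid record.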

@djbrooke wrote:

@jggautier if we bring this into the sprint starting tomorrow, would you have some time over the next two weeks to get into this with a developer that picks it up? If not, the sprint starting in two weeks? I know you're spread a bit thin right now with the UI/UX work starting up, so if it makes sense to wait two weeks I think that's fine. What do you think?

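Based on all of this I'm thinking two things should be done before this is ready for implementation, and I'd have time in the next two weeks to help do them:

  • a review of the metadata mapping. Like @adam3smith wrote, that shouldn't be too much trouble
  • a look into how well the nameType algorithm is working for non-Western creator names and if installations need a way to turn the algorithm off (and not include in the DataCite export a guess about if a creator is a person or organization)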

Then maybe we could aim for working on implementation in the following sprint? What do you all think?

mfenner commented 3 years ago

One small comment: the author of the library we use for names (https://github.com/berkmancenter/namae) is @inukshuk who @adam3smith knows from citationstyles work, maybe it is worth reaching out to him, e.g. to ask about handling of non-Western names.

qqmyers commented 3 years ago

One quick thought: It might be simple to add a person/org choice field and just use ‘the algorithm’ to pre-populate that for existing data, i.e. we only use it to handle legacy info rather than in an ongoing way. (Could even make it something that could be optional if admins don’t think it works well for their installations.)

-- Jim
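A rough sketch of that idea, assuming a hypothetical per-author `nameType` field and a pluggable guessing function: the algorithm runs once over legacy records, never overrides an explicit choice, and can be switched off by an admin. None of these names come from the Dataverse codebase.

```python
from typing import Callable, Dict, List, Optional


def comma_guess(name: str) -> Optional[str]:
    """Stand-in for 'the algorithm': only a comma counts as a signal here."""
    return "Personal" if "," in name else None


def prepopulate_name_types(authors: List[Dict[str, str]],
                           guess: Callable[[str], Optional[str]] = comma_guess,
                           enabled: bool = True) -> int:
    """Fill in missing nameType values once; return how many records changed."""
    if not enabled:  # admins who distrust the guesses can opt out
        return 0
    changed = 0
    for author in authors:
        if author.get("nameType"):
            continue  # an explicit person/org choice is never overridden
        guessed = guess(author["name"])
        if guessed is not None:
            author["nameType"] = guessed
            changed += 1
    return changed


legacy = [
    {"name": "Smith, Jane"},
    {"name": "Example University", "nameType": "Organizational"},
    {"name": "Jane Smith"},
]
```

Running `prepopulate_name_types(legacy)` changes only the first record. The second keeps its explicit value and the third stays unset, so a depositor or curator can fill it in through the new field.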

adam3smith commented 3 years ago

  • a review of the metadata mapping. Like @adam3smith wrote, that shouldn't be too much trouble
  • a look into how well the nameType algorithm is working for non-Western creator names and if installations need a way to turn the algorithm off (and not include in the DataCite export a guess about if a creator is a person or organization)

Then maybe we could aim for working on implementation in the following sprint? What do you all think?

That sounds good to me.

It might be simple to add a person/org choice field and just use ‘the algorithm’ to pre-populate that for existing data, i.e. we only use it to handle legacy info rather than in an ongoing way.

We'd be happy with this -- the more control we have over metadata the better -- but there may be concern about too many UI elements for self-deposit repositories.

abollini commented 3 years ago

Sorry for joining the discussion so late. I just want to add a reference to the in-progress update to the OpenAIRE DataArchive guidelines that will be based on the DataCite version 4 schema: https://openaire-guidelines-for-data-archive-managers.readthedocs.io/en/latest/index.html

This is essentially the new version of the guidelines that we were asked to develop for in 2018 (to be more specific, at that time we looked at the DataCite schema v4.1); that work was contributed to Dataverse in 4.14.

The OpenAIRE team is still working on the new version. I'll take the liberty of pinging them on this thread, https://github.com/openaire/guidelines-data-archives/issues/2, so that they will be aware of the work in progress in the Dataverse community.

jggautier commented 3 years ago

Hi @abollini. I don't think you're late at all. The status of this issue was brought up in a recent Dataverse community meeting, so I thought it would be helpful to write here that the plans being discussed in this GitHub issue for how to proceed haven't been started or finalized. I think it's great that the OpenAIRE team will be aware of this discussion. Thanks!

jggautier commented 2 years ago

Just noticed that in the DataCite exports of installations running Dataverse software v5.9 and maybe all earlier versions, parentheses are added to the Author Affiliation values that are put in DataCite's creator > affiliation element:

[Screenshot: Screen Shot 2022-02-24 at 12 42 02 PM]

The screenshot is from an export from Demo Dataverse, running v5.9. It's also done in this export from DataverseNL (v5.9)

Maybe this is because the code is getting what's displayed on the dataset page instead of what's entered in the field on the edit metadata page? It looks like that was the issue when Author Affiliation values were wrapped in parentheses in the search API results (https://github.com/IQSS/dataverse/issues/6570#issuecomment-582566703).

The OpenAIRE export doesn't include the parentheses, so I mention this bug in this issue since it seems natural that merging these two exports, or aligning them more, would also fix this parentheses bug.
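The suspected cause can be sketched like this, assuming the affiliation field has a display template of the form `(#VALUE)` that is applied for the dataset page. The template string and the function names are assumptions for illustration, not a confirmed description of the code.

```python
import xml.etree.ElementTree as ET

DISPLAY_FORMAT = "(#VALUE)"  # assumed display template for Author Affiliation


def display_value(raw: str, display_format: str = DISPLAY_FORMAT) -> str:
    """Format a field value the way the dataset page would show it."""
    return display_format.replace("#VALUE", raw)


def affiliation_xml(raw: str, use_display_value: bool) -> str:
    """Serialize an affiliation element from the display value or the raw value."""
    element = ET.Element("affiliation")
    element.text = display_value(raw) if use_display_value else raw
    return ET.tostring(element, encoding="unicode")


print(affiliation_xml("Example University", use_display_value=True))
print(affiliation_xml("Example University", use_display_value=False))
```

The first call reproduces the parenthesized value seen in the DataCite export, and the second gives the plain value that the OpenAIRE export produces. If the guess is right, the fix is to read the raw field value when building the export.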

jggautier commented 8 months ago

I'm proposing we work on merging the DataCite and OpenAIRE exports as soon as possible. In addition to addressing the challenges I wrote about earlier in this GitHub issue related to maintaining these two exports, I think merging the two exports is a step toward the goal of sending to DataCite the metadata that Dataverse repositories have been collecting for years, while we continue research and design work toward the goals of collecting and distributing more metadata about related research objects, different types of persistent IDs of people and organizations that are associated with deposited data and code, and geospatial metadata. This is part of a proposal outlined at https://github.com/IQSS/dataverse.harvard.edu/issues/230 for better alignment with NIH GREI's metadata recommendations, which largely mirror the Dataverse community's long-standing goals of collecting and sending more metadata to DataCite.

With the current DataCite and OpenAIRE exports merged:

This proposal would also address most of what's discussed in the GitHub issue at https://github.com/IQSS/dataverse/issues/8108. I think that GitHub issue can be closed after the DataCite and OpenAIRE exports are merged. The parts of the discussion in that GitHub issue that would be unaddressed, about sending more metadata about related research objects, would be addressed separately in later stages of a proposal outlined at https://github.com/IQSS/dataverse.harvard.edu/issues/230.

jggautier commented 7 months ago

Just an update about what I wrote last week about consulting @abollini about the related comments he left in a GitHub issue at https://github.com/openaire/guidelines-data-archives/issues/2.

In that issue I commented to let @abollini know about this proposal to merge the two exports, asked for feedback about having the merged export's schemaLocation point to the xsd of DataCite's 4.5 schema, and asked for more information about bringing "arguments from the OpenAIRE team" to this effort.
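For readers unfamiliar with `schemaLocation`, the sketch below shows how a record's root element declares which schema version it targets, and how a harvester could read that back. The namespace and XSD URLs follow the pattern DataCite uses for its kernel-4 schemas, but treat the exact strings as assumptions to verify against DataCite's documentation.

```python
import re
import xml.etree.ElementTree as ET

DATACITE_NS = "http://datacite.org/schema/kernel-4"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
# Assumed location of the 4.5 XSD; check before relying on it.
SCHEMA_XSD = "http://schema.datacite.org/meta/kernel-4.5/metadata.xsd"


def new_resource() -> ET.Element:
    """Create an empty DataCite resource root that points at the 4.5 XSD."""
    ET.register_namespace("", DATACITE_NS)
    ET.register_namespace("xsi", XSI_NS)
    root = ET.Element(f"{{{DATACITE_NS}}}resource")
    root.set(f"{{{XSI_NS}}}schemaLocation", f"{DATACITE_NS} {SCHEMA_XSD}")
    return root


def declared_schema_version(root: ET.Element) -> str:
    """Return the kernel version named in xsi:schemaLocation, or '' if absent."""
    location = root.get(f"{{{XSI_NS}}}schemaLocation", "")
    match = re.search(r"kernel-([\d.]+)/metadata\.xsd", location)
    return match.group(1) if match else ""
```

A check like `declared_schema_version` would also make it easy to spot records that still reference an older schema, such as the 3.1 reference mentioned at the top of this issue.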

adam3smith commented 7 months ago

Thanks Julian -- we'd be very happy to see this merged and I think it'd have significant downstream benefits to improve the Dataverse-deposited metadata with Datacite this way.

poikilotherm commented 7 months ago

I'm still having this crazy idea about generating model classes from the Schema XSDs and creating mappers from our internal metadata model to the target model...

jggautier commented 7 months ago

Hey @poikilotherm, would this be a better way to change the exports? Would it take a lot of time to do?

sbarbosadataverse commented 7 months ago

Ceilyn and Sonia prioritized this and moved it to sprint ready as part of GREI Y3 planning. @jggautier @scolapasta Please weigh in if you have objections.

poikilotherm commented 7 months ago

@jggautier I put together a very simple demonstrator for the generator part, using the DataCite 4.5 Kernel. (It does not include the mapper part, where we map our internal to the generated model. I could create an example exporter for that if you want.) To run the example, use this:

git clone --branch 5889-gen-schema-pojos https://github.com/IQSS/dataverse.git dataverse
cd dataverse
mvn -f modules/dataverse-schemas package

Aside from that, here's the comparison: https://github.com/IQSS/dataverse/compare/5889-gen-schema-pojos

jggautier commented 7 months ago

Thanks @sbarbosadataverse. I don't have any objections to this being prioritized and moved to sprint ready. I'm worried we won't hear back from folks from OpenAIRE by the end of the sprint next Wednesday. I'll reach out to @abollini again in https://github.com/openaire/guidelines-data-archives/issues/2

@poikilotherm I'm hesitant to try to better understand what generators are. But could you write about the benefits? For example, does it make it easier to change the exporters?

poikilotherm commented 7 months ago

Currently, for DataCite we use a template approach, combined with XML processing. For DDI we use, AFAIK, an XML-only processing approach. For our JSON-based exports we use mostly JSON processing.

The point is: all of this is hand crafted. The implementation is done by us and we need to make sure the serialized output matches the specifications involved. We also provide the mapping from our internal model to the target model with these serializers.

When using generators, parts of the process are turned upside down. You start with the spec (XML XSD, JSON Schema, OpenAPI...) and you use a tool to generate model classes out of it.

The result is a set of classes that can be serialized to the target output data using the data binding mechanisms included in the Jakarta standards. Beyond that, these classes can also be used for the inverted process: deserialization from some data to the model. An example would be importing DataCite XML from OAI-PMH: use the data binding to get a populated Java model of the data.
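To make the round trip concrete, here is a small stand-in written in Python rather than Java. The two classes are hand-written for illustration and cover only a tiny, namespace-free slice of a DataCite-like record; in the approach described above they would be generated from the XSD and bound with the Jakarta mechanisms instead.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Creator:
    creator_name: str


@dataclass
class Resource:
    identifier: str
    creators: List[Creator] = field(default_factory=list)

    def to_xml(self) -> str:
        """Serialize the model to XML (the export direction)."""
        root = ET.Element("resource")
        ET.SubElement(root, "identifier").text = self.identifier
        creators = ET.SubElement(root, "creators")
        for creator in self.creators:
            element = ET.SubElement(creators, "creator")
            ET.SubElement(element, "creatorName").text = creator.creator_name
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, xml: str) -> "Resource":
        """Populate the model from XML (the import or harvest direction)."""
        root = ET.fromstring(xml)
        return cls(
            identifier=root.findtext("identifier", default=""),
            creators=[Creator(el.findtext("creatorName", default=""))
                      for el in root.iterfind("creators/creator")],
        )


record = Resource("10.1234/example", [Creator("Smith, Jane")])
assert Resource.from_xml(record.to_xml()) == record  # the round trip is lossless
```

The point of the sketch is that one model serves both directions. With generated classes, that symmetry comes from the tooling instead of from hand-maintained `to_xml` and `from_xml` methods.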

As the model classes are generated from the spec, they are known to represent all of the spec in the model. We might not use all of the available modeling, but at least we can easily extend without much hassle.

As long as the generator tools don't make mistakes, the data binding will always produce valid output data and will always map correct input data back to the model.

Using our own implementations for de-/serialization requires extensive testing and also a lot of manual work to implement every change, etc.

The availability of schemas and model classes for them allows much stricter enforcement of data validity at compile time and runtime. Constraints about the data from the spec are transported into the data model, allowing for simpler interaction with the model from code, with the Java compiler assisting you as you build it. Example: most generators will allow you to create a Fluent API for the model.

For the exporters, having schemas around (and I'm talking about more than just DataCite) will also allow for a clearer defined data exchange between the core application and plugged in exporters. The model classes provide Data Transfer Objects as a side product.

Also, upgrading schemas is improved. We can include a generated data model version for any version of a schema. If we want to change the supported schema version, the Java code can help us determine what to change and how. It's much clearer in code what is supported and what isn't. Changing a version means changing the import path for the classes.

Brain dump out.

jggautier commented 4 months ago

@cmbz asked me to add a status update to this GitHub issue. There's discussion and related work in the pull request at https://github.com/IQSS/dataverse/pull/10615 that addresses at least some of what's been proposed in this GitHub issue.

DS-INRAE commented 4 months ago

Another related issue:

cmbz commented 3 months ago

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

pdurbin commented 3 months ago

This issue has an open PR...

... so I'm reopening it. It'll be closed when we merge it.

pdurbin commented 2 months ago

We're now using this PR instead to close this issue:

jggautier commented 2 months ago

Thanks for the heads up @pdurbin. I'm going to keep this issue open, or I guess re-open it after that PR is merged, so that I can see what decisions were made and what goals and questions aren't addressed yet.

pdurbin commented 2 months ago

@jggautier sounds good. Perhaps we can create a new issue with any remaining items.