Align or merge DataCite metadata exports

jggautier commented 5 years ago

This issue is meant to record the differences between Dataverse's two newest metadata exports as of v4.14, "DataCite"/"Datacite" and "OpenAIRE"/"oai_datacite", and discussion about how to align (or possibly merge) the very similar exports.

As part of v4.10 (released in Dec. 2018), Dataverse makes available through the UI, API and over OAI-PMH dataset metadata in the DataCite schema (https://github.com/IQSS/dataverse/issues/5043). This lets Dataverse export dataset metadata in a widely-used, discipline-agnostic schema that's more standardized than Schema.org and has more metadata than Dublin Core.

As part of v4.14 (released in May 2019), Dataverse makes available through the UI, API and over OAI-PMH DataCite metadata that complies with OpenAIRE requirements (https://github.com/IQSS/dataverse/issues/4257). Repositories need to follow these requirements in order for their dataset metadata to be made discoverable (harvested) by OpenAIRE (OpenAIRE EXPLORE). The OpenAIRE metadata requirements follow the DataCite schema, with some differences between OpenAIRE and DataCite listed in their documentation.

What both exports are called depending on the export method:

[Image: openaire-datacite-export-graphic]

Both metadata exports are based on DataCite 4 and are meant to be valid against the DataCite 4 schema (although the XML records available over OAI-PMH in "Datacite" format reference DataCite's 3.1 schema). But Dataverse exports them as separate formats for several reasons:

The two metadata exports were worked on at different times by different groups
When work on making Dataverse OpenAIRE compliant started, I thought the OpenAIRE export would follow the DataCite 3.1 schema, since the OpenAIRE guidelines for data repositories follow DataCite 3.1. I also knew that Dataverse would eventually export DataCite 4 metadata, so it made sense to make them separate exports. But we're told the OpenAIRE folks plan to update their guidelines, so our 4Science colleagues created the OpenAIRE export following the DataCite 4 schema. (For example, a notable difference between DataCite 3 and 4 is how funder information is handled. The OpenAIRE guidelines mandate that the contributorType property is used, which is how DataCite 3 handles funder info. But Dataverse's OpenAIRE export is using the DataCite 4 fundingReferences property instead.)
The "OpenAIRE" metadata export uses an algorithm that adds metadata about whether dataset authors and contact persons are people or organizations (in DataCite's nameType attribute). The algorithm was the last thing discussed in the OpenAIRE GitHub issue.

Ideally, Dataverse would export only one metadata record, made available through the UI, API and over OAI-PMH, that follows the DataCite schema and is also OpenAIRE compliant. As things are now, Dataverse exports two metadata records that are both based on DataCite but differ from each other, and people have been confused about the differences between the two exports, which are called "DataCite" and "OpenAIRE" in the UI and "Datacite" and "oai_datacite" in the API endpoints and over OAI-PMH.

But we may want to maintain two metadata exports because:

the OpenAIRE export is using the nameType algorithm, which was tested during QA but only tested for evidence that the algorithm would work in at least some cases. We haven't tried to estimate how often it will correctly figure out if author/contact names of actual datasets are people or organizations (although it's based on an algorithm DataCite uses that we're told is right over 90% of the time). Would people want to be able to export or harvest metadata that does not include the nameType metadata (maybe because they find that it's not correct often enough)?
the OpenAIRE export uses one of four mandatory Access Rights terms. The rules that Dataverse uses to determine this are discussed in a GitHub issue comment. But I realized recently that the rules are too simple and lead to cases where datasets are marked as closedAccess when restricted access is more appropriate (e.g. https://doi.org/10.7910/DVN/0PMZC6, where file request is disabled, but people can request access through a process that happens outside of Dataverse). A GitHub issue about this has been opened (https://github.com/IQSS/dataverse/issues/5920), so we can figure out how to assign more appropriate access rights to datasets. Until then, would people want to be able to export or harvest metadata that does not include these sometimes misleading Access Rights?

We should decide if:

Dataverse should maintain one export or two and
If maintaining only one export, make sure that it has all of the metadata available in the current two exports.
If maintaining two exports, make sure that each export contains as close to the same metadata as the other (and that they stay as synced as possible) and document what the differences are. (As of v4.14 the "OpenAIRE" export has more metadata than the "DataCite" export but there are things missing in both.)

mheppler commented 3 years ago

Related? Silent publishing failure when not all fields required by Datacite are present #7551

jggautier commented 3 years ago

Good point. It could be related if/when Dataverse repositories start sending more metadata to DataCite and the dependencies among the child fields of any of that metadata are the same as the dependencies among the child fields in the Producer compound field (which right now is the only field causing those silent failures).

adam3smith commented 3 years ago

@qqmyers and I are also looking at this given that what we're currently sending to DataCite is indeed rather inadequate. Looking at the Crosswalk Julian put together, it seems to me that the current OpenAIRE export is strictly better. The only field I'm seeing where DataCite has something and OpenAIRE doesn't is Name Identifier schemeURI and that's either just not documented or an oversight that should be fixed.

I'm not at all concerned about the naming algorithm. If anything, I think it's a good idea to try to guess organizational names. I think the closed vs. restricted data categorization is something that should get addressed, but I don't see it as a blocker.

Given this, I think a single export format makes sense.

In terms of items missing from both exports, the citation metadata looks complete, but the individual subject blocks seem to have some stuff missing. From @philippconzett 's list at #7072 that's most notably the geography data, which we'd also like to capture.

We're viewing this as pretty high priority given how widely DataCite data are used (e.g. the fact that we're not linking up our funding information to the PID graph isn't great) -- is there anything we can do to help move this along?

djbrooke commented 3 years ago

Thanks @adam3smith.

@jggautier if we bring this into the sprint starting tomorrow, would you have some time over the next two weeks to get into this with a developer that picks it up? If not, the sprint starting in two weeks? I know you're spread a bit thin right now with the UI/UX work starting up, so if it makes sense to wait two weeks I think that's fine. What do you think?

poikilotherm commented 3 years ago

Is my #7077 related here, too? (Going to work on that, you folks know... Funding...)

djbrooke commented 3 years ago

@poikilotherm May be related, but I think we'd want to move these forward independently IMHO. I think much of the discussion around #7077 will happen as part of the Software Metadata WG.

adam3smith commented 3 years ago

Awesome! @jggautier -- I think you have this covered, but if there's anything you'd like another set of eyes on or a 2nd opinion just tag and/or email me.

jggautier commented 3 years ago

Thanks @adam3smith. Great to hear there's more interest in prioritizing this! I'm all on board with saving the closed vs. restricted data categorization problem (https://github.com/IQSS/dataverse/issues/5920) for another day if it moves this issue forward. I think there are a few other things we should consider:

Is there any reason why repositories wouldn't like the nameType algorithm? Is there a way to test how well it's been working generally and for certain types of names? @adam3smith or @qqmyers, would you happen to know how DataCite figured out that the algorithm they use works 90% of the time? Or should we include the nameType algorithm and later on, separate from this, figure out how well it's working?
When more metadata is sent to DataCite, we should make sure we don't run into the compound field dependency issues that the Producer metadata field had (discussed in https://github.com/IQSS/dataverse/issues/7551). (For example, the OpenAIRE export deals with missing Related Publication metadata by including it only if certain fields are filled.)
The OpenAIRE export uses the IsCitedBy relationship when including metadata from the Related Publication field. We never really resolved how to use DataCite's relation terms (discussed in https://github.com/IQSS/dataverse/issues/2778). I think we could:
- Work out how to allow depositors to define different types of relationships between their datasets and related text-based publications (like articles) and/or make it easier for repositories to choose what types of relationships they want their depositors to use. This might involve UI changes.
- Decide with the Dataverse community which one relation term to use and expect people and other systems (harvesters, indexers, etc) to interpret that term very broadly (like "this publication is somehow related to this dataset"). Then I think this term could logically be applied to the Related Publication metadata in datasets that Dataverse repositories have already published.
- Decide with the Dataverse community to use one term that we define more narrowly (like "this dataset is cited by this publication"). But does it make sense to apply that term to the metadata of existing datasets? Not all repositories know what types of relationships their depositors had in mind when entering Related Publication metadata. I'd guess a majority of the time, the dataset is used to support findings/conclusions made in an article, but the article may not be citing the dataset. Could there be other reasons why a dataset is associated with something like a journal article? And will people and other systems ever care about/rely on the differences between the relationship types? (I think for MakeDataCount, the answer right now is no: when citations are counted, any one of several types of relation terms is valid because repositories are using the terms in different ways, so the standard's designers don't want to be too strict about which relation term or terms signal a "citation".)
- Not include Related Publication metadata in this new, merged DataCite metadata export and tackle #2778 separately.

adam3smith commented 3 years ago

Thanks Julian.

I have no insight into the Datacite algorithm for distinguishing between corporate and personal authors. Maybe @mfenner would be willing to chime in?
There are a number of the fields that only make sense as conditionals (e.g. all the scheme/identifier fields). The solution described in #7606 looks good to me and would appear to solve this and seems to be scheduled to land in 5.4?
My view would be that we need to allow some more flexibility in related terms, which makes #2778 fairly complex (there's a reason it's been open for so long) and we should not let it block the low hanging fruits, i.e. go with Julian's last option and tackle it separately.

mfenner commented 3 years ago

Users can set Personal or Organizational authors via nameType. Otherwise DataCite is doing the following:

if there is an ORCID associated with the author, it is a person
if there is a givenName, it is a person
if the creatorName has something that looks like a givenName, and that givenName is in a dictionary of known given names (using https://github.com/berkmancenter/namae), it is a person. This is where the 90% comes from. The dictionary is not as good with non-European names, and there are organization names that contain a given name (e.g. "Alfred P. Sloan Foundation").
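The three checks listed above can be sketched roughly as follows. This is only an illustration of the decision order described in this comment, not DataCite's or Dataverse's actual code: the function name, the tiny `KNOWN_GIVEN_NAMES` set (standing in for the namae dictionary) and the "any word in the name" matching rule are all assumptions, and returning `None` when nothing matches is also an assumption, since the comment doesn't say what happens in that case.

```python
import re

# Hypothetical stand-in for the dictionary of known given names.
KNOWN_GIVEN_NAMES = {"alfred", "julian", "maria"}


def guess_name_type(creator_name, orcid=None, given_name=None,
                    known_given_names=KNOWN_GIVEN_NAMES):
    """Return "Personal" if one of the three checks matches, else None."""
    # 1. An ORCID associated with the author means it is a person.
    if orcid:
        return "Personal"
    # 2. An explicit givenName means it is a person.
    if given_name:
        return "Personal"
    # 3. Otherwise look for a word in the name that is a known given name.
    #    (Assumption: any alphabetic word counts as "looks like a givenName".)
    words = re.findall(r"[^\W\d_]+", creator_name.lower())
    if any(word in known_given_names for word in words):
        return "Personal"
    # No check matched: this sketch makes no guess.
    return None
```

With this toy dictionary, `guess_name_type("Alfred P. Sloan Foundation")` returns `"Personal"`, which reproduces the false positive mentioned above, and a name with no dictionary match, such as `"Mao Zedong"`, gets no guess at all.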

adam3smith commented 3 years ago

Thanks! Dataverse currently doesn't have a nameType option, which is why we need some sort of algorithmic solution to determine this.

The ORCID option makes sense
Since Dataverse doesn't have separate given/family name fields, I'm guessing the option here is to use the presence of a comma as a heuristic (that's what Zotero would do on import and it generally works pretty well). The problem is that this will have a fair number of false positives with non-Western names, as it's common to enter names without a comma and often in familyname/givenname order (e.g., Mao Zedong)

Since the name list also sounds like it works less well for non-Western names, I'd actually now be somewhat nervous about this. Do you have contacts at some of the Chinese DV installations we could ask or are there Dataverse Collections at Harvard more likely to contain non-Western creator names so we could check?

If this is indeed fairly common, labeling a significant number of people with non-Western names as institutions seems a lot more problematic than the reverse and I'd go back on my opinion above...

mfenner commented 3 years ago

The presence of a comma is unfortunately not a good heuristic for DataCite, as many repositories use "givenName familyName", instead of "familyName, givenName".

The best solution is really using givenName and familyName. The reason we use a name dictionary is mainly that adoption of givenName/familyName is too low.

adam3smith commented 3 years ago

Just to be clear -- what we're after here is not to change what Datacite does but what Dataverse does in creating metadata submitted to Datacite -- Datacite just comes in because the Dataverse algorithm for handling names is derived from your code.

I think adding separate name fields would be quite challenging at this point, though I agree that it'd be much preferable.

mfenner commented 3 years ago

I understand. One important reason for "guessing" personal names is citation styles and formatted citations (as you of course know). DataCite introduced givenName and familyName a few years ago and it is still optional as it is indeed challenging to implement.

jggautier commented 3 years ago

Thanks @mfenner as always!

@adam3smith, there was a lot of discussion in #4257 about figuring out the nameType and adapting DataCite's algorithm to address failure cases discovered during QA. I think the summary at https://github.com/IQSS/dataverse/issues/4257#issuecomment-483325169 still holds, and it includes looking for an ORCID, but I don't think we looked at how well it works for non-Western names. I think we could contact folks from installations where non-Western names are common, possibly ones running Dataverse 4.14+, and could look at Dataverse Collections at Harvard that are more likely to contain non-Western creator names.

Maybe the outcome of this investigation would be to figure out whether or not we need to make it possible/easier for installations to turn off the nameType algorithm for the DataCite export. @adam3smith, @qqmyers, @djbrooke. How does that sound? And work to figure out how Dataverse repositories can better determine nameType can be done as part of another issue?

@adam3smith wrote:

There are a number of the fields that only make sense as conditionals (e.g. all the scheme/identifier fields). The solution described in #7606 looks good to me and would appear to solve this and seems to be scheduled to land in 5.4?

I agree and spoke with @scolapasta about the use cases and limits of https://github.com/IQSS/dataverse/issues/7606. My understanding is that it wouldn't address cases like the Related Publication field. @scolapasta could confirm, but from what I understand #7606 wouldn't let repositories say that if the ID Type is filled, the ID Number must also be filled (or vice versa), because that compound field also has two other fields, "Citation" and "URL", which I think should remain optional for the purposes of exporting metadata in the DataCite schema.

The code for the OpenAIRE export already handles Related Publication in a different way, only including that metadata if both ID Type and ID Number are filled (instead of taking the approach of #7606 to prompt depositors to enter the metadata the way the software/installation admins expect). I'm not sure if there are other fields to consider, but I don't think looking out for these cases will make this issue take any longer to work on.
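The "only include it if both ID Type and ID Number are filled" rule described above can be sketched like this. It's a minimal illustration, not the exporter's code: the dictionary keys (`id_type`, `id_number`, `citation`, `url`) and function names are hypothetical, the ID Type value is passed through unchanged, and `IsCitedBy` is used because that is the relation the OpenAIRE export is described as using earlier in this issue.

```python
def related_publication_identifier(publication):
    """Return relatedIdentifier data, or None if the entry is incomplete."""
    id_type = (publication.get("id_type") or "").strip()
    id_number = (publication.get("id_number") or "").strip()
    # Both child fields must be filled; Citation and URL stay optional
    # and do not affect whether the identifier is exported.
    if not id_type or not id_number:
        return None
    return {
        "relatedIdentifierType": id_type,
        "relationType": "IsCitedBy",
        "value": id_number,
    }


def related_identifiers(publications):
    """Keep only the Related Publication entries that are complete."""
    candidates = (related_publication_identifier(p) for p in publications)
    return [c for c in candidates if c is not None]
```

An entry with only a Citation or URL is silently left out of the export rather than causing a failure, which is the difference from the #7606 approach of prompting depositors to complete the fields.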

@djbrooke wrote:

@jggautier if we bring this into the sprint starting tomorrow, would you have some time over the next two weeks to get into this with a developer that picks it up? If not, the sprint starting in two weeks? I know you're spread a bit thin right now with the UI/UX work starting up, so if it makes sense to wait two weeks I think that's fine. What do you think?

Based on all of this I'm thinking two things should be done before implementation work starts, and I'd have time in the next two weeks to help do them:

a review of the metadata mapping. Like @adam3smith wrote, that shouldn't be too much trouble
a look into how well the nameType algorithm is working for non-Western creator names and if installations need a way to turn the algorithm off (and not include in the DataCite export a guess about if a creator is a person or organization)

Then maybe we could aim for working on implementation in the following sprint? What do you all think?

mfenner commented 3 years ago

One small comment: the author of the library we use for names (https://github.com/berkmancenter/namae) is @inukshuk who @adam3smith knows from citationstyles work, maybe it is worth reaching out to him, e.g. to ask about handling of non-Western names.

qqmyers commented 3 years ago

One quick thought: It might be simple to add a person/org choice field and just use ‘the algorithm’ to pre-populate that for existing data, i.e. we only use it to handle legacy info rather than in an ongoing way. (Could even make it something that could be optional if admins don’t think it works well for their installations.)

-- Jim

adam3smith commented 3 years ago

a review of the metadata mapping. Like @adam3smith wrote, that shouldn't be too much trouble

a look into how well the nameType algorithm is working for non-Western creator names and if installations need a way to turn the algorithm off (and not include in the DataCite export a guess about if a creator is a person or organization)

Then maybe we could aim for working on implementation in the following sprint? What do you all think?

That sounds good to me.

It might be simple to add a person/org choice field and just use ‘the algorithm’ to pre-populate that for existing data, i.e. we only use it to handle legacy info rather than in an ongoing way.

We'd be happy with this -- the more control we have over metadata the better -- but there may be concern about too many UI elements for self-deposit repositories.

abollini commented 3 years ago

Sorry for joining the discussion so late. I just want to add a reference to the in-progress update to the OpenAIRE DataArchive guidelines, which will be based on version 4 of the DataCite schema: https://openaire-guidelines-for-data-archive-managers.readthedocs.io/en/latest/index.html

This is essentially the new version of the guidelines that we were asked to develop for in 2018 (to be more specific, at that time we looked at DataCite schema v4.1), and that work was contributed to Dataverse in 4.14.

The OpenAIRE team is still working on the new version. I've taken the liberty of pinging them in this thread, https://github.com/openaire/guidelines-data-archives/issues/2, so that they are aware of the work in progress in the Dataverse community.

jggautier commented 3 years ago

Hi @abollini. I don't think you're late at all. The status of this issue was brought up in a recent Dataverse community meeting, so I thought it would be helpful to write here that the plans being discussed in this GitHub issue for how to proceed haven't been started or finalized. I think it's great that the OpenAIRE team will be aware of this discussion. Thanks!

jggautier commented 2 years ago

Just noticed that in the DataCite exports of installations running Dataverse software v5.9 and maybe all earlier versions, parentheses are added to the Author Affiliation values that are put in DataCite's creator > affiliation element:

The screenshot is from an export from Demo Dataverse, running v5.9. It's also done in this export from DataverseNL (v5.9)

Maybe this is because the code is getting what's displayed on the dataset page instead of what's entered in the field on the edit metadata page? Looks like that was the issue when Author Affiliation values were wrapped in parentheses in the search API results (https://github.com/IQSS/dataverse/issues/6570#issuecomment-582566703)

The OpenAIRE export doesn't include the parentheses, so I mention this bug in this issue since it seems natural that merging these two exports, or aligning them more, would also fix this parentheses bug.

jggautier commented 8 months ago

I'm proposing we work on merging the DataCite and OpenAIRE exports as soon as possible. In addition to addressing the challenges I wrote about earlier in this GitHub issue related to maintaining these two exports, I think merging the two exports is a step toward the goal of sending to DataCite the metadata that Dataverse repositories have been collecting for years, while we continue research and design work toward the goals of collecting and distributing more metadata about related research objects, different types of persistent IDs of people and organizations that are associated with deposited data and code, and geospatial metadata. This is part of a proposal outlined at https://github.com/IQSS/dataverse.harvard.edu/issues/230 for better alignment with NIH GREI's metadata recommendations, which largely mirror the Dataverse community's long-standing goals of collecting and sending more metadata to DataCite.

With the current DataCite and OpenAIRE exports merged:

There will be just one export, called "DataCite" in the "Metadata Export" dropdown on the deposit page, and called oai_datacite as a metadataFormat in OAI-PMH feeds. This way Dataverse repositories that need to make the metadata available to OpenAIRE's systems can continue providing the oai_datacite metadataFormat in their OAI-PMH feeds.
The metadata in the merged export will be sent to DataCite when dataset versions are published, just as Dataverse currently sends to DataCite the metadata in the DataCite export.
The XML's schemaLocation attribute will point to the xsd of DataCite's 4.5 schema, such as <resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 https://schema.datacite.org/meta/kernel-4.5/metadata.xsd">

This is mostly a cosmetic and maybe symbolic change for now, and I think not technically beneficial or necessary, as nothing in the metadata included in the current DataCite and OpenAIRE exports was introduced in more recent versions of DataCite's schema. The DataCite and OpenAIRE exports are valid against DataCite's 4.5 schema for the most part (see the next point for why the OpenAIRE export isn't always valid), so I think it's fine to have the XML export's schemaLocation attribute point to the xsd of DataCite's 4.5 schema.

Doing this may have some symbolic benefits, as this change might also signal to stakeholders that the Dataverse community is intent on taking advantage of changes that DataCite has introduced in the more recent versions of their schema as we work on collecting and sending more metadata to DataCite (in efforts outside the scope of this proposal).

@abollini should also be consulted about the related comments he left in a GitHub issue at https://github.com/openaire/guidelines-data-archives/issues/2
The HTML tags present in the Description metadata field, and possibly other metadata fields where depositors can type in HTML tags, are removed when they're included in the export. The HTML tags aren't present in the current DataCite exports, but are in the OpenAIRE exports. And when I try to validate the current OpenAIRE export, the validator I use complains about the HTML tags, e.g. <b>, in the abstract/description element.
The merged export includes the nameIdentifier schemeURIs of all Identifier Types of the Author field
The merged export does not include the parentheses around the Affiliations of the Author field. This would address part of https://github.com/IQSS/dataverse/issues/9330
The merged export includes nameType guesses for names entered in the Producer field and names entered in the Contributor field
The merged export includes guesses about the first and last names of contributors when the nameType algorithm guesses that names entered in Contributor Name fields are people
The merged export includes the relationships between the deposit and its files, such as <relatedIdentifier relatedIdentifierType="DOI" relationType="HasPart">doi:10.70122/FK2/O8JAOY/FYOKMN</relatedIdentifier>. This metadata is included in the current DataCite export and not included in the current OpenAIRE export.
The merged export uses the terms from the info:eu-repo-Access-Terms vocabulary, which the OpenAIRE export uses to follow OpenAIRE's guidelines, e.g. <rights rightsURI="info:eu-repo/semantics/openAccess"/>, using logic summarized in the GitHub issue comment at https://github.com/IQSS/dataverse/issues/5920#issue-452786970.

We've documented challenges with this logic; see https://github.com/IQSS/dataverse/issues/5920. Those challenges can be addressed in efforts separate from this proposal.
The PIDs entered in the Related Publications field will be included, as the current OpenAIRE export includes, but using the relationType "IsSupplementTo", instead of "IsCitedBy", to indicate the relationship between the deposit and the related text-based publication.

Among other reasons, using "IsSupplementTo" will make it easier for QDR's repository, which is already using "IsSupplementTo", to use this merged export that will ship with Dataverse instead of maintaining this part of their Dataverse fork. I've also checked the DataCite exports of 78 repositories that use Dataverse, and only QDR's repository seems to be sending this metadata to DataCite. Dataverse e-cienciaDatos used to send this metadata, using the "IsCitedBy" relationType, but I don't see it in their DataCite exports anymore. I've emailed the contacts we have for that repository to find out why, but that's outside of the scope of this proposal.

This will also do what @pkiraly proposed in the PR at https://github.com/IQSS/dataverse/pull/8357, which can be closed after the DataCite and OpenAIRE exports are merged.

This proposal would also address most of what's discussed in the GitHub issue at https://github.com/IQSS/dataverse/issues/8108. I think that GitHub issue can be closed after the DataCite and OpenAIRE exports are merged. The parts of the discussion in that GitHub issue that would be unaddressed, about sending more metadata about related research objects, would be addressed separately in later stages of a proposal outlined at https://github.com/IQSS/dataverse.harvard.edu/issues/230.
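To make a few of the points in the list above more concrete (the 4.5 schemaLocation, removing HTML tags from the description, the access rights URI, and "IsSupplementTo" for related publications), here is a rough sketch. It is not the Dataverse exporter and does not produce a complete or valid DataCite record: the function names are made up for illustration, the DOI in the usage note is a placeholder, and only Python's standard library is used.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

DATACITE_NS = "http://datacite.org/schema/kernel-4"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA_LOCATION = (
    "http://datacite.org/schema/kernel-4 "
    "https://schema.datacite.org/meta/kernel-4.5/metadata.xsd"
)

# Serialize DataCite elements without a prefix and XSI attributes as "xsi:".
ET.register_namespace("", DATACITE_NS)
ET.register_namespace("xsi", XSI_NS)


class _TextCollector(HTMLParser):
    """Collects text content only, dropping tags such as <b>."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return the value with HTML tags removed."""
    collector = _TextCollector()
    collector.feed(value)
    collector.close()
    return "".join(collector.parts)


def _add(parent, name, attrib=None, text=None):
    """Add a child element in the DataCite namespace."""
    child = ET.SubElement(parent, f"{{{DATACITE_NS}}}{name}", attrib or {})
    child.text = text
    return child


def build_fragment(description, rights_uri, related_publication_dois):
    """Build a partial <resource> element as an XML string."""
    resource = ET.Element(
        f"{{{DATACITE_NS}}}resource",
        {f"{{{XSI_NS}}}schemaLocation": SCHEMA_LOCATION},
    )
    descriptions = _add(resource, "descriptions")
    _add(descriptions, "description",
         {"descriptionType": "Abstract"}, strip_html(description))
    rights_list = _add(resource, "rightsList")
    _add(rights_list, "rights", {"rightsURI": rights_uri})
    if related_publication_dois:
        related = _add(resource, "relatedIdentifiers")
        for doi in related_publication_dois:
            _add(related, "relatedIdentifier",
                 {"relatedIdentifierType": "DOI",
                  "relationType": "IsSupplementTo"}, doi)
    return ET.tostring(resource, encoding="unicode")
```

For example, `build_fragment("A <b>bold</b> claim", "info:eu-repo/semantics/openAccess", ["10.1234/example"])` returns a `resource` element whose description no longer contains the `<b>` tags and whose related publication is linked with `IsSupplementTo`.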

jggautier commented 7 months ago

Just an update about what I wrote last week about consulting @abollini about the related comments he left in a GitHub issue at https://github.com/openaire/guidelines-data-archives/issues/2.

In that issue I commented to let @abollini know about this proposal to merge the two exports, asked for feedback about having the merged export's schemaLocation point to the xsd of DataCite's 4.5 schema, and asked for more information about bringing "arguments from the OpenAIRE team" to this effort.

adam3smith commented 7 months ago

Thanks Julian -- we'd be very happy to see this merged and I think it'd have significant downstream benefits to improve the Dataverse-deposited metadata with DataCite this way.

poikilotherm commented 7 months ago

I'm still having this crazy idea about generating model classes from the schema XSDs and creating mappers from our internal metadata model to the target model...

jggautier commented 7 months ago

Hey @poikilotherm, would this be a better way to change the exports? Would it take a lot of time to do?

sbarbosadataverse commented 7 months ago

Ceilyn and Sonia prioritized this and moved it to sprint ready as part of GREI Y3 planning. @jggautier @scolapasta Please weigh in if you have objections.

poikilotherm commented 7 months ago

@jggautier I put together a very simple demonstrator for the generator part, using the DataCite 4.5 Kernel. (It does not include the mapper part, where we map our internal to the generated model. I could create an example exporter for that if you want.) To run the example, use this:

git clone --branch 5889-gen-schema-pojos https://github.com/IQSS/dataverse.git dataverse
cd dataverse
mvn -f modules/dataverse-schemas package

Aside from that, here's the comparison: https://github.com/IQSS/dataverse/compare/5889-gen-schema-pojos

jggautier commented 7 months ago

Thanks @sbarbosadataverse. I don't have any objections to this being prioritized and moved to sprint ready. I'm worried we won't hear back from folks from OpenAIRE by the end of the sprint next Wednesday. I'll reach out to @abollini again in https://github.com/openaire/guidelines-data-archives/issues/2

@poikilotherm I'm hesitant to try to better understand what generators are. But could you write about the benefits? For example, does it make it easier to change the exporters?

poikilotherm commented 7 months ago

Currently, for DataCite we use a template approach combined with XML processing. For DDI we use, AFAIK, an XML-only processing approach. For our JSON-based exports we mostly use JSON processing.

The point is: all of this is handcrafted. The implementation is done by us, and we need to make sure the serialized output matches the specifications involved. We also provide the mapping from our internal model to the target model with these serializers.

When using generators, parts of the process are turned upside down. You start with the spec (XML XSD, JSON Schema, OpenAPI...) and use a tool to generate model classes from it.

The result is a set of classes that can be serialized to the target output format using the data binding mechanisms included in the Jakarta standards. Beyond that, these classes can also be used for the inverse process: deserialization from some data to the model. An example would be importing DataCite XML from OAI-PMH: use the data binding to get a populated Java model of the data.

As the model classes are generated from the spec, they are known to cover all of the spec. We might not use all of the available modeling, but at least we can extend our usage without much hassle.

As long as the generator tools don't make mistakes, the data binding will always produce valid output data and will always map correct input data back to the model.

Using our own implementations for de-/serialization requires extensive testing and also a lot of manual work to implement every change.

The availability of schemas and model classes for them allows much stricter enforcement of data validity at compile time and at runtime. Constraints about the data from the spec are carried over into the data model, allowing for simpler interaction with the model from code, with the Java compiler assisting you in building it. Example: most generators will allow you to create a Fluent API for the model.
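To illustrate the idea, here is a minimal Java sketch of a fluent model class together with a mapper from an internal model. Everything in it is an assumption for illustration: `Resource` is a hand-written stand-in for a class a generator might derive from the DataCite XSD, `InternalDataset` is a hypothetical internal model, and neither matches real Dataverse or generated code.

```java
import java.util.ArrayList;
import java.util.List;

public class GeneratedModelSketch {

    // Hand-written stand-in for a generated model class (hypothetical).
    static final class Resource {
        private String identifier;
        private final List<String> titles = new ArrayList<>();
        private Integer publicationYear;

        // Fluent setters return "this" so calls can be chained.
        Resource withIdentifier(String identifier) {
            this.identifier = identifier;
            return this;
        }

        Resource withTitle(String title) {
            this.titles.add(title);
            return this;
        }

        Resource withPublicationYear(int year) {
            this.publicationYear = year;
            return this;
        }

        String getIdentifier() { return identifier; }
        List<String> getTitles() { return titles; }
        Integer getPublicationYear() { return publicationYear; }
    }

    // Hypothetical internal metadata model, reduced to three fields.
    static final class InternalDataset {
        final String persistentId;
        final String title;
        final int yearPublished;

        InternalDataset(String persistentId, String title, int yearPublished) {
            this.persistentId = persistentId;
            this.title = title;
            this.yearPublished = yearPublished;
        }
    }

    // The mapper part: internal model in, target model out.
    static Resource toDataCite(InternalDataset dataset) {
        return new Resource()
                .withIdentifier(dataset.persistentId)
                .withTitle(dataset.title)
                .withPublicationYear(dataset.yearPublished);
    }

    public static void main(String[] args) {
        Resource resource = toDataCite(
                new InternalDataset("doi:10.1234/example", "Example dataset", 2024));
        System.out.println(resource.getIdentifier() + " " + resource.getTitles());
    }
}
```

With real generated classes, the mapper would be the only hand-written part, and a data binding library would handle serialization of the populated model.
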

For the exporters, having schemas around (and I'm talking about more than just DataCite) will also allow for a more clearly defined data exchange between the core application and plugged-in exporters. The model classes provide Data Transfer Objects as a by-product.

Also, upgrading schemas becomes easier. We can include a generated data model for any version of a schema. If we want to change the supported schema version, the Java code can help us determine what to change and how. It's much clearer in code what is supported and what isn't. Changing a version means changing the import path for the classes.

Brain dump out.

jggautier commented 4 months ago

@cmbz asked me to add a status update to this GitHub issue. There's discussion and related work in the pull request at https://github.com/IQSS/dataverse/pull/10615 that addresses at least some of what's been proposed in this GitHub issue.

DS-INRAE commented 4 months ago

Another related issue: #2917

cmbz commented 3 months ago

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

pdurbin commented 3 months ago

This issue has an open PR (#10615), so I'm reopening it. It'll be closed when we merge it.

pdurbin commented 2 months ago

We're now using this PR instead to close this issue: #10632

jggautier commented 2 months ago

Thanks for the heads up @pdurbin. I'm going to keep this issue open, or I guess re-open it after that PR is merged, so that I can see what decisions were made and what goals and questions aren't addressed yet.

pdurbin commented 2 months ago

@jggautier sounds good. Perhaps we can create a new issue with any remaining items.
