IQSS / dataverse

Open source research data repository software
http://dataverse.org

Affiliations entered in affiliation fields are parenthesized in "Datacite" and Schema.org exports #9330

Open jggautier opened 1 year ago

jggautier commented 1 year ago

When does this issue occur? When Dataverse creates "Datacite" and Schema.org metadata exports for datasets that have values in a few Affiliation fields in the Citation metadatablock

Which page(s) does it occur on? Metadata exports and OAI-PMH feed

What happens? The affiliation metadata that depositors add to their datasets, e.g. Author Affiliation, Point of Contact Affiliation, Producer Affiliation, appears in the "Datacite" and Schema.org exports wrapped in parentheses.

The "Datacite" export has these affiliation fields:

The Schema.org export has this affiliation field:

To whom does it occur (all users, curators, superusers)? All users. It probably affects search, such as when using facets to narrow search results

What did you expect to happen? The affiliation metadata would appear in the exports without the added parentheses

Which version of Dataverse are you using? 5.12.1

Any related open or closed issues to this bug report? The issues related to using an algorithm to guess if the names entered in the author metadata field are people or organizations: https://github.com/IQSS/dataverse/issues/7349 and https://github.com/IQSS/dataverse/issues/5029. Will the PR to address those issues, https://github.com/IQSS/dataverse/pull/9089, remove the parentheses? I think it might, since the Schema.org exports that QDR's Dataverse fork creates already use the algorithm, and in their Schema.org exports, author affiliations aren't wrapped in parentheses, e.g. their Schema.org export at https://data.qdr.syr.edu/api/datasets/export?exporter=schema.org&persistentId=doi:10.5064/F6G3T1PF

Screenshots:

How the affiliations of the Author, Point of Contact, and Producer fields appear in the Datacite export of the dataset at https://doi.org/10.7910/DVN/MUJHGR (published in Harvard Dataverse):

How the affiliations of the Author field appear in the Schema.org export of the dataset at https://doi.org/10.7910/DVN/MUJHGR (published in Harvard Dataverse):

Definition of done: The affiliation metadata is not wrapped in parentheses when it appears in metadata exports
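One way to check the definition of done against a downloaded Schema.org export is sketched below. It is only an illustration: the function names are hypothetical, and it assumes the export is a JSON object whose `author` or `creator` entries carry an `affiliation` that is either a plain string or an object with a `name` key. Real exports may be shaped differently depending on the Dataverse version.

```python
def _affiliation_names(person):
    """Yield affiliation strings from one author/creator entry.

    Assumption: the affiliation is either a plain string or an
    object with a "name" key.
    """
    affiliation = person.get("affiliation")
    if isinstance(affiliation, str):
        yield affiliation
    elif isinstance(affiliation, dict):
        name = affiliation.get("name")
        if isinstance(name, str):
            yield name


def parenthesized_affiliations(export):
    """Return every author/creator affiliation wrapped in parentheses."""
    found = []
    for key in ("author", "creator"):
        people = export.get(key, [])
        if isinstance(people, dict):
            # Tolerate a single object where a list is expected.
            people = [people]
        for person in people:
            if not isinstance(person, dict):
                continue
            for name in _affiliation_names(person):
                stripped = name.strip()
                if stripped.startswith("(") and stripped.endswith(")"):
                    found.append(name)
    return found


# Hypothetical sample data, not copied from a real export.
sample_export = {
    "author": [
        {"name": "Doe, Jane", "affiliation": "(Example University)"},
        {"name": "Roe, Richard", "affiliation": "Example Institute"},
    ]
}
flagged = parenthesized_affiliations(sample_export)
```

An empty result for an export would suggest that its author affiliations meet the definition of done. The other affiliation fields and the Datacite XML export would need their own checks.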

jggautier commented 5 months ago

This bug with the parentheses exists in the Schema.org exports of older dataset versions but not in the Schema.org exports of more recently published dataset versions:

As far as I can tell, in the Schema.org exports of all datasets published more recently, the author affiliations don't have parentheses.

I think this problem might be related to the discussion in https://github.com/IQSS/dataverse/issues/5144, where we talked about how to make sure that when we make changes to how Dataverse adds metadata to the DataCite metadata export, we ensure that the datasets published before those changes were made have their exports updated.

The same should be true for the Schema.org export and other exports. In the Schema.org export of a dataset published today, we can see changes that were made when v5.13 was applied to Harvard Dataverse. Those changes don't show up in the two dataset exports I mentioned earlier, and they probably don't show up in the exports of many other datasets in Harvard Dataverse whose latest versions were published before v5.13 was applied.

lmaylein commented 2 months ago

This bug still exists in v6.2. Is it possible to fix it? As a result of this bug, the metadata of all DOIs registered with DataCite is also incorrect.

jggautier commented 2 months ago

Hi @lmaylein. Thanks for asking! I think that the more recent work described in the GitHub issue at https://github.com/IQSS/dataverse/issues/5889 will fix this bug. Specifically, the OpenAIRE export doesn't include these parentheses, so in a comment in that GitHub issue I proposed that the merged export also wouldn't include the parentheses around the affiliations of the Author metadata field. And I imagine that parentheses will not be included around the affiliations of the other fields that describe people or organizations, too, such as Point of Contact, Contributor, Producer, and Distributor.

pdurbin commented 2 months ago

> As far as I can tell, in the Schema.org exports of all datasets published more recently, the author affiliations don't have parentheses.

Is the fix to re-export datasets? https://guides.dataverse.org/en/6.3/admin/metadataexport.html#batch-exports-through-the-api

Do we know which PR fixed it, by removing the parentheses (if it is indeed fixed)?
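If re-exporting does turn out to be the fix, a minimal sketch of triggering it from a script might look like the following. The endpoint paths (`/api/admin/metadata/reExportAll` and `/api/admin/metadata/:persistentId/reExportDataset`) are written from memory of the batch-export section of the guide linked above and should be checked against it. The helper names and the `http://localhost:8080` base URL are assumptions, and admin endpoints are typically reachable only from the server itself.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def reexport_all_url(base_url):
    """Build the URL that asks Dataverse to re-export every dataset.

    Assumption: the path matches the admin guide's batch-export section.
    """
    return f"{base_url.rstrip('/')}/api/admin/metadata/reExportAll"


def reexport_dataset_url(base_url, persistent_id):
    """Build the URL that asks Dataverse to re-export one dataset."""
    query = urlencode({"persistentId": persistent_id})
    return (
        f"{base_url.rstrip('/')}"
        f"/api/admin/metadata/:persistentId/reExportDataset?{query}"
    )


def trigger(url):
    """Send the GET request and return the HTTP status code."""
    with urlopen(url) as response:
        return response.status


# Building the URL needs no network access; call trigger(url) to run it.
url = reexport_dataset_url("http://localhost:8080", "doi:10.7910/DVN/MUJHGR")
```

Re-exporting only regenerates the cached export files. It would not by itself change the metadata already registered with DataCite.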

qqmyers commented 2 months ago

Schema.org was fixed in #9089. The problem for DataCite is that the displayValue for affiliation is sent to DataCite - see https://github.com/IQSS/dataverse/blob/a466c97d02e84160c75529b915bda5c664e38ec9/src/main/java/edu/harvard/iq/dataverse/pidproviders/doi/XmlMetadataTemplate.java#L163. I'm addressing it in #10615, #10632 (which need updates), but it could be addressed separately, or ~worked around by removing the parens in the formatting at https://github.com/IQSS/dataverse/blob/a466c97d02e84160c75529b915bda5c664e38ec9/scripts/api/data/metadatablocks/citation.tsv#L13 and resending the metadata to DataCite using the API (and assuming display without parens is OK).
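To illustrate the distinction between a field's raw value and its display value, here is a small sketch. It is not the Dataverse implementation, which is in Java. The `FieldValue` class is hypothetical, and the `(#VALUE)` display format with its `#VALUE` placeholder is assumed from the linked `citation.tsv` line.

```python
from dataclasses import dataclass


@dataclass
class FieldValue:
    """Hypothetical stand-in for a metadata field value."""

    value: str                # what the depositor typed
    display_format: str = ""  # e.g. "(#VALUE)" for an affiliation

    def display_value(self):
        """Return the value as it would be shown in the UI."""
        if not self.display_format:
            return self.value
        return self.display_format.replace("#VALUE", self.value)


affiliation = FieldValue("Example University", "(#VALUE)")

# Sending the display value to an export reproduces the bug...
sent_with_bug = affiliation.display_value()
# ...while sending the raw value keeps the parentheses out.
sent_without_bug = affiliation.value
```

In this picture, the workaround of editing the display format changes what `display_value()` returns everywhere, including the UI, whereas a fix in the export code would switch to the raw value and leave the display untouched.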

pdurbin commented 2 months ago

Oh, the displayValue. Thanks.

Hmm, I assume the parens are there in the displayValue for a reason. That is, we probably shouldn't remove them.

@qqmyers I'm fine with waiting for one of your PRs above. If you address this bug in one of them, please use the normal "closes #9330" syntax so this issue goes through QA.