IQSS / dataverse

Open source research data repository software
http://dataverse.org

As a curator, I want to more easily add metadata about related resources so that it's more discoverable #5277

Open jggautier opened 5 years ago

jggautier commented 5 years ago

While working on exporting more dataset metadata in Schema.org (https://github.com/IQSS/dataverse/issues/4371), the team talked about the best way to include the URLs of related publications. (In Schema.org, it's best practice to use a URL, instead of an ID number, for related publications.) But not all depositors/curators enter related publication metadata the way we expect (the way the fields have been designed), which has led and will continue to lead to significant curation work to correct the metadata so that it's exported in different metadata formats (I'm thinking of Schema.org and DataCite).

This is what the Dataverse 4 relatedPublication compound fields look like: [screenshot]

For the Schema.org issue (https://github.com/IQSS/dataverse/issues/4371), we decided to use what's entered in the URL field.

But there are plenty of cases where:

So not all possible related publication metadata will be exported, and discoverable in other systems, without curation work to update the metadata after it's been published.

For the DataCite schema, Dataverse needs to know the ID Types and identifiers of related publications, other datasets and software. For Schema.org, Dataverse needs to know the URLs of related publications. (As of this issue, there's no recommended way to include metadata about other related datasets and software in Schema.org.)

For other datasets and software, Dataverse doesn't have fields that are meant to collect URLs, ID Types or ID Numbers:

Field for related software used to create the data (this might change with a planned software metadata block): [screenshot]

Field for related datasets: [screenshot]

Can we improve the metadata fields (such as the number of fields, how they're labelled and described) and/or how Dataverse uses what's entered in metadata fields (such as text parsing, getting more information from external sources) in order to reduce the curation and training needed now to make sure Dataverse is more capable of exposing information about related resources?

jggautier commented 5 years ago

This is the metadata that the DataCite 4.1 schema accepts for relatedIdentifiers:

<relatedIdentifier resourceTypeGeneral="Dataset" relatedIdentifierType="DOI" relationType="IsIdenticalTo">10.25377/sussex.7133897</relatedIdentifier>

(There are small controlled vocabs for resourceTypeGeneral, relatedIdentifierType, and relationType (see https://github.com/IQSS/dataverse/issues/2778).)

For the relatedIdentifier value, where the identifier 10.25377/sussex.7133897 is, the schema technically allows anything. We could add the URL form of the DOI instead, i.e. https://doi.org/10.25377/sussex.7133897 and it would be valid against the schema:

<relatedIdentifier resourceTypeGeneral="Dataset" relatedIdentifierType="DOI" relationType="IsIdenticalTo">https://doi.org/10.25377/sussex.7133897</relatedIdentifier>

But can DataCite use the URL form (e.g. for exposing related identifiers for MakeDataCount (https://github.com/IQSS/dataverse/issues/4821))? Or are they expecting only the identifier?
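To make the two forms discussed above concrete, here is a minimal sketch that converts between a bare DOI and its https://doi.org/ URL form and builds a relatedIdentifier element with Python's standard library. The function names are hypothetical and not part of Dataverse, and the sketch does not answer which form DataCite expects; it only shows that the two forms are mechanically interchangeable.

```python
import xml.etree.ElementTree as ET

DOI_RESOLVER = "https://doi.org/"

# Prefixes that commonly wrap a DOI; the list is an assumption, not exhaustive.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def to_bare_doi(value: str) -> str:
    """Strip a known resolver or 'doi:' prefix, leaving the bare identifier."""
    cleaned = value.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            return cleaned[len(prefix):]
    return cleaned


def to_doi_url(value: str) -> str:
    """Return the URL form of a DOI, whichever form was passed in."""
    return DOI_RESOLVER + to_bare_doi(value)


def related_identifier(value: str, relation_type: str,
                       resource_type: str = "Dataset") -> ET.Element:
    """Build a relatedIdentifier element that carries the bare DOI as its text."""
    element = ET.Element("relatedIdentifier", {
        "resourceTypeGeneral": resource_type,
        "relatedIdentifierType": "DOI",
        "relationType": relation_type,
    })
    element.text = to_bare_doi(value)
    return element


xml_text = ET.tostring(
    related_identifier("https://doi.org/10.25377/sussex.7133897", "IsIdenticalTo"),
    encoding="unicode",
)
```

Storing the bare identifier and deriving the URL on export would let Schema.org get the URL form while DataCite gets whichever form it turns out to prefer.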

amberleahey commented 5 years ago

Just had a conversation with a user who would also like to see the related dataset field be more structured to support discovery and readability online (as the related publication one is structured). Currently they are adding HTML tags to format this, but it could be supported directly in the form. Thanks for considering!

pdurbin commented 5 years ago

@amberleahey this is great feedback. Thanks. Please see also some recent discussion about "Related Dataset" in the context of Make Data Count at https://github.com/IQSS/dataverse/issues/4821#issuecomment-440756877

jggautier commented 4 years ago

"Related material" is another unstructured metadata field used to record how another research object is related to the dataset. I've been looking into how people depositing datasets in Dataverse installations use that field (what information they put in it), and what "related material" metadata Harvard Dataverse stores when it imports DDI metadata from non-Dataverse repositories.

DDI's definition of related material sounds like a notes field:

Describes materials related to the study description, such as appendices, additional information on sampling found in other documents, etc. Can take the form of bibliographic citations. This element can contain either PCDATA or a citation or both, and there can be multiple occurrences of both the citation and PCDATA within a single element. May consist of a single URI or a series of URIs comprising a series of citations/references to external materials which can be objects as a whole (journal articles) or parts of objects (chapters or appendices in articles or documents).

Example

<relMat> Full details on the research design and procedures, sampling methodology, content areas, and questionnaire design, as well as percentage distributions by respondent's sex, race, region, college plans, and drug use, appear in the annual ISR volumes MONITORING THE FUTURE: QUESTIONNAIRE RESPONSES FROM THE NATION'S HIGH SCHOOL SENIORS.</relMat> <relMat>Current Population Survey, March 1999: Technical Documentation includes an abstract, pertinent information about the file, a glossary, code lists, and a data dictionary. One copy accompanies each file order. When ordered separately, it is available from Marketing Services Office, Customer Service Center, Bureau of the Census, Washington, D.C. 20233. </relMat> <relMat>A more precise explanation regarding the CPS sample design is provided in Technical Paper 40, The Current Population Survey: Design and Methodology. Chapter 5 of this paper provides documentation on the weighting procedures for the CPS both with and without supplement questions.</relMat>

I've found so far that people who use that field often enter related research objects that could be described as one of DataCite's resourceTypes (pg. 43) and aren't as wordy about describing the resource: https://dataverse.harvard.edu/dataverse/harvard?q=relatedMaterial:*.

What's entered in related material is also less directly related to the dataset, maybe because the tooltip for related publication is "Publications that use the data from this dataset."

marcomarsella commented 4 years ago

I manage a system assigning DOIs to Plant Genetic Resources (PGRs). Many of our stakeholders deposit datasets to Dataverse. It would be very useful to be able to cite the DOIs of the PGRs to which a dataset refers because this connection would be fed to EventData, a service that we already use to discover data citation of our DOIs from papers and other publications. The cited DOIs should be put in the <relatedIdentifiers> element of the metadata schema with relationType="References". This would make DataCite automatically record the citation in EventData.
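As a rough sketch of the request above, the following builds a relatedIdentifiers wrapper in which every cited DOI gets relationType="References". The function name and the example DOIs are placeholders invented for illustration; the optional resourceTypeGeneral attribute is left out for brevity.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def build_related_identifiers(dois: Iterable[str],
                              relation_type: str = "References") -> ET.Element:
    """Wrap each DOI in a relatedIdentifier child of one relatedIdentifiers element."""
    wrapper = ET.Element("relatedIdentifiers")
    for doi in dois:
        child = ET.SubElement(wrapper, "relatedIdentifier", {
            "relatedIdentifierType": "DOI",
            "relationType": relation_type,
        })
        child.text = doi
    return wrapper


# Placeholder DOIs, not real Plant Genetic Resource identifiers.
pgr_dois = ["10.1234/example-pgr-1", "10.1234/example-pgr-2"]
xml_text = ET.tostring(build_related_identifiers(pgr_dois), encoding="unicode")
```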

jggautier commented 4 years ago

That was fast. Thanks @marcomarsella (for following up from our conversation over email)!

There's been discussion about which of the many relationTypes to use (some of the discussion is in the issue at https://github.com/IQSS/dataverse/issues/2778) and I think that would have to be resolved.

It sounds like you're advocating for always using relationType="References", regardless of the type of related research object. Is that right? Is the system you manage picking up related DOIs only when the relationType is "References"? Would it be okay to use a "one-way" relation or edge to describe every relationship between two datasets?

I forgot to mention that there are some Dataverse repositories that are already sending some of this information to DataCite (the two I'm aware of are QDR and Repositorio de Datos del Consorcio Madroño). But each is using one relationType to indicate a relationship between the datasets they publish and related text-based research objects (mostly journal articles and reports), so using the same relationType makes more sense to me.

I pointed to this issue because it's about all of the types of information Dataverse's metadata fields need to capture but aren't (at least not in a structured way), like persistent IDs. So Dataverse would need more fields, need to be able to parse persistent IDs from unstructured text, or somehow get information from other systems about datasets that are related to the datasets published by the Dataverse repository.
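One of the options above, parsing persistent IDs out of unstructured text, could look roughly like the sketch below. The regular expression is a common heuristic for modern DOIs and is an assumption here, not something Dataverse uses; it will miss unusual DOIs and may trim a DOI that legitimately ends in punctuation.

```python
import re
from typing import List

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix
# running up to whitespace, quotes, or angle brackets.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

# Characters that usually belong to the surrounding sentence, not the DOI.
TRAILING_PUNCTUATION = ".,;:)"


def extract_dois(free_text: str) -> List[str]:
    """Return the DOI-like strings found in free text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.finditer(free_text):
        found.append(match.group(0).rstrip(TRAILING_PUNCTUATION))
    return found


sample = "See Smith (2019), https://doi.org/10.1234/abc.def, and doi:10.5555/xyz-1."
dois = extract_dois(sample)
```

Anything extracted this way would still need a curator's review, since the text alone does not say how the resource relates to the dataset.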

marcomarsella commented 4 years ago

The operators that EventData would capture are listed in https://support.datacite.org/docs/relationtype_for_citation

However, to avoid confusion and to facilitate querying, I think it would be good to stick to one relation operator only.

M



pkiraly commented 2 years ago

@jggautier @pdurbin We have a similar feature request from a Göttingen research group, Digital Geochemistry Infrastructure (DIGIS). I would like to work on this feature, but I am unsure which ticket is the best one. There are some parallel tickets, which are mentioned in this ticket or which mention it. I haven't seen any commit messages or pull requests in this direction. Are you OK if I start working on it?

The implementation doesn't seem to be rocket science at first sight: all we need to do is modify a single method, DOIDataCiteRegisterService.generateRelatedIdentifiers(), and implement XML generation out of fields such as

applying the appropriate Datacite attributes.

qqmyers commented 2 years ago

FWIW: #2778 has lots of discussion about how various Dataverse fields could be mapped to various DataCite relationship types. As you say, implementation isn't hard (see https://github.com/QualitativeDataRepository/dataverse/blob/8a597dbd8c1f5e28f1cf7403efdba642d2bc6751/src/main/java/edu/harvard/iq/dataverse/DOIDataCiteRegisterService.java#L578-L660 for example), but finding mappings that work for everyone (or are configurable, etc.) is the challenge.

poikilotherm commented 2 years ago

As this is related to #8108 and my own #7077 and maybe #7844, I was wondering if this would be a good opportunity to start creating data model classes + DTOs and use JAXB annotations to transform into XML (+JSON via JSON-B, ...) 🤔

The current template-based approach might not be extensible/configurable enough for the future.

(If this is out of scope for you @pkiraly I might bring this to the table again later when striving for software deposition support)

Edit: This might be also useful for the creation of transformations/converters like @qqmyers suggested. Maybe it would even be a foundation to allow importing from other metadata schemas than just Dataverse-JSON or DDI...

Edit 2: Haha, look at what gems are out there. https://github.com/CNES/DOI-server/tree/master/server/src/main/java/org/datacite/schema/kernel_4

pkiraly commented 2 years ago

@qqmyers @poikilotherm If there are different use cases which require different mappings, then we should make it configurable instead of waiting long (this ticket is more than 3 years old) for a consensus, right? So we can add some resource files with mappings, and a config option which can be used to switch to another existing resource or to a file outside of the dataverse.war.
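The configurable-mapping idea above can be sketched in a few lines. This assumes a JSON override file and uses hypothetical default values; the field names and relation types shown are illustrative, not an agreed mapping.

```python
import json
from pathlib import Path
from typing import Dict, Optional

# Hypothetical shipped defaults: Dataverse field name -> DataCite relationType.
DEFAULT_RELATION_TYPES: Dict[str, str] = {
    "publication": "IsCitedBy",
    "relatedDatasets": "References",
}


def load_relation_types(override_path: Optional[str] = None) -> Dict[str, str]:
    """Start from the defaults and apply any entries found in an external JSON file."""
    mapping = dict(DEFAULT_RELATION_TYPES)
    if override_path:
        overrides = json.loads(Path(override_path).read_text(encoding="utf-8"))
        mapping.update(overrides)
    return mapping
```

An installation that prefers a different relation for publications would then only need a one-line override file such as {"publication": "Cites"}, leaving the other defaults untouched.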

But if we use JAXB approach (which I prefer) with annotation, how can we do it dynamically? Is there a way for that other than reflection methods? My knowledge is not that deep about JAXB, but maybe you have an idea. If not, I will investigate it.

poikilotherm commented 2 years ago

I'd separate concerns. As there might be different versions of these schemas (as is the case with DataCite), it would be handy to represent these fine details within the view layer of a DTO which is serialized or deserialized via JAXB/JSON-B. (EclipseLink MOXy, a JAXB implementation, has some nice stuff for this)

The mapping between the DTO and a Dataset object (with access to its metadata objects wrapped in their own classes etc) might get complicated, so I'd rather use a static class plus interfaces.

Using an interface from the start opens up to load custom converters via plugins later. We might ship a few with Dataverse and leave it to the admin to provide more / override.
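The interface-plus-pluggable-converters idea above is language independent; a minimal sketch of its shape follows, written in Python rather than the Java that Dataverse uses. Every name here (the protocol, the registry, the dictionary keys) is hypothetical.

```python
from typing import Dict, List, Protocol


class RelatedIdentifierConverter(Protocol):
    """Anything that can turn a dataset's metadata into related-identifier records."""

    def convert(self, dataset: dict) -> List[dict]:
        ...


class PublicationConverter:
    """Converter that could ship by default: one record per publication with an ID."""

    def convert(self, dataset: dict) -> List[dict]:
        records = []
        for publication in dataset.get("publications", []):
            if publication.get("idNumber"):
                records.append({
                    "identifier": publication["idNumber"],
                    "identifierType": publication.get("idType", "DOI"),
                    "relationType": "IsCitedBy",
                })
        return records


# Registry keyed by name so an admin-provided converter can override a shipped one.
CONVERTERS: Dict[str, RelatedIdentifierConverter] = {
    "publication": PublicationConverter(),
}


def register(name: str, converter: RelatedIdentifierConverter) -> None:
    CONVERTERS[name] = converter


def convert_all(dataset: dict) -> List[dict]:
    """Run every registered converter and concatenate the results."""
    records: List[dict] = []
    for converter in CONVERTERS.values():
        records.extend(converter.convert(dataset))
    return records
```

Because callers only depend on convert(), replacing the shipped converter or adding one for another field does not touch the export code.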

(It would also be great to make the DTOs pluggable, so one could add new metadata formats, i.e. CodeMeta, more easily...)

Maybe it's time to create a new issue for this and write up a concept before starting any coding? Again, this would make many tasks for HERMES easier to achieve, so I can add dev time to this.

pdurbin commented 2 years ago

Are you OK if I start working on it?

@pkiraly as a fan of rough consensus and running code I'd say if you have the cycles to put together a small-ish pull request, please go ahead. I'd time box it to a few days and please check in if you have any questions or concerns along the way. Also, you're very welcome to create a new, fresh issue that explains the scope of what you have in mind. If nothing else, hopefully we'll all learn something from what you put together.

jggautier commented 2 years ago

Hey @poikilotherm. It's tough for me to follow the recent conversation, but it seems to be about making it easier for Dataverse repositories to decide how related resources ("publications," datasets, etc.) are related to the deposited dataset, using or mapping to one of DataCite's relationType values. Is that right?

In the DataCite schema, DataCite's RelatedIdentifier property wants only "globally unique identifiers", and the issue with most of the related fields that ship with the Dataverse software is that they are free text fields. The form doesn't ask depositors to enter only identifiers. In your earlier comment, you wrote that you could "implement XML generation out of fields". Does this mean implementing some way of detecting if what's entered in the free text fields contains an ID?

pkiraly commented 2 years ago

@jggautier @pdurbin I have created a first pull request to export Related publications into DataCite metadata.

@poikilotherm @qqmyers The implementation is not optimal, I do not use JAXB or configuration, and it exports only one additional compound field. I haven't found a better way to fetch a particular field from the dataset than iterate over the fields, and check if the field name is equal to what we are looking for, but maybe there is a better, direct method.
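To illustrate the lookup question above: the sketch below contrasts the linear scan described in the comment with building a name-to-field index once. It is written against a simplified list of dictionaries using a "typeName" key, which is an assumption for illustration, not the actual Dataverse Java object model.

```python
from typing import Dict, List, Optional


def find_field(fields: List[dict], type_name: str) -> Optional[dict]:
    """Linear scan: compare each field's name with the one we are looking for."""
    for field in fields:
        if field.get("typeName") == type_name:
            return field
    return None


def index_fields(fields: List[dict]) -> Dict[str, dict]:
    """Build the index once; later lookups by name are direct."""
    index: Dict[str, dict] = {}
    for field in fields:
        # Keep the first occurrence so the result matches find_field().
        index.setdefault(field.get("typeName"), field)
    return index


# Simplified, made-up field list.
fields = [
    {"typeName": "title", "value": "Example dataset"},
    {"typeName": "publication", "value": [{"publicationIDNumber": "10.1234/example"}]},
]
by_name = index_fields(fields)
```

For exporting a single compound field the scan is fine; an index only pays off once several fields are read from the same dataset.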

Please check it, and tell me your suggestions.

BTW: Do you happen to know which version of DataCite we should support?

jggautier commented 2 years ago

The "DataCite" export references DataCite 4, while the OpenAIRE export references DataCite 4.1. We could look into all of the changes made between 4.1 and the latest 4.4. Maybe it would be faster in the short term to have Dataverse export a DataCite XML doc of a dataset that has all metadata fields filled, see if it validates against the latest version's XSD, DataCite 4.4, or see how it doesn't and why, figure out what changes would be needed to make it validate and if/when those changes could be made.

In addition to @qqmyers's insight into how QDR adds related publication metadata to its exports, would it be helpful to see code-wise how related publication metadata is included in Dataverse's OpenAIRE export?

Are you recommending the use of "Cites" as the relationType for all related publications? I think we'd want the other direction, IsCitedBy, which is what the OpenAIRE export uses. (Sorry if I'm misinterpreting what I saw in the Java file. I don't know Java but just saw "Cites" in the code.)

pkiraly commented 2 years ago

@jggautier Thanks for the clarification about the version.

Regarding "Cites": you are right, IsCitedBy would probably be more relevant. I discussed it with our user. She wrote:

I can see occasions for both “IsCitedBy” and “Cites”. Is it possible to allow both? “IsCitedBy” only applies to the publication that is directly associated with the data. In our specific case we are dealing with new data compilations that use previously published data – i.e. the new compilation “cites” a number of previous publications (data and/or manuscripts). “IsCitedBy” would be wrong for these previous publications. An alternative in this instance would be to use “IsDerivedFrom”. However, at the moment this relationType is not recognised by Scholix and the citation would then not appear on the relevant external publisher’s page (this is the last I heard at least, I do not know at which point of the chain this happens).

This would lead to another related feature request. Right now the user interface of Dataverse does not allow users to select the type of relationship between the dataset and the related publication. To implement this we need an extra field describing the relation, so that the user could select it from a fixed list. However, this approach requires a number of changes, because we would have to modify the underlying data structure, its internal handling, and the UI, and we would also have to check the consequences for the APIs. If the community accepts this idea, it would take quite an effort.

So I think that as a first step we just change Cites to IsCitedBy, and making it selectable (or configurable) would be a second step.
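The direction question discussed above (Cites versus IsCitedBy) comes down to which side of the link the statement is made from. A small sketch, using a few of DataCite's paired relationType values; the helper name is hypothetical and the list of pairs is deliberately partial.

```python
from typing import Dict, List, Tuple

# A subset of DataCite relationType pairs; each value is the other's inverse.
RELATION_PAIRS: List[Tuple[str, str]] = [
    ("Cites", "IsCitedBy"),
    ("References", "IsReferencedBy"),
    ("IsSupplementTo", "IsSupplementedBy"),
    ("IsDerivedFrom", "IsSourceOf"),
]

INVERSE: Dict[str, str] = {}
for forward, backward in RELATION_PAIRS:
    INVERSE[forward] = backward
    INVERSE[backward] = forward


def seen_from_other_resource(relation_type: str) -> str:
    """Return the relationType the related resource would use to point back."""
    return INVERSE[relation_type]


# A dataset used by an article says "IsCitedBy"; the article's side is "Cites".
article_side = seen_from_other_resource("IsCitedBy")
```

A selectable relation field would let the depositor pick either member of a pair, which covers both the "article that uses this data" case and the "compilation that cites earlier publications" case.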

jggautier commented 2 years ago

Changing Cites to IsCitedBy sounds good to me.

The goal of sending metadata about related resources to DataCite predates even this 2018 GitHub issue and I think the goal hasn't been realized because allowing depositors/curators to choose the type of relationship between the dataset and any type of resource (instead of only text-based resources like articles) will take more work from more members of the Dataverse community and that just hasn't been prioritized.

What your user wrote gave me the impression that they think the current "Related Publication" field can be used to record information about other data. Is that what the user thinks? The things entered in Dataverse's "Related Publication" fields are meant only for text based resources, like articles. This is because the field is equivalent and maps to DDI Codebook's relPubl element, whose description mentions only "articles and reports based on the data in this collection". The change being proposed to the field's tooltip (as part of our work on improving the tooltips of fields in the Citation metadatablock, https://github.com/IQSS/dataverse/issues/8127) hopefully clarifies this restriction. Dataverse's "Related Dataset" field, which also maps to a DDI Codebook element, is for related data.

marcomarsella commented 2 years ago

I appreciate the discussion but, unless I have misread something, it seems to me that attention is focused on related publications while nothing is being said about referencing the DOIs of other entities. In our case, for instance, it would be great if it were possible to list DOIs assigned to Plant Genetic Resources (PGRs) to which the dataset refers. In this case, the operator to use is clearly "References", and submitting the metadata to DataCite would cause the relationship between the dataset and the PGRs to be tracked by PIDGraph. Is there any chance of you guys considering this?

jggautier commented 2 years ago

Hi @marcomarsella. I agree, the recent discussion has been about related text-based publications like articles. I opened this GitHub issue and used the term "related resources" to express the need for depositors to describe the relationship between the dataset being deposited and any type of resource.

I think that considering your use-case will involve the research and design work that @pkiraly mentioned could be done in future steps.

amberleahey commented 2 weeks ago

Hi all, are there any updates on Related Publication and relation type? @qqmyers mentioned on a call recently QDR is working on this, would love to know more details and very excited to see this get into the core code! :)

jggautier commented 2 weeks ago

Hi @amberleahey! The Dataverse UX working group will be working on this, potentially as one of the design sprints we've been planning in order to make changes to Dataverse in more collaborative, user-centered and timely ways. The working group's charter has more links about this, too.

Part of this issue and I think something that needs to be considered in the GitHub issue you opened is about describing how resources, like other datasets, are related to what's being deposited in a Dataverse installation. The NIH GREI group that Harvard Dataverse is a part of has said that they plan to do research about this in order to make recommendations about how repositories should describe these relationships and how they should send this metadata to DataCite. Involving the NIH GREI group would be great since DataCite itself and other repositories like Zenodo and OSF will be working on the recommendations and then following them, increasing interoperability among many more repositories.

Depending on the timing of that NIH GREI group's work, I hope it will help inform what we do for Dataverse. Or it might be the other way around, that the Dataverse community, through the work of the UX WG, decides how best to help depositors describe related resources and how to send that metadata to DataCite, and then we take those insights to the NIH GREI group in order to evaluate and influence recommendations that will be followed more broadly, including by the other GREI repository members like Zenodo and OSF.