IQSS / dataverse

Open source research data repository software
http://dataverse.org

Access Rights metadata in OpenAIRE metadata export is being misapplied #5920

Open jggautier opened 5 years ago

jggautier commented 5 years ago

As part of v4.14 (released in May 2019), Dataverse makes available through the UI, API and over OAI-PMH DataCite metadata that complies with OpenAIRE requirements (https://github.com/IQSS/dataverse/issues/4257). Repositories need to follow these requirements in order for their dataset metadata to be made discoverable in OpenAIRE EXPLORE.

The required metadata export, called OpenAIRE in the Dataverse UI and oai_datacite over the API and OAI-PMH, includes one of four Access Rights terms (openAccess, restrictedAccess, closedAccess or embargoedAccess), which come from the info:eu-repo-Access-Terms vocabulary.

Dataverse chooses these terms based on whether or not any dataset files are set to restricted and whether or not people are able to request access to those restricted files using Dataverse's request access feature.

There are datasets in Dataverse repositories whose files are set to restricted, and people cannot request access through Dataverse's request access feature. The OpenAIRE metadata export for these datasets uses closedAccess, even when the dataset metadata indicates that people can request access by some process that happens outside of Dataverse's request access feature, e.g. submitting a DUA or contacting the author.

[Screenshot] This dataset has restricted files and people aren't able to request access through Dataverse's request access feature, so its OpenAIRE metadata indicates that the dataset is closed access. But people are able to request access by filling out a form (Application For The Use of Data), so the dataset isn't really closed access.

When these datasets are harvested by OpenAIRE, because the metadata says they're closedAccess, they'll appear and be searchable as closedAccess, grouped with datasets that are more appropriately labelled closedAccess, even though file access is only restricted. This may make these datasets harder to find and use, making OpenAIRE EXPLORE less effective for finding datasets published by Dataverse repositories.
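To check which Access Rights term a given export actually carries, something like the following might help. It is only a sketch: the XML fragment is hand-written for illustration (not copied from a real Dataverse response), and it assumes the term appears as an info:eu-repo/semantics/ URI in a rightsURI attribute of a DataCite rights element.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: a hand-written stand-in for an oai_datacite
# export, not a real Dataverse response.
SAMPLE_EXPORT = """<resource xmlns="http://datacite.org/schema/kernel-4">
  <rightsList>
    <rights rightsURI="info:eu-repo/semantics/closedAccess"/>
    <rights rightsURI="https://creativecommons.org/publicdomain/zero/1.0/">CC0 1.0</rights>
  </rightsList>
</resource>"""

ACCESS_PREFIX = "info:eu-repo/semantics/"


def access_rights_terms(xml_text):
    """Return every info:eu-repo access term found in rightsURI attributes."""
    root = ET.fromstring(xml_text)
    terms = []
    for element in root.iter():
        # ElementTree writes tags as "{namespace}name"; keep only the name.
        local_name = element.tag.rsplit("}", 1)[-1]
        uri = element.get("rightsURI", "")
        if local_name == "rights" and uri.startswith(ACCESS_PREFIX):
            terms.append(uri[len(ACCESS_PREFIX):])
    return terms


print(access_rights_terms(SAMPLE_EXPORT))  # ['closedAccess']
```

Running this over the exports of datasets with restricted files would be one way to list which ones are currently reported as closedAccess.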

We can think of ways for Dataverse to assign Access Rights terms that the Dataverse community thinks are more appropriate (e.g. Zenodo depositors choose from a drop-down menu). But other data publishers are using these Access Rights terms (or those terms are being applied to the harvested datasets) in a variety of ways that can make the Access Rights filters unhelpful for searching through OpenAIRE EXPLORE. "Open data" already means many different things to different groups. Since these Access Rights terms are used for the benefit of finding data in OpenAIRE EXPLORE, the scope of this issue might involve learning how OpenAIRE might want to improve the definitions and how repositories can use them in more standardized ways.

jggautier commented 5 years ago

I wonder if it might be safe to never use "Closed Access", use "Restricted Access" for datasets that have restricted files, and use "Open Access" for all other datasets. Does anyone ever publish datasets whose files can't be accessed at all?

If so, it might help if Dataverse allowed depositors to indicate, in a standardized and machine-readable way, that access to restricted files can be requested (even if people need to request access outside of Dataverse's request access feature) or cannot be requested through any means.
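As a rough sketch of the difference, here is the dataset-level choice as this thread describes it next to the simplification suggested above. The function and parameter names are hypothetical, this is not taken from Dataverse's code, the openAccess branch of current_term is an assumption, and embargoes are left out.

```python
def current_term(has_restricted_files, request_access_enabled):
    """Dataset-level term as described in this thread (an assumption, not Dataverse code)."""
    if not has_restricted_files:
        # Assumed: datasets without restricted files are reported as open.
        return "openAccess"
    if request_access_enabled:
        return "restrictedAccess"
    # Restricted files and no in-app request access: reported as closed.
    return "closedAccess"


def proposed_term(has_restricted_files):
    """The suggestion above: never emit closedAccess."""
    return "restrictedAccess" if has_restricted_files else "openAccess"


# The problem case: restricted files, access only obtainable outside Dataverse.
print(current_term(True, False))  # closedAccess
print(proposed_term(True))        # restrictedAccess
```

The two functions only disagree when files are restricted and the request access feature is off, which is the case this issue is about.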

cmbz commented 3 months ago

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

philippconzett commented 2 months ago

I only recently became aware of this issue. I think resolving this issue eventually depends on #4391 being resolved first. Thus, to me, it seems the Access Rights terms used by OpenAIRE and others (e.g., BASE Bielefeld) depend on Terms of Use being defined at file-level.

With support for file-level Terms of Use implemented, I think things would work like this: At the metadata record level, that is, for the registered metadata at dataset or file-level, the metadata should always be licensed with CC0 and thus have the Access Rights term defined as "Open access". At file-level, all of the values can be used, based on the Terms of Use of the individual file in question:

jggautier commented 2 months ago

@pdurbin and I talked about this issue in relation to https://github.com/IQSS/dataverse/pull/10737 and https://github.com/IQSS/dataverse/issues/8129. And I agreed that I'd open a new GitHub issue about dc:rights specifically, to help manage these different goals and scopes.

But @philippconzett, what do you think of using this GitHub issue instead, since we're already talking about the use of these "Access Rights terms used by OpenAIRE and others (e.g., BASE Bielefeld)"?

I could re-word this GitHub issue's title so it's clear that the issue is about all uses of these "Access Rights" terms, and edit the first comment for the same reason.

pdurbin commented 2 months ago

I wanted to link to something so I went ahead with the idea that this issue represents the unfinished dc:rights work that was originally part of the scope of #8129, which (if all goes well) will be closed by PR #10737.

The next challenge will be to size it, of course, and figure out what the plan is and when. 😅

philippconzett commented 2 months ago

@jggautier @pdurbin Thanks for moving this forward. I think both approaches could work, that is, either continuing to use this issue or creating a new one.

jggautier commented 2 months ago

Thanks. https://github.com/IQSS/dataverse/issues/4176 is also about changes to what's included in dc:rights and we'll need to consider the points raised there, too.

Next week I'll try to find time to help think about either using this GitHub issue or creating a new one, but with other projects and work travel next week, I'm not sure. I definitely don't have time this week.

jggautier commented 3 weeks ago

So I definitely didn't have time "next week" lol. I'm going to try to sneak some time in today to continue the discussion.

I'm going to keep using this GitHub issue for discussion about how access rights metadata in OpenAIRE metadata is being misapplied.

@philippconzett, I have questions and comments about what you wrote:

I only recently became aware of this issue. I think resolving this issue eventually depends on #4391 being resolved first. Thus, to me, it seems the Access Rights terms used by OpenAIRE and others (e.g., BASE Bielefeld) depend on Terms of Use being defined at file-level.

With support for file-level Terms of Use implemented, I think things would work like this: At the metadata record level, that is, for the registered metadata at dataset or file-level, the metadata should always be licensed with CC0 and thus have the Access Rights term defined as "Open access". At file-level, all of the values can be used, based on the Terms of Use of the individual file in question:

  • openAccess: If the file is not set to restricted or embargoed, the metadata export at file-level should use "openAccess".
  • restrictedAccess: If the file is set to restricted and the option to request access is enabled (people are allowed to request access using Dataverse's request access feature), the metadata export at file-level should use "restrictedAccess".
  • closedAccess: If the file is set to restricted and the option to request access is disabled, the metadata export at file-level should use "closedAccess".
  • embargoedAccess: If the file is set to embargoed, the metadata export at file-level should use "embargoedAccess".
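The four file-level rules above could be sketched roughly like this. FileState and its fields are hypothetical names, and checking the embargo first is an assumption, since the list does not say which term wins when a file is both restricted and embargoed.

```python
from dataclasses import dataclass


@dataclass
class FileState:
    """Hypothetical stand-in for the access settings of one file."""
    restricted: bool = False
    request_access_enabled: bool = False
    embargoed: bool = False


def file_access_term(f):
    """Map one file's settings to an Access Rights term, following the list above."""
    if f.embargoed:
        # Assumption: an embargo takes precedence over the restricted flag.
        return "embargoedAccess"
    if f.restricted:
        return "restrictedAccess" if f.request_access_enabled else "closedAccess"
    return "openAccess"


print(file_access_term(FileState()))                 # openAccess
print(file_access_term(FileState(restricted=True)))  # closedAccess
```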

OpenAIRE uses their OpenAIRE standard to determine if a dataset is openAccess, restrictedAccess, closedAccess or embargoedAccess.

It sounds like you're proposing that the OpenAIRE XML exports of datasets would always indicate that the metadata of the dataset is CC0 and "openAccess". Am I understanding that right?

If so, as far as I know, the OpenAIRE standard doesn't have a way to indicate the terms or license of the metadata. As we know, it includes a way to indicate the license or terms of the data that the metadata describes, and I think that's all it can do.

And I think that being able to describe the terms or license of the metadata of the dataset is out of scope here. OpenAIRE's system wants to know the access level of the data in the dataset, using just one of those four access levels. This GitHub issue is about challenges with providing that information to OpenAIRE. The use case I described in this issue's first post assumes that a dataset can be usefully described with just one of the four access levels.

But that model doesn't work when one dataset has data with multiple access levels, right? I think that's the gist of your comments. And if we want to resolve that, then I think it means also working with the OpenAIRE folks so that their systems can support searching for datasets by access level when those datasets have multiple access levels because the datasets' files have multiple access levels.
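A tiny sketch of why one term per dataset loses information once files carry their own terms. The list of file-level terms is made up for illustration and the function name is hypothetical.

```python
def dataset_access_levels(file_terms):
    """Return the distinct access levels found across a dataset's files."""
    return sorted(set(file_terms))


# Made-up example: one dataset whose files have different access levels.
mixed_dataset = ["openAccess", "openAccess", "closedAccess", "embargoedAccess"]

levels = dataset_access_levels(mixed_dataset)
print(levels)            # ['closedAccess', 'embargoedAccess', 'openAccess']
print(len(levels) == 1)  # False: no single term describes this dataset
```

Any rule that collapses those three levels into one value has to pick a winner, and a search filter built on that value will then hide the other two.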

Does all of that make sense? I'd like to make sure before we start thinking about solutions.