Support measures of Data and Metadata Quality

djbrooke commented 5 years ago

On the redesigned dataset page (#3404), there's a spot for "Data Quality." We'd like to set up a workflow for curators, replication analysts, and machines to make assertions about the quality of a dataset. This can differ greatly from discipline to discipline. We’ve had good feedback from the community across several issues (#4751, #2119, #565, #924), which I’ve tried to capture as use cases below.

I want to use a framework to assess data quality and show the results of that assessment
I want to show a badge representing that the Data/Metadata has met some criteria or gone through some process
I want to show that the data/metadata has been peer reviewed
I want to crowdsource a quality measure for a dataset
I want to run an external tool that verifies the reproducibility (Code Ocean, for example), and then shows the reproducibility status in Dataverse
Part of the data deposit process/pipeline has a quality check and I'd like for the result of that process to be shown in Dataverse

This issue will focus on proposing a solution to cover as many of these use cases as possible (and others that I'm sure I'm missing) instead of trying to push forward individual issues.

scolapasta commented 4 years ago

One thing we've been recently discussing is what happens if someone uploads from, for example, Code Ocean and gets a badge for reproducibility, then makes edits to the files or path structure. In that case, the files are no longer in the same state they were at deposit time and attempting to reproduce with them in that stay may no longer work.

So, one idea I had is that if / when we have these badges, that we also track the state the files and path structure were in via a checksum. The idea being that when a badge is added for a set of files that are reproducible (which I imagine could be at deposit time from one of these tools), we calculate on the dataverse side one checksum that represents this state (this can be aggregated by doing checksums on a list of the paths of the files and the checksums of the files themselves).

This can serve multiple purposes: 1) if the depositor makes an edit before publish, we can check to see if state of the files / path structure has changed and either a) warn them b) remove the badge 2) if a user attempts to use the files they can calculate the checksum using the same process and confirm. In that way they know the files / path structure are the same as they were when the badge was awarded.

pdurbin commented 2 years ago

Good ideas and a good goal but is a three year old GitHub issue the best place to have this discussion? Should we start a Google doc and have some meetings? Should we write this into a grant? Should we see if a Dataverse installation or two will sponsor (i.e. pay for) this (worthy!) effort? When do we plan to work on this? Who is the champion?

Vote to close.

siacus commented 11 months ago

I think we should reopen the 'badge' feature discussion. The CAFE grant has resources and need to move forward with this.

pdurbin commented 6 months ago

Over at https://dataverse.zulipchat.com/#narrow/stream/375707-community/topic/labels.20for.20published.20dataset/near/435349197 @DS-INRA just wrote:

"Hello community, I was wondering if it was possible to set a label akin to "Incomplete metadata" or curation labels on a published dataset. The use case is be to indicate the datasets that have been curated."

This is more or less the badge discussion. I guess we could discuss more in this issue. Or in Zulip. I still think it might be nice to have a google doc and a conversation. And we should figure out the scope of this issue (or a new dedicated issue) so we can estimate it.

I'm sure this has been discussed in other issues, but I'd be remiss if I didn't point out that the way AJPS datasets like https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/WGWMAV let the world know that the dataset has been curated is by adding the following message and images under the "Notes" field:

"This dataset underwent an independent verification process that replicated the tables and figures in the primary article. For the supplementary materials, verification was performed solely for the successful execution of code. The verification process was carried out by the Odum Institute for Research in Social Science at the University of North Carolina at Chapel Hill.

The associated article has been awarded Open Materials and Open Data Badges. Learn more about the Open Practice Badges from the Center for Open Science."

Here's a screenshot:

DS-INRAE commented 4 months ago

Hi there, has there been news of the CAFE Use Case ? Is it documented somewhere :) ?

pdurbin commented 4 months ago

@DS-INRA you can find some info about CAFE at https://github.com/Climate-CAFE/Climate-CAFE.github.io

cmbz commented 2 months ago

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

IQSS / dataverse

Support measures of Data and Metadata Quality #6041