IQSS / dataverse

Open source research data repository software
http://dataverse.org

Reviewers using anonymous private URL might learn dataset author's identity from information about the Dataverse installation or collection #8184

Open jggautier opened 2 years ago

jggautier commented 2 years ago

Information about the installation and about the Dataverse collection that the dataset is in could help reveal the dataset author's identity to the dataset reviewer.

Information about the repository housing the dataset:

The anonymous private URL page shows the name of the Dataverse repository/installation that the dataset is in, and the reviewer can navigate around the repository website to find more information about the repository. This could be an issue for Dataverse repositories with a narrower, more focused audience, like the repositories that only allow researchers affiliated with a certain institution to deposit datasets.

Information about the Dataverse collection housing the dataset:

The anonymous private URL page shows the name of the Dataverse collection that the dataset is in, even if the Dataverse collection is unpublished. This feature was meant mostly for "Journal Dataverse collections," (#1724) so we should expect that the reviewer would already know, before ever visiting the anonymous private URL page, that the dataset is associated with a particular journal.

But the depositor's dataset could be in a Dataverse collection whose information (such as collection name or description) could be used to identify the author. This point was also brought up in two comments (1, 2) in the original GitHub issue. For example, many collections include the researcher's name because when people create Dataverse collections, the Dataverse software prefills the "Dataverse Name" field with the name of the Dataverse repository account that created the Dataverse collection. This is often the author's name, and the reviewer can see that Dataverse collection name in the breadcrumbs on the anonymous private URL page.

djbrooke commented 2 years ago

Thanks @jggautier!

I mentioned in Slack that I think it would be challenging to implement a programmatic fix for this, as you'd need to obscure the collection name and also potentially obscure names of other datasets, subcollections, parent collections etc. We could also revisit the functionality generally in order to not allow navigation off the dataset, but this would be a big change as well - right now the application just creates a temporary user that allows the access. Food for thought if we're able to prioritize this at some point in the future.

jggautier commented 2 years ago

Thanks. Do you think users might share anonymous URLs before they realize that reviewers might see information about the repository or the dataset's Dataverse collection that could give away who the author is?

A careful depositor might check the URL before sharing it and realize this, but I think there are things we could do to increase the chances that most users will realize this, like adding this info in the User Guides or in the popup.

djbrooke commented 2 years ago

@jggautier Oh yeah, option three - better explanatory text. :) All for it!

jggautier commented 2 years ago

Hi everyone. Changes in the UI that @TaniaSchlatter and I are proposing are in the PDF, Proposed changes to Anonymous Private URL.pdf. The PDF has two boxes, the first describing how the feature works as designed now (v5.7) and the second describing changes based on reviews by the curation team at Harvard Dataverse Repository and @kmika11's review with some researchers who've needed to share their datasets anonymously.

Changes to the User Guides section about this feature are in the Google Doc at https://docs.google.com/document/d/1bn4fIPr_yhOj_DYDldzdKEZjmETV-WLYc98sWTgcg58.

The changes are meant to address the issue described in this GitHub issue as well as address confusion about the differences between the two types of URLs (https://github.com/IQSS/dataverse/issues/8185). (@jeisner brought up other points in an older GitHub issue, particularly about being able to anonymously share a dataset that's already been published, that this feature doesn't address.)

The next steps are:

philippconzett commented 2 years ago

Thanks for sharing the progress on this feature! The proposed changes look all good to me. The term "Prepublish URL" is clearer than "Private URL", and the descriptions in the pop-up windows and in the user guide are all very clear. I think the anonymized version of the Prepublish URL is mainly useful in cases where a dataset is part of a double-blind peer-review process. I have added a note on this in the Google doc.

As mentioned earlier, in DataverseNO, we use a special, unpublished collection for datasets that are part of a double-blind peer review process. Page 12 in this presentation summarizes how DataverseNO currently supports double-blind peer review. See also this fake example of an anonymized dataset in our double-blind peer review collection.

Maybe an easy(?) way to enhance the Prepublish URL feature even more could work like this: When the depositor or curator (depending on the access rights) clicks the Prepublish URL button and selects Anonymous Review, a copy of the dataset will be pushed into an anonymized collection like the double-blind peer review collection at DataverseNO, and the copy will be anonymized following the current Anonymous Review feature.

That way, the name of the repository would still be revealed, but the collection would be anonymized.

jggautier commented 2 years ago

Thanks @philippconzett :) I think for now we've decided to change the layout and the text on the popup to help the depositor understand the limitations of the feature, like how the collection name can help reveal the authors' identities.

I'm all for opening another issue specifically for discussing ways to remove that limitation. @djbrooke and @TaniaSchlatter, what do you think?

To get more feedback about the redesigned popup, we reviewed it with 6 people - 5 people who I found used workarounds to deposit datasets in Harvard Dataverse Repository for anonymous review and 1 person who manages a journal's Dataverse Collection in the repository and has been interested in support for anonymous review. The redesign seemed to work well and I made small text-based adjustments based on the feedback:

These screenshots of the popup show the text changes:

[Screenshot: Screen Shot 2021-12-16 at 9 22 52 AM]

I also split the last block of text in two to improve readability, clarified that the files will be "accessible" if they're not restricted, and changed "data files" to "dataset's files". We heard during the review of the metadata tooltips that "data files" could be interpreted to exclude other types of files like "documentation files" and "code files", so I think it's better to use broader language here.

We learned more about the feature in general, including how discoverable it is (or isn't), and we heard things about the journal review process that I think we need to learn more about, so I'm working on summarizing that feedback and recommending next steps.

TaniaSchlatter commented 2 years ago

The wording and layout changes outlined above should help from the UI perspective; however, moving them forward is not a complete programmatic fix.

meghangoodchild commented 2 years ago

Thanks for the opportunity to provide feedback. The anonymous review feature is certainly a desired feature.

Based on some discussions with members from our community, we learned about several experiences where researchers have used the private URL in their article's data availability statement (instead of the DOI). We would like to stress the importance of using terminology that emphasizes the temporary nature of the URL, such as "temporary prepublish URL" or "prepublish preview URL".

philippconzett commented 2 years ago

We have had the same experience as @meghangoodchild describes - although we have emphasized for the depositors that they must replace the private URL with the DOI before the article is published. Right now, this is the case in a Nature article that was published several months ago and so far, we have not been able to make Nature replace the private URL with the DOI. As a result, we cannot publish the dataset, because that would cause the private URL to be deactivated and the dataset URL in the reference list of the article would thus no longer resolve.

Maybe you could include some explicit wording in the private URL feature that makes depositors aware of the importance of making sure that the dataset reference in the final article must contain the dataset DOI, not the private URL.

jggautier commented 2 years ago

Thanks @meghangoodchild and @philippconzett. We've heard the same thing, and I saw that figshare mentions in their guides that their "private link" shouldn't be used to cite the data in publications. I'm proposing adjusting the name of the feature and adding a line in the popup (and in the User Guides) about how the dataset's PID should be used to cite the data in publications:

[Screenshot: Screen Shot 2022-01-12 at 3 08 38 PM]

Because the name of the feature is in the URL, too, if the name makes the temporary nature of the link more obvious, hopefully it'll be more obvious to researchers and journal editors just by looking at the URL, e.g. https://demo.dataverse.org/previewurl.xhtml?token=0f04F8c2-bcer-4adf-816d-3b950c73ddce

But like I mentioned in emails, we'll be trying to contact journals and publishers to learn more about why authors have been adding this temporary URL to their articles in the first place and why there's friction when that URL needs to be replaced by the PID before the article is published. We've seen journal and publisher policies, like Springer Nature's policies, that I'd think are pretty explicit about using persistent IDs in articles to cite data. Is there anything about a publisher's or journal's processes that contribute to this friction?

We've also seen that sometimes researchers don't realize that the PIDs of unpublished datasets will "work" (lead to the datasets) once the datasets are published. Would making this fact more obvious encourage researchers to cite datasets with PIDs instead of private URLs?
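To make the concern above concrete (a temporary URL ending up in an article where the PID belongs), here is a minimal sketch of a helper that flags references that look like preview URLs. Everything in it is illustrative: `PreviewUrlCheck` and `looksLikePreviewUrl` are hypothetical names, not Dataverse code; `previewurl.xhtml` is taken from the example URL in this thread, and `privateurl.xhtml` is assumed to be the current page name.

```java
import java.net.URI;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the Dataverse codebase.
final class PreviewUrlCheck {

    // Assumed page names: "previewurl.xhtml" comes from the example URL in this
    // thread, "privateurl.xhtml" is an assumption about the current name.
    private static final Set<String> PREVIEW_PAGES =
            Set.of("/previewurl.xhtml", "/privateurl.xhtml");

    // Returns true when the reference looks like a temporary preview URL
    // (a known preview page whose query string starts with a token parameter).
    static boolean looksLikePreviewUrl(String reference) {
        if (reference == null) {
            return false;
        }
        URI uri;
        try {
            uri = URI.create(reference.trim());
        } catch (IllegalArgumentException e) {
            return false; // not parseable as a URI, so not a preview URL
        }
        String path = uri.getPath();   // null for opaque URIs such as "doi:10.xxxx/..."
        String query = uri.getQuery();
        return path != null
                && PREVIEW_PAGES.contains(path)
                && query != null
                && query.startsWith("token=");
    }
}
```

A journal-side or repository-side check could call `looksLikePreviewUrl` on each dataset reference and ask the author for the PID whenever it returns true. The sketch only looks at the page name and a leading `token` parameter, so it would miss URLs where the token is not the first query parameter.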

jggautier commented 2 years ago

@TaniaSchlatter agrees that the redesign of the feature name, the popup, banner messages, and relevant guide pages are done and can be moved to development when possible.

The changes are illustrated in mockups in an image and in a section of a virtual whiteboard. They include changes to:

The changes to the guide pages - pages in the User, API, Installation, Developer and Style guides - are in the Google Doc at https://docs.google.com/document/d/1bn4fIPr_yhOj_DYDldzdKEZjmETV-WLYc98sWTgcg58

The change to the name of the feature will require changes to the names of associated code files, e.g. PrivateUrlUtil.java

mreekie commented 2 years ago

Worked on by

scolapasta commented 7 months ago

Not rendering information about the Dataverse collection should be relatively straightforward (would have to see how it looks), i.e. don't show the Dataverse collection header, don't show breadcrumbs.

Repository name can't be hidden as it's part of the URL.

Currently not sure about the past versions - since I'm not sure if the persistent ID is exposed in this; if it is, then an end user could use that to find the dataset. If it's not, then we should be able to not render the version tabs.

Still need to determine how to deal with published versions - would we have published versions of a dataset needing anonymous review? @jggautier If this is the case, then we can code to not allow the creation of an anonymous link once a dataset has one published version.


Other considerations:

- We still have the lingering issue of small repositories being at risk of identifying information exposure.
  - Julian had suggested not providing a URL (which contains the repository name) but instead providing a PDF of the data to avoid interacting with the repository identifying information.
  - What alternatives can we consider?
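As a rough sketch of the "don't render" idea in this comment, the decision could be expressed as a small policy that lists which page elements stay visible for anonymized access. The names here (`AnonymizedViewPolicy`, `PageElement`, `visibleElements`) are hypothetical and do not correspond to existing Dataverse classes, and the version tabs are left out because that question is still open above.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical policy for illustration only; not an existing Dataverse class.
final class AnonymizedViewPolicy {

    // Page elements named in this thread, plus two placeholders for the
    // dataset content that stays visible to the reviewer.
    enum PageElement { COLLECTION_HEADER, BREADCRUMBS, DATASET_METADATA, FILES }

    // Returns the elements to render. For anonymized access, the collection
    // header and breadcrumbs are dropped because they can name the author.
    static Set<PageElement> visibleElements(boolean anonymizedAccess) {
        EnumSet<PageElement> visible = EnumSet.allOf(PageElement.class);
        if (anonymizedAccess) {
            visible.remove(PageElement.COLLECTION_HEADER);
            visible.remove(PageElement.BREADCRUMBS);
        }
        return visible;
    }
}
```

The repository name is deliberately not modeled here, since, as noted above, it can't be hidden while it remains part of the URL.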

sbarbosadataverse commented 7 months ago

From the comments on Jan 25, 2022: This task needs review (@scolapasta): "Changes to the text, layout, and interaction of the feature's popup"

@jggautier can you explain further what this particular change would fix?

scolapasta commented 7 months ago

So reviewing with @qqmyers, it does seem that anonymous peer review can only happen for initial drafts - which means there are no previously published versions. That means there's no code/logic to worry about there, but also that the suggested popups above don't even need to mention previous versions in that case.

jggautier commented 7 months ago

The changes are meant to:

I didn't know that anonymous peer review is available only for initial drafts of the dataset, but that's great!

adam3smith commented 7 months ago

So this only changes the guidance, correct? I think the changes look good and worthwhile, though ime you shouldn't expect big effects on user behavior from any written text.

jggautier commented 7 months ago

Yeah changes to text, and also the popup's layout and interaction

sbarbosadataverse commented 7 months ago

My next question is: can we get this done and on the list of prioritization, @scolapasta @jggautier? Any blockers to the changes we need to make to have this work on HDV?

qqmyers commented 6 months ago

FWIW: The text proposed, at least as of https://github.com/IQSS/dataverse/issues/8184#issuecomment-1011170056, indicated that restricted files are not available with the anonymized preview URL - that is not currently the case. If this is desired, I suspect it may need to be an option (presumably some review requires looking at the restricted files?). #10403 is proposing to allow this in general (allowing users to be given the ability to view unpublished datasets but not restricted files), but that doesn't necessarily change how anonymized preview URLs work, so that would have to be handled somewhere. In any case, the text change here shouldn't include that unless/until the functionality is changed.