IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Spike: Investigate effect of spam prevention on dataset publishing in the Harvard Dataverse #221

Open jggautier opened 1 year ago

jggautier commented 1 year ago

I think some research into the effect of the spam prevention code on Harvard's repository might help us determine the urgency of improving how Harvard Dataverse handles spam, so that more people can continue sharing data as quickly as possible.

For example, could we try to get a better idea of how often the spam prevention is flagging and preventing the publication of non-spam, and why? We could look at the number of times that people have emailed the repository's support about this, since those emails are recorded in RT.

And some people affected by this might not email support. They might try to create a different dataset or they might abandon the dataset and try a different repository.

To get a better sense of how often this happens, we could find unpublished dataset versions that have been or would likely be flagged as potential spam (for example, because there are URLs in their description metadata fields, which the spam detection doesn't like) and try to learn from the depositors whether the spam detection is the reason they haven't published those versions.
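To make this more concrete, here's a rough sketch of how such a scan could work against the native API, assuming a superuser API token and a list of dataset PIDs to check (the base URL usage, token, PID list, and the URL-matching heuristic are all placeholders, not an existing script):

```python
# Hypothetical sketch: scan draft dataset versions for URLs in their
# description fields, i.e. the pattern the spam detection is said to dislike.
# Assumes a superuser API token that can see other users' drafts.
import re
import requests

BASE = "https://dataverse.harvard.edu"
API_TOKEN = "xxxxxxxx-xxxx"  # placeholder superuser token
URL_RE = re.compile(r"https?://", re.IGNORECASE)

def draft_description_has_url(pid: str) -> bool:
    """Fetch a dataset's draft version and look for URLs in dsDescription."""
    r = requests.get(
        f"{BASE}/api/datasets/:persistentId/versions/:draft",
        params={"persistentId": pid},
        headers={"X-Dataverse-key": API_TOKEN},
    )
    if r.status_code != 200:  # no draft version, or no permission
        return False
    fields = r.json()["data"]["metadataBlocks"]["citation"]["fields"]
    for f in fields:
        if f["typeName"] != "dsDescription":
            continue
        for compound in f["value"]:
            desc = compound.get("dsDescriptionValue", {}).get("value", "")
            if URL_RE.search(desc):
                return True
    return False

# The PID list would come from a database export or the Search API.
pids_to_check = ["doi:10.7910/DVN/EXAMPLE"]  # placeholder
flagged = [pid for pid in pids_to_check if draft_description_has_url(pid)]
print(f"{len(flagged)} draft(s) contain URLs in their description fields")
```

The depositors of any drafts this turns up could then be contacted to ask why the versions were never published.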

In recent Slack conversations, there was discussion about how to improve the spam detection so that fewer non-spam deposits are affected. It was suggested that Dataverse collections could be added to a safe list so that any dataset versions published in those collections would never be flagged as potential spam.

If some unacceptable number of non-spam datasets deposited in "Root" are being flagged as spam, what could be done?

sbarbosadataverse commented 1 year ago

Soner left the following message: "I wanted to bring this issue to your attention. Since we put in the spam filter service, my team and I have resolved over 100 tickets in the last 8 weeks or so: users can't publish their datasets, and we need to go in and publish perfectly fine datasets for them. Some users are ok with us doing this step, while others are curious whether this new service is temporary, or are annoyed by it. In some instances, we have to contact Leonid and have him whitelist their Dataverse so that users can publish their datasets without contacting us every time they make a minor change. I wanted to find out if this spam filter service is temporary and if a new solution is coming soon, so that we will not need to publish users' legit datasets for them. Sonia and Ceilyn know about this; we chatted a few weeks ago, but they haven't had a chance to bring it to your attention."

cmbz commented 1 year ago

Moving issue to Needs Sizing column. Once sized, it can be prioritized and worked on post DCM 2023.

cmbz commented 1 year ago

2023/07/17: This issue will be prioritized after the USCB dataset support has been resolved (due to resource constraints; Leonid will be participating in USCB support)

cmbz commented 1 year ago

See also related issues/PRs to address the spike issue incrementally:

cmbz commented 1 year ago

2023/08/28 To complete the work needed to improve user experience of spam handling, we need two issues to be written, sized, and prioritized:

landreev commented 11 months ago

I'm going to use this spike to document and discuss some incremental improvements to the content validation in prod (this has been discussed previously, but I can open a new issue for that instead).

I am ready to switch to the new model of handling datasets for which an attempt to publish has triggered an alarm from the validation script; I just need confirmation that what's described below is OK (and that everybody in support is aware of the changes):

Once the switch is made, users will no longer be instructed to contact support in order to get their dataset published. Instead, an RT ticket will be opened automatically, so that the dataset can be reviewed and published, or deleted, as needed.

The following new message will be shown to the user:

This dataset did not pass our automated metadata validation scans and cannot be published right away. Please note that this may be in error. The repository team has been notified and will publish your dataset within 24 hours. No further action is required.
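(For context, a message like this would typically be set through the Dataverse admin settings API; the sketch below assumes the `:DatasetMetadataValidationFailureMsg` setting associated with the external metadata validation feature. The setting name is assumed here, not confirmed in this thread.)

```python
# Hedged sketch: set the validation-failure message via the admin settings
# API (PUT /api/admin/settings/{name}), run on the server itself. The
# setting name :DatasetMetadataValidationFailureMsg is an assumption.
import requests

MSG = (
    "This dataset did not pass our automated metadata validation scans "
    "and cannot be published right away. Please note that this may be "
    "in error. The repository team has been notified and will publish "
    "your dataset within 24 hours. No further action is required."
)

r = requests.put(
    "http://localhost:8080/api/admin/settings/:DatasetMetadataValidationFailureMsg",
    data=MSG.encode("utf-8"),
)
r.raise_for_status()
```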

An RT ticket will be opened in the standard dataverse_support queue and will look as follows:

Title: Potential spam dataset doi:10.7910/DVN/ZZZZZ, please review

Filter triggered for dataset
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ZZZZZ
Please review and publish or delete, as needed.

(this ticket was opened by the automated content validator script)

If any changes to the above text are needed, please let me know. Otherwise, we are ready to switch to this scheme.
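For reference, here is a rough sketch of how the validator script could open such a ticket automatically, assuming RT's REST 2.0 interface (POST /REST/2.0/ticket); the RT hostname, auth token, and DOI are placeholders, and the actual script may file tickets differently:

```python
# Illustrative sketch: open an RT review ticket from the automated content
# validator script, assuming the RT REST 2.0 API. Hostname, token, and DOI
# below are placeholders.
import requests

RT_BASE = "https://rt.example.edu"  # placeholder RT instance
RT_TOKEN = "xxxxxxxx"               # placeholder auth token

def open_spam_review_ticket(doi: str) -> None:
    """File a ticket in the dataverse_support queue for a flagged dataset."""
    body = (
        "Filter triggered for dataset\n"
        f"https://dataverse.harvard.edu/dataset.xhtml?persistentId={doi}\n"
        "Please review and publish or delete, as needed.\n\n"
        "(this ticket was opened by the automated content validator script)"
    )
    r = requests.post(
        f"{RT_BASE}/REST/2.0/ticket",
        headers={"Authorization": f"token {RT_TOKEN}"},
        json={
            "Queue": "dataverse_support",
            "Subject": f"Potential spam dataset {doi}, please review",
            "Content": body,
        },
    )
    r.raise_for_status()

open_spam_review_ticket("doi:10.7910/DVN/ZZZZZ")
```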

landreev commented 11 months ago

I will be adding info about other potential changes.

jggautier commented 11 months ago

Since it's possible that the support team won't publish the dataset, maybe because it's actually spam, what about editing the third sentence to say that "The repository team has been notified and within 24 hours will either publish your dataset or contact you"?

landreev commented 11 months ago

@sbarbosadataverse I have configured it with your warning message (in my comment above). But if you want to change it, based on @jggautier's suggestion above or otherwise, just Slack me the new message, please; it can be modified instantly.

cmbz commented 9 months ago

2023/12/18

landreev commented 9 months ago

To recap, earlier in the fall we applied the first batch of improvements: most importantly, the one outlined above - switched to automatically generating RT issues for datasets that trigger the filter (instead of instructing the users to contact support themselves). Also, whitelisting mechanisms have been extended - it is now possible to whitelist specific collections and users, in addition to datasets.

As the next phase, I'd like to discuss a couple of extra changes that could further streamline and simplify the process, and (potentially) minimize bad user experience caused by false positives a bit more. The ideas below are the result of slack discussions with members of the curation team.

  1. Consider disabling validation checks on collections altogether. We are still enforcing the policy of requiring most users to go through support in order to create collections. Is there a realistic danger that somebody who has convinced the support team that they are legitimate data depositors will proceed to post spam? The only users who are allowed to create collections are those authenticated via HarvardKey and the institutional logins of a couple of other trusted schools. May be safe-ish to assume that they are unlikely to create anything inappropriate either (?).
  2. By the same logic as the above, should we consider disabling validation checks on the datasets in all the sub-collections (i.e. all the collections other than the top-level Harvard Dataverse collection)?

This way, the only content we'll be validating will be the datasets in the top-level root collection, which is the only place where we allow a truly random person to walk in, open an account, and create a dataset.
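A sketch of what the resulting decision logic could look like (the type and helper names and the root alias are hypothetical; the safelist checks reflect the whitelisting mechanisms described above):

```python
# Hypothetical sketch of the proposed scoping: run the spam filter only on
# datasets created directly in the top-level root collection, skipping
# safelisted collections, users, and datasets as well as all sub-collections.
from dataclasses import dataclass, field

ROOT_ALIAS = "harvard"  # assumed alias of the top-level root collection

@dataclass
class Safelist:
    datasets: set = field(default_factory=set)     # safelisted dataset PIDs
    collections: set = field(default_factory=set)  # safelisted collection aliases
    users: set = field(default_factory=set)        # safelisted user identifiers

@dataclass
class Dataset:
    pid: str
    owner_alias: str  # alias of the collection the dataset lives in
    depositor: str

def should_run_spam_filter(ds: Dataset, sl: Safelist) -> bool:
    """Validate only non-safelisted datasets sitting directly in root."""
    if ds.pid in sl.datasets:
        return False
    if ds.owner_alias in sl.collections:
        return False
    if ds.depositor in sl.users:
        return False
    return ds.owner_alias == ROOT_ALIAS
```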

Anything that I'm missing/any reasons any of the above is a bad idea?

pdurbin commented 9 months ago

So in short, perhaps we could trust new items in existing collections. Sure, worth a shot, I'd say.

landreev commented 9 months ago

So in short, perhaps we could trust new items in existing collections.

... And the collections themselves. As of now, we are validating collection metadata as well, whenever collections are published or updated. These checks generate false positives too. This is especially problematic with edits of already published collections: we cannot use the approach of opening an RT ticket and having support review the changes, since there is no concept of versioning or drafts for collections.

landreev commented 5 months ago

@sbarbosadataverse Would you approve of this proposal I made here some time back? I posted about it in Slack channels too and got positive feedback from some support/curation members. It's quoted below, but, in short: should we try to run the spam filter only on the datasets in the top-level root collection, and not on sub-collections, since those have to go through curation already?

To recap, earlier in the fall we applied the first batch of improvements: most importantly, the one outlined above - switched to automatically generating RT issues for datasets that trigger the filter (instead of instructing the users to contact support themselves). Also, whitelisting mechanisms have been extended - it is now possible to whitelist specific collections and users, in addition to datasets.

As the next phase, I'd like to discuss a couple of extra changes that could further streamline and simplify the process, and (potentially) minimize bad user experience caused by false positives a bit more. The ideas below are the result of slack discussions with members of the curation team.

  1. Consider disabling validation checks on collections altogether. We are still enforcing the policy of requiring most users to go through support in order to create collections. Is there a realistic danger that somebody who has convinced the support team that they are legitimate data depositors will proceed to post spam? The only users who are allowed to create collections are those authenticated via HarvardKey and the institutional logins of a couple of other trusted schools. May be safe-ish to assume that they are unlikely to create anything inappropriate either (?).

  2. By the same logic as the above, should we consider disabling validation checks on the datasets in all the sub-collections (i.e. all the collections other than the top-level Harvard Dataverse collection)?

This way, the only content we'll be validating will be the datasets in the top-level root collection, which is the only place where we allow a truly random person to walk in, open an account, and create a dataset.

Anything that I'm missing/any reasons any of the above is a bad idea?

cmbz commented 2 months ago

2024/07/10

sbarbosadataverse commented 2 months ago

Opened a new Monitoring issue for spam in production:

cmbz commented 1 week ago

Assigning to @sbarbosadataverse and @landreev so they can provide an update on status. Should this issue stay on hold? Something else?