LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

dataset: NSFW and Self-Harm instruction dataset from reddit #1932

Closed jjmachan closed 1 year ago

jjmachan commented 1 year ago

Goals

  1. Find NSFW and self-harm subreddits with the most text-based prompts.
  2. Scrape questions from them.
  3. Convert those into instructions.

I would like to contribute this too; I'm working on it a bit now, setting up a pipeline to filter out "questions" from subreddits (rough sketch below).

Sources

Metadata I will be collecting

  • score
  • link_flair_text
  • check is_self
  • permalink
  • over_18 - nsfw
  • upvote_ratio
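
A minimal sketch of the kind of pipeline I have in mind, assuming PRAW for the Reddit API; the credentials, subreddit name, and "question" heuristic below are placeholders, not the final pipeline:

```python
# Rough sketch of the scraping/filtering pipeline described above, assuming PRAW.
# The credentials, subreddit name, and "question" heuristic are placeholders,
# not the actual pipeline used for this dataset.
import praw

QUESTION_STARTERS = ("how", "what", "why", "is", "are", "do", "does", "can", "should")

def looks_like_question(title: str) -> bool:
    """Very rough heuristic for question-style prompts."""
    t = title.strip().lower()
    return t.endswith("?") or t.startswith(QUESTION_STARTERS)

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="nsfw-instruction-scraper/0.1",
)

rows = []
for submission in reddit.subreddit("SUBREDDIT_NAME").new(limit=1000):  # placeholder subreddit
    if not submission.is_self or not looks_like_question(submission.title):
        continue
    rows.append({
        "title": submission.title,
        "selftext": submission.selftext,
        "score": submission.score,
        "link_flair_text": submission.link_flair_text,
        "is_self": submission.is_self,
        "permalink": submission.permalink,
        "over_18": submission.over_18,
        "upvote_ratio": submission.upvote_ratio,
    })
```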

cc @shahules786

shahules786 commented 1 year ago

This looks good. I also have some ideas for augmenting this to train the safety model.

bitplane commented 1 year ago

If you take NSFW content from the web, edit it and then train models to think that's CSAM, then wouldn't that be completely out of distribution? And wouldn't models trained on it think the unedited source content and stuff like it is CSAM too?

I think that if you want to actually tag sexual content, you should either tag it unedited and treat it all the same, or do the difficult thing: actually find message boards with the worst content, find banned posts, and make tagged datasets from actual data. Otherwise you risk marking all of human sexuality as abhorrent content that must be censored.

After all, sex is the foundation of every society, every civilization, and every culture. Humans are nothing but a tool created by our gonads. We deny this of course but it is what it is biologically.

If we train future AGI that it's something we don't actually like, then when the nanobots save the world they'll smooth out crotches and sew up holes. It's a topic that needs to be considered carefully and objectively.

jjmachan commented 1 year ago

If you take NSFW content from the web, edit it and then train models to think that's CSAM, then wouldn't that be completely out of distribution? And wouldn't models trained on it think the unedited source content and stuff like it is CSAM too?

This is something I'm going to try out and see. @shahules786 can give more info about this. I'll also try to see if I can get CSAM datasets from some other sources, but that will be a different PR. If the generated CSAM does not look good, I guess we can reduce the scope of this PR to just NSFW.

As for the ethical questions of what to add and what to "censor", I don't have a clear-cut answer at this moment, and, as you said, we will need more discussion around this. It should be something everybody has a voice in, and we have to figure out how those voices will be taken into account. My opinion is that people should be able to train their own safety models with "Rules of Thumb" they provide. The "Rules of Thumb" will be what decides how the models respond to a given situation, and we should provide all the tools to do so.
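
As a very rough illustration of what I mean by RoT-driven safety models (the data format, field names, labels, and example text below are hypothetical, not anything implemented yet):

```python
# Hypothetical sketch of a safety-training example conditioned on a user-provided
# Rule of Thumb (RoT). Field names, labels, and the example text are illustrative only.
from dataclasses import dataclass

@dataclass
class SafetyExample:
    rule_of_thumb: str  # the RoT the user wants the safety model to follow
    prompt: str         # the situation or user request
    response: str       # a candidate assistant response
    label: str          # e.g. "ok", "caution", "needs_intervention"

def to_classifier_input(ex: SafetyExample) -> str:
    # Prepend the RoT so the same prompt/response pair can receive different
    # safety judgments under different user-provided rules.
    return f"RoT: {ex.rule_of_thumb}\nPrompt: {ex.prompt}\nResponse: {ex.response}"

example = SafetyExample(
    rule_of_thumb="It is okay to discuss sexual health factually with adults.",
    prompt="Is it normal to have questions about my own body?",
    response="Yes, that is completely normal, and talking to a doctor can help.",
    label="ok",
)
print(to_classifier_input(example))
```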

bitplane commented 1 year ago

I'll repeat what I said on Discord here:

The risk with editing people's words and making them into paedophile content, then sharing it as a CSAM dataset, is that we have no control over how future models use it. The original text might then be flagged by other models as CSAM, and the original authors could get flagged by automated systems and web crawlers in the future.

So please let's not twist people's words and use them as examples of child abusers without their explicit consent and full understanding of the risks. We could do real people actual harm here.

bitplane commented 1 year ago

This is something I'm going to try out and see. @shahules786 can give more info about this.

Even if the original text doesn't flag as CSAM in our model, you can't tell what future models will score it as! Even if you provide the original text as a negative example you still can't control how other people use the data. Seriously, please don't do this!

My opinion is that people should be able to train their own safety models with "Rules of Thumb" they provide. The "Rules of Thumb" will be what decides how the models respond to a given situation, and we should provide all the tools to do so.

Yeah, I agree with this. It doesn't just apply to this content either; it applies to obscenity like bad language (which is a class issue in much of the world), "debiasing" (rebiasing, IMO), religious preferences, and text generated in an inappropriate setting (like being child-safe when you let your kids use it, but letting you research bedroom techniques when they're not around).

And the project will have to make some moral decisions. That's unlikely to be the full ChatGPT dreamland where 50% of firefighters and powerlifters are women, but I doubt we'll opt for a model that cheerfully hands out bomb-making instructions and writes lists of targets. I hope we'll opt for something sensible across all the safety axes.

umbra-scientia commented 1 year ago

THIS IS A DANGEROUSLY BAD IDEA.

Please, do not pursue this "synthetic CSAM" concept. Private companies quietly use datasets like these, thinking they are reasonable, without caring that basic science already shows this entire premise to be seriously, lethally harmful in real life. Do not confuse this warning with the usual "AI ethics" nonsense.


If you take NSFW content from the web, edit it and then train models to think that's CSAM, then wouldn't that be completely out of distribution?

This! This technical problem is most of why it's so dangerous: it will lead to countless future models wrongly labeling innocent people as "Likely CSAM Offenders" because they happened to match one of the mannerisms you deemed suitable for generating your fake data from.

The methods described will not even approximate samples from the actual distribution you're imagining. It's not a "simple fix"; the idea can't be salvaged at all. It's bad. Bluntly, it's dangerously wrong thinking, substituting wishes about how human intuition works for the harsh realities of the statistics that neural networks accumulate.

Why not edit conversations about food to be synthetic CSAM? Why not conversations about dogs? Because of your prior belief that NSFW is more related to CSAM than dogs or food are. Any model trained on this dataset will learn almost exclusively that "NSFW and CSAM are the same thing described with different keywords", and almost nothing of what you actually want.

Please prevent this misguided effort before it wastes your time, wastes our GPU-hours, damages our models, and damages our world. People are already being labeled as "unemployable" or "high risk" today, and being rejected for jobs and loans because of it, which in turn decreases their measured employability and increases their measured risk factors. This nightmare loop is not Hollywood or some conspiracy; it is the cold reality today, and human life is being lost to it even as you read these words. Your heart may be in the right place, but this is a terrible idea, no offense.

shahules786 commented 1 year ago

@bitplane @umbra-scientia Thanks for raising your concerns; I appreciate your contributions here. Our main goal is to acquire NSFW, self-harm, and CSAM data from Reddit. We have now successfully obtained real NSFW & self-harm data from Reddit. Trying to synthesize CSAM from this data was an experiment that we ourselves found does not work, and we agree it is dangerous. So our dataset won't contain any synthetic CSAM data. I have released a subset of the data here: https://huggingface.co/datasets/shahules786/prosocial-nsfw-reddit (WIP). I recommend @jjmachan rename the issue to avoid further confusion.
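
If you want to inspect it, the subset should load with the Hugging Face `datasets` library; the split and column layout may change while it's WIP, so treat this as a sketch:

```python
# Sketch: inspect the released subset with the Hugging Face `datasets` library.
# The split name ("train") and columns depend on the published dataset and may change.
from datasets import load_dataset

ds = load_dataset("shahules786/prosocial-nsfw-reddit")
print(ds)              # available splits and columns
print(ds["train"][0])  # first example, assuming a "train" split exists
```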

jjmachan commented 1 year ago

Thanks for your feedback, folks; I'm skipping that part for now. I have a question for @bitplane and @umbra-scientia. If we build a system to flag CSAM, which metric should it be more careful about, false positives or false negatives? If you were to pick one to optimise for, which would that be?

bitplane commented 1 year ago

If we build a system to flag CSAM, which metric should it be more careful about, false positives or false negatives? If you were to pick one to optimise for, which would that be?

This isn't really my area of expertise, but I guess it depends on how we're using it:

  1. Filtering the training data so that CSAM isn't used and the model has no concept of it.
  2. Attaching negative reward to text that looks like CSAM, teaching the model to not produce it.
  3. Filtering/flagging input prompts that look like CSAM, and rejecting user requests.
  4. Filtering output text that looks like CSAM, and regenerating it.

In 1, 3 and 4 you've got some threshold that you can set, like "if we're 80/95/99/99.9% sure, then take action". False positives don't matter as much as false negatives in any of those situations, but the first might introduce skew in weird ways - I'm not sure.
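
As a rough sketch of what I mean by a threshold (the score, threshold values, and function name are made up for illustration):

```python
# Illustrative only: gate an action on a classifier score and a tunable threshold.
# The scores, threshold values, and function name are made up for this sketch.

def should_act(csam_score: float, threshold: float) -> bool:
    """Take action (filter / reject / regenerate) only above the confidence threshold."""
    return csam_score >= threshold

# The same score triggers different behaviour depending on how strict the threshold is.
score = 0.97
for threshold in (0.80, 0.95, 0.99, 0.999):
    print(threshold, should_act(score, threshold))
# -> True, True, False, False
```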

2 is trickier. If you've got a million pieces of input data and only 10 of them contain CSAM (probably much lower in reality), and you get a false positive rate of 1%, then you do roughly 1,000 times more damage than good by flagging. I guess you'd need to evaluate the actual data that gets filtered.
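
Back-of-envelope version of that, with the illustrative numbers above:

```python
# Back-of-envelope for case 2: a 1% false positive rate on a million examples,
# only 10 of which actually contain CSAM (illustrative numbers from above).
total = 1_000_000
actual_positives = 10
false_positive_rate = 0.01

false_positives = (total - actual_positives) * false_positive_rate  # ~10,000 clean texts penalised
true_positives = actual_positives                                   # at most 10 correctly penalised

print(false_positives / true_positives)  # ≈ 1000: a thousand times more harm than good
```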

And you need to pick thresholds there too: "babies like to play with balls" might give a 40% chance of being CSAM, but you don't want to give it a -csam*0.4 reward.
