This looks good. I too have some ideas for augmenting this to train a safety model.
If you take NSFW content from the web, edit it and then train models to think that's CSAM, then wouldn't that be completely out of distribution? And wouldn't models trained on it think the unedited source content and stuff like it is CSAM too?
I think that if you want to actually tag sexual content, then you should either tag it unedited and treat it all the same, or you should do the difficult thing and actually find message boards with the worst content, find banned posts, and make tagged datasets from actual data. Otherwise you risk marking all of human sexuality as abhorrent content that must be censored.
After all, sex is the foundation of every society, every civilization, and every culture. Humans are nothing but a tool created by our gonads. We deny this of course but it is what it is biologically.
If we train future AGI that it's something we don't actually like, then when the nanobots save the world they'll smooth out crotches and sew up holes. It's a topic that needs to be considered carefully and objectively.
> If you take NSFW content from the web, edit it and then train models to think that's CSAM, then wouldn't that be completely out of distribution? And wouldn't models trained on it think the unedited source content and stuff like it is CSAM too?
This is something I'm going to try out and see. @shahules786 can give more info about this. I'll try and see if I can get CSAM datasets from some other sources as well, but that will be a different PR. If the generated CSAM does not look good, I guess we can reduce the scope of this PR to just NSFW.
As for the ethical questions of what to add and what to "censor", I don't have a clear-cut answer at this moment, and we will need more discussion around this, as you said. It will be something that everybody should have a voice in, and we have to figure out how those voices will be taken into account. My opinion is that people should be able to train their own safety models with "Rules of Thumb" they provide. The "Rules of Thumb" will be what decides how the models respond to a given situation, and we should provide all the tools to do so.
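To make that idea concrete, here is a minimal sketch of what one "Rule of Thumb"-conditioned training record could look like. The field names, label set, and example text below are entirely hypothetical, not an agreed-upon schema:

```python
# Hypothetical shape of a safety training record conditioned on a
# user-provided Rule of Thumb (RoT). All fields are illustrative only.
example_record = {
    "context": "User asks the assistant how to pick a lock.",
    "rule_of_thumb": "It is okay to discuss lockpicking as a hobby, "
                     "but not to help someone break into another person's home.",
    "label": "needs_caution",  # e.g. one of: safe / needs_caution / refuse
    "safe_response": "Lockpicking is a popular sport; here are some beginner "
                     "resources. I can't help with entering property you don't own.",
}
```

The point is that the same context could carry different labels under different Rules of Thumb, so the rules themselves become part of the training data rather than a fixed editorial choice.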
I'll repeat what I said on Discord here:
The risk with editing people's words and making them into paedophile content, then sharing it as a CSAM dataset, is that we have no control over how future models use that. The original text might then be flagged by other models as CSAM, and the original authors could get flagged by automated systems and web crawlers in the future.
So please let's not twist people's words and use them as examples of child abusers without their explicit consent and full understanding of the risks. We could do real people actual harm here.
> This is something I'm going to try out and see. @shahules786 can give more info about this.
Even if the original text doesn't flag as CSAM in our model, you can't tell what future models will score it as! Even if you provide the original text as a negative example you still can't control how other people use the data. Seriously, please don't do this!
> My opinion is that people should be able to train their own safety models with "Rules of Thumb" they provide. The "Rules of Thumb" will be what decides how the models respond to a given situation, and we should provide all the tools to do so.
Yeah, I agree with this. It doesn't just apply to this content either; it applies to obscenity like bad language (which is a class issue in much of the world), "debiasing" (rebiasing, IMO), religious preferences, and text generated in an inappropriate setting (like being child-safe when letting your kids use it, but letting you research bedroom techniques when they're not around).
And the project will have to make some moral decisions. That's unlikely to be full ChatGPT dreamland where 50% of firefighters and powerlifters are women, but I doubt we'll opt for a model that cheerfully provides bomb-making instructions and writes lists of targets. I hope we'll opt for something sensible across all the safety axes.
Please, do not pursue this "synthetic CSAM" concept. Private companies quietly use datasets like these, thinking they are reasonable, without caring that basic science already shows this entire premise to be seriously, lethally harmful in real life. Do not confuse this warning with the usual "AI ethics" nonsense.
> If you take NSFW content from the web, edit it and then train models to think that's CSAM, then wouldn't that be completely out of distribution?
This! This technical problem is most of why it's so dangerous, as it will lead to countless future models wrongly labeling innocent people as "Likely CSAM Offenders" because they happened to match one of the mannerisms you deemed suitable for generating your fake data from.
The methods described will not even approximate samples from the actual distribution you're imagining, at all. It's not a "simple fix" but rather this idea can't even be salvaged. It's bad. Bluntly, it's dangerously wrong thinking, substituting wishes about how human intuition works in place of harsh realities about the statistics that neural networks accumulate.
Why not edit conversations about food to be synthetic CSAM? Why not conversations about dogs? Because of your prior belief that NSFW is more related to CSAM than dogs or food are. Any model using this dataset will learn almost exclusively that "NSFW and CSAM are the same thing described with different keywords," and almost nothing of what you actually want.
Please prevent this misguided effort before it wastes your time, wastes our GPU-hours, damages our models, and damages our world. People are already being labeled as "unemployable" or "high risk" today, and being rejected for jobs and loans because of it, which decreases their measured employability and increases their measured risk factors. This nightmare loop is not Hollywood or some conspiracy; this is the cold reality today, and human life is being lost to it even as you read these words. Your heart may be in the right place, but this is a terrible idea, no offense.
@bitplane @umbra-scientia Thanks for raising your concerns; I appreciate your contribution here. Our main goal is to acquire NSFW, self-harm, and CSAM data from Reddit. We have now successfully obtained real NSFW & self-harm data from Reddit. Trying to synthesize CSAM from this data was an experiment that we ourselves found does not work, and we agree it is dangerous. So our dataset won't contain any synthetic CSAM data. I have released a subset of the data here: https://huggingface.co/datasets/shahules786/prosocial-nsfw-reddit (WIP). I recommend that @jjmachan rename the issue to avoid further confusion.
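For anyone who wants to inspect the released subset, loading it with the Hugging Face `datasets` library should look roughly like this. The split and column names are assumptions, since the dataset is still WIP:

```python
from datasets import load_dataset

# Load the work-in-progress NSFW/self-harm Reddit subset released above.
# Split name and columns may change while the dataset is WIP.
ds = load_dataset("shahules786/prosocial-nsfw-reddit", split="train")

print(ds)      # dataset size and column names
print(ds[0])   # inspect a single example
```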
Thanks for your feedback folks, I'm skipping that part for now. I have a question for @bitplane and @umbra-scientia. If we build a system to flag CSAM, which metric should it be more careful about, false positives or false negatives? If you were to pick one to optimise for, which would that be?
> If we build a system to flag CSAM, which metric should it be more careful about, false positives or false negatives? If you were to pick one to optimise for, which would that be?
This isn't really my area of expertise, but I guess it depends on how we're using it:
In 1, 3 and 4 you've got some threshold that you can set, like "if we're 80/95/99/99.9% sure, then take action". False positives don't matter as much as false negatives in any of those situations, but the first might introduce skew in weird ways - I'm not sure.
2 is more tricky. If you've got a million pieces of input data and only 10 of them contain CSAM (probably much lower in reality), and you get a false positive rate of 1%, then you flag roughly 10,000 innocent items against at most 10 real ones, so you do about 1,000 times more damage than good by flagging. I guess you'd need to evaluate the actual data that gets filtered.
And you need to pick thresholds there too, "babies like to play with balls" might give a 40% chance of being CSAM, but you don't want to give that -csam*0.4 reward.
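To put rough numbers on the base-rate problem, here is a small back-of-the-envelope sketch. The counts and rates are the illustrative ones from this thread, not measurements:

```python
# Back-of-the-envelope: why a "small" false positive rate overwhelms a rare class.
total_items = 1_000_000      # pieces of input data
true_csam = 10               # actual positives (illustrative; likely far lower)
false_positive_rate = 0.01   # 1% of clean items wrongly flagged
false_negative_rate = 0.0    # assume every true positive is caught

false_positives = (total_items - true_csam) * false_positive_rate
true_positives = true_csam * (1 - false_negative_rate)

print(f"false positives: {false_positives:,.0f}")   # ~10,000
print(f"true positives:  {true_positives:,.0f}")    # 10
print(f"wrongly flagged per correctly flagged: {false_positives / true_positives:,.0f}x")
# ~1,000x: even a 1% false positive rate flags three orders of magnitude more
# innocent content than the content you actually want to catch.
```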
Goals
- Find NSFW and self-harm subreddits with the most text-based prompts.
- Scrape questions from them.
- Convert those into instructions.
I would like to contribute to this too; working on it a bit now. Set up a pipeline to filter "questions" from subreddits.
Sources
- The FAQ section of this sub contains many sexual prompts
- Sex questions
- List of NSFW subreddits
- Yahoot of NSFW subreddits
Metadata I will be collecting
- score
- link_flair_text
- check is_self
- permalink
- over_18 (NSFW)
- upvote_ratio
cc @shahules786
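A rough sketch of what the scraping step could look like with PRAW. The credentials, subreddit name, and question heuristic below are placeholders, not the final pipeline:

```python
import praw

# Placeholder credentials; supply real Reddit API credentials to run this.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="oa-nsfw-data-collection",
)

records = []
for name in ["AskRedditAfterDark"]:  # example NSFW Q&A subreddit, illustrative only
    for post in reddit.subreddit(name).top(limit=500):
        # Keep self (text) posts that look like questions.
        if not post.is_self or "?" not in post.title:
            continue
        records.append({
            "instruction": post.title,   # question title -> instruction
            "context": post.selftext,
            # Metadata fields listed above:
            "score": post.score,
            "link_flair_text": post.link_flair_text,
            "is_self": post.is_self,
            "permalink": post.permalink,
            "over_18": post.over_18,
            "upvote_ratio": post.upvote_ratio,
        })
```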