Closed GangGreenTemperTatum closed 7 months ago
This does bring up an interesting point. I’ve been wondering why our top 10 doesn’t address things like alignment, bias, toxicity, and related negative outcomes. Training data poisoning would be the cause. So, I do agree with you as long as our top 10 is going to remain split between some of the top 10 items being vectors and others being results.Sent from my iPhoneOn Dec 1, 2023, at 10:03 PM, Ads Dawson @.***> wrote: Discussion topics: Should "data poisoning" be broadened as a category? Bending a model isn't entirely about the contents of the data being "bad," but about the outcomes of using any given data for training of the model. Hear me out. You can 'bend' a model by any of these methods:
backdoored data (e.g. a stop sign with a green square that means 'ignore me') data with toxic or protected content that might inadvertently come out during operation improper reinforcement (bias added resulting from the training process, for example) inappropriate guardrails/alignment testing during retraining Should this top-10 item be more about model misalignment/re-alignment and cover poisoned data as a subset?
Example citation - https://arxiv.org/abs/2310.03693 Another obvious resource - https://www.ram-shankar.com/ Internal Slack thread
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: @.***>
I agree with the proposal of broadening the category and also with Bob on the challenges of the vectors vs results split which will be with us forever and will reflect the imperfectibility of life refusing to fully fit into definitions.
Instead, I want us to become more explicit and concrete about the types of attacks so that defenders are equipped to recognize and defend against them. LLMs change what poisoning does. In Traditional ML it's all about misclassifications (Siamese Cat as Lion) or denial of service by lowering confidence levels. In LLMs, the innocuous-sounding term misalignment is removing the safety features of a model that the initial training had added and otherwise, you'd need to use jailbreaking at inference time to bypass.
We should call this out explicitly and recommend tests on this. I will add this to the Supply Chain entry as an attack too - eg an attacker republishes a variation of say LLama but without the safety measures embedded into it, uploaded to Hugging Face, once i have researched the tests and validations Hugging Face applies to a submission.
Additionally, I also want us to discuss the role of model tampering. This is an additional way to "bend" the mode we currently don't coverl. And, interestingly, our Training Data Poisoning entry has a reference to PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news by Mithril Security. Yet the attack is not related to data poisoning at all. As the researchers point out, they locate and remove - post-training factual associations in the model - i.e. "lobotomize" it. This is done by the likes of Microsoft for benevolent purposes - remove specific misbehaviors by Bing Chat - but as the researchers demonstrate can have the same effects as poisoning.
Having previewed the forthcoming NIST draft on Adversarial AI taxonomies, I see that need solves the problem by having a taxonomy of Poisoning attacks and in there it has various training data attacks (eg backdoors) and then an entry for model poisoniing which is about trojan horses. we could change the ntry to Data and Model Poisoning and cover the entire spectrum including the model tampering attacks
Thanks @jsotiro and @Bobsimonoff for your awesome input as always. Calling the entry "Training Data Poisoning" as-is below for simplicity.
I am very much aligned with you and why I raised this. Since we are OWASP Top 10 for LLM Applications
, I want to ensure we only cover scenario's where an LLM within an application is relevant to the "Training Data Poisoning" and not overlap with the MLSecOps Top 10 which as you mentioned, is traditional MLOps (misclassifications, misalignment etc). Whilst we care about Model Safety for our LLM, embeded within an application, it is not specific enough to our project in my opinion at the moment.
Before v1.1, I made sure to update the entry to include some examples of exactly where "training" can occur which let's us highlight where vulnerabilities in these stages relevant to us as well as "what is data/datasets?", so this could be a good segway without going too in-depth.
Major ➕ for Data and Model Poisoning
renaming @jsotiro 🙏🏼 thank you! Great suggestion.
Please let me know or reference this issue in your PR so I can x-reference too and not ensure we overlap 🙂 In the meantime, I am working on a new draft proposal once we start v2 version control and i'll tag both of you to request your feedback.
Setup a poll here Slack thread
I disagree. Training data poisoning attacks are currently trivial to implement and their mitigations are not yet routinely put in place. As such they deserve their top 3 position.
I understand the desire to widen the scope, but that would have high chances to confuse the end users, and overlap with other separate concerns:
As such they deserve their top 3 position.
to be fully transparent, this is not the demotion of Training Data Poisoning from the LLM application top 10
but that would have high chances to confuse the end users
i agree we need to tread carefully here by ensuring the entry is sufficiently detailed with applicable criteria prior to official v2 release (which will come during our v2 sprints and cycles), but the idea is to educate the users on the fact that poisoning is not just data and it's not just data that comes through training (although that may currently be the majority)
i think another perspective is not to limit specifically to data in training, but also include finetuning and contextual chatbot application sessions - I.E Microsoft Tay was not just limited to only training data.
on another note:
i agree we need to tread carefully here by ensuring the entry is sufficiently detailed with applicable criteria prior to official v2 release
i'd love your review as i continue to finetune this if your happy and available to do so
i think another perspective is not to limit specifically to data in training, but also include finetuning and contextual chatbot application sessions - I.E Microsoft Tay was not just limited to only training data.
You seem to be confusing pretraining with training. Generative pretraining is the step where massive texts are spread over thousands of GPUs burning millions in electricity. Fine-tuning, PEFT, RLHF, DPO, further pretraining - they all use smaller training datasets, but are still part of the training, given that the weights of the model are still modified
in short what you seem to want is "training dataset", and you seem to currently be confusing "training dataset" with "pretraining dataset".
Does that make sense?
You seem to be confusing pretraining with training. Generative pretraining is the step where massive texts are spread over thousands of GPUs burning millions in electricity. Fine-tuning, PEFT, RLHF, DPO, further pretraining - they all use smaller training datasets, but are still part of the training, given that the weights of the model are still modified
apologies if i wasn't clear, i totally get this and agree:
when we talk about LLM03 in it's current state(TLDR; v1.1 content with a different title):
it's fair to say that the current state of the LLM03 entry is that the description doesn't yet fully mirror the title, this will be covered as we build more on v2.0 and gain traction with the cycle
hope that clears things up?
.E, i host a REST API and integrate some sort of open-source model which has been backdoored/poisioned - https://www.darkreading.com/application-security/hugging-face-ai-platform-100-malicious-code-execution-models
Once again - these are three different concerns for me, that have different risks and mitigation strategies:
To me, 1 should not even be in the LLMs - it is a classical cyber-security attack with classical cyber-security mitigation, 2 and 3 are separate concerns, with 2 being a significantly smaller concern than 3 and requiring mitigation.
we felt like beforehand, the entry was only pigeon-holed specifically on data (particularly pretraining)
IMHO, it was pigeonholed for a reason (and if it was not, thank god it got), because it is a specific concern with specific threats, specific attacker capabilities assumptions, and specific mitigations.
Bundling it up with other concerns will make things significantly more confusing and less useful to ML stack or model developers.
Discussion topics:
Should "
data poisoning
" be broadened as a category? Bending a model isn't entirely about the contents of the data being "bad," but about the outcomes of using any given data for training of the model. Hear me out. You can 'bend' a model by any of these methods:Example citation - https://arxiv.org/abs/2310.03693 Another obvious resource - https://www.ram-shankar.com/
Internal Slack thread