LLM03 - Potential Expand of Data Poisoning as a Topic to Cover Unintentional Safety-related Harm etc.

OWASP / www-project-top-10-for-large-language-model-applications

OWASP Foundation Web Respository

Other

524 stars 134 forks source link

LLM03 - Potential Expand of Data Poisoning as a Topic to Cover Unintentional Safety-related Harm etc. #251

Closed GangGreenTemperTatum closed 7 months ago

GangGreenTemperTatum commented 10 months ago

Discussion topics:

Should "data poisoning" be broadened as a category? Bending a model isn't entirely about the contents of the data being "bad," but about the outcomes of using any given data for training of the model. Hear me out. You can 'bend' a model by any of these methods:

backdoored data (e.g. a stop sign with a green square that means 'ignore me')
data with toxic or protected content that might inadvertently come out during operation
improper reinforcement (bias added resulting from the training process, for example)
inappropriate guardrails/alignment testing during retraining Should this top-10 item be more about model misalignment/re-alignment and cover poisoned data as a subset?

Example citation - https://arxiv.org/abs/2310.03693 Another obvious resource - https://www.ram-shankar.com/

Internal Slack thread

NerdAboutTown commented 10 months ago

This does bring up an interesting point. I’ve been wondering why our top 10 doesn’t address things like alignment, bias, toxicity, and related negative outcomes. Training data poisoning would be the cause. So, I do agree with you as long as our top 10 is going to remain split between some of the top 10 items being vectors and others being results.Sent from my iPhoneOn Dec 1, 2023, at 10:03 PM, Ads Dawson @.***> wrote: Discussion topics: Should "data poisoning" be broadened as a category? Bending a model isn't entirely about the contents of the data being "bad," but about the outcomes of using any given data for training of the model. Hear me out. You can 'bend' a model by any of these methods:

backdoored data (e.g. a stop sign with a green square that means 'ignore me') data with toxic or protected content that might inadvertently come out during operation improper reinforcement (bias added resulting from the training process, for example) inappropriate guardrails/alignment testing during retraining Should this top-10 item be more about model misalignment/re-alignment and cover poisoned data as a subset?

Example citation - https://arxiv.org/abs/2310.03693 Another obvious resource - https://www.ram-shankar.com/ Internal Slack thread

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: @.***>

jsotiro commented 10 months ago

I agree with the proposal of broadening the category and also with Bob on the challenges of the vectors vs results split which will be with us forever and will reflect the imperfectibility of life refusing to fully fit into definitions.

Instead, I want us to become more explicit and concrete about the types of attacks so that defenders are equipped to recognize and defend against them. LLMs change what poisoning does. In Traditional ML it's all about misclassifications (Siamese Cat as Lion) or denial of service by lowering confidence levels. In LLMs, the innocuous-sounding term misalignment is removing the safety features of a model that the initial training had added and otherwise, you'd need to use jailbreaking at inference time to bypass.

We should call this out explicitly and recommend tests on this. I will add this to the Supply Chain entry as an attack too - eg an attacker republishes a variation of say LLama but without the safety measures embedded into it, uploaded to Hugging Face, once i have researched the tests and validations Hugging Face applies to a submission.

Additionally, I also want us to discuss the role of model tampering. This is an additional way to "bend" the mode we currently don't coverl. And, interestingly, our Training Data Poisoning entry has a reference to PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news by Mithril Security. Yet the attack is not related to data poisoning at all. As the researchers point out, they locate and remove - post-training factual associations in the model - i.e. "lobotomize" it. This is done by the likes of Microsoft for benevolent purposes - remove specific misbehaviors by Bing Chat - but as the researchers demonstrate can have the same effects as poisoning.

jsotiro commented 10 months ago

Having previewed the forthcoming NIST draft on Adversarial AI taxonomies, I see that need solves the problem by having a taxonomy of Poisoning attacks and in there it has various training data attacks (eg backdoors) and then an entry for model poisoniing which is about trojan horses. we could change the ntry to Data and Model Poisoning and cover the entire spectrum including the model tampering attacks

GangGreenTemperTatum commented 10 months ago

Thanks @jsotiro and @Bobsimonoff for your awesome input as always. Calling the entry "Training Data Poisoning" as-is below for simplicity.

I am very much aligned with you and why I raised this. Since we are OWASP Top 10 for LLM Applications, I want to ensure we only cover scenario's where an LLM within an application is relevant to the "Training Data Poisoning" and not overlap with the MLSecOps Top 10 which as you mentioned, is traditional MLOps (misclassifications, misalignment etc). Whilst we care about Model Safety for our LLM, embeded within an application, it is not specific enough to our project in my opinion at the moment.

Before v1.1, I made sure to update the entry to include some examples of exactly where "training" can occur which let's us highlight where vulnerabilities in these stages relevant to us as well as "what is data/datasets?", so this could be a good segway without going too in-depth.

Major ➕ for Data and Model Poisoning renaming @jsotiro 🙏🏼 thank you! Great suggestion.

Please let me know or reference this issue in your PR so I can x-reference too and not ensure we overlap 🙂 In the meantime, I am working on a new draft proposal once we start v2 version control and i'll tag both of you to request your feedback.

GangGreenTemperTatum commented 7 months ago

Setup a poll here Slack thread

chiffa commented 7 months ago

I disagree. Training data poisoning attacks are currently trivial to implement and their mitigations are not yet routinely put in place. As such they deserve their top 3 position.

I understand the desire to widen the scope, but that would have high chances to confuse the end users, and overlap with other separate concerns:

Finite "gas" for fine tuning models, leading to a conflict between alignment axes (https://arxiv.org/abs/2212.08073). There are no known mitigations, nor even a consensus on how to prioritize axes, and given that it is usually done in-house, they are hard to attack (except for common SFT training datasets poisoning)
White-box inspection - proof doors (https://arxiv.org/abs/2204.06974) are still in-research, requiring both highly sophisticated attackers and ability to modify final model, making it a separate concern, that should not be in the top 10 for most LLM operators and developers
Backdoored data - if labeled properly - actually should increase the resilience of the model to the evasion attacks (which are closer to LLM001 - prompt injection anyway)

GangGreenTemperTatum commented 7 months ago

As such they deserve their top 3 position.

to be fully transparent, this is not the demotion of Training Data Poisoning from the LLM application top 10

but that would have high chances to confuse the end users

i agree we need to tread carefully here by ensuring the entry is sufficiently detailed with applicable criteria prior to official v2 release (which will come during our v2 sprints and cycles), but the idea is to educate the users on the fact that poisoning is not just data and it's not just data that comes through training (although that may currently be the majority)

i think another perspective is not to limit specifically to data in training, but also include finetuning and contextual chatbot application sessions - I.E Microsoft Tay was not just limited to only training data.

on another note:

i agree we need to tread carefully here by ensuring the entry is sufficiently detailed with applicable criteria prior to official v2 release

i'd love your review as i continue to finetune this if your happy and available to do so

chiffa commented 7 months ago

i think another perspective is not to limit specifically to data in training, but also include finetuning and contextual chatbot application sessions - I.E Microsoft Tay was not just limited to only training data.

You seem to be confusing pretraining with training. Generative pretraining is the step where massive texts are spread over thousands of GPUs burning millions in electricity. Fine-tuning, PEFT, RLHF, DPO, further pretraining - they all use smaller training datasets, but are still part of the training, given that the weights of the model are still modified

in short what you seem to want is "training dataset", and you seem to currently be confusing "training dataset" with "pretraining dataset".

Does that make sense?

GangGreenTemperTatum commented 7 months ago

You seem to be confusing pretraining with training. Generative pretraining is the step where massive texts are spread over thousands of GPUs burning millions in electricity. Fine-tuning, PEFT, RLHF, DPO, further pretraining - they all use smaller training datasets, but are still part of the training, given that the weights of the model are still modified

apologies if i wasn't clear, i totally get this and agree:

when we talk about LLM03 in it's current state(TLDR; v1.1 content with a different title):

the naming convention has been updated to cover both model and data poisoning
- I.E, i host a REST API and integrate some sort of open-source model which has been backdoored/poisioned - https://www.darkreading.com/application-security/hugging-face-ai-platform-100-malicious-code-execution-models
- this also covers both training and pre-training data workflows, artifacts and datasets
- we felt like beforehand, the entry was only pigeon-holed specifically on data (particularly pretraining)
the idea of the name change was to cover a broader spectrum of poisoning, including but not limited to the counts that you have explicitly called out

it's fair to say that the current state of the LLM03 entry is that the description doesn't yet fully mirror the title, this will be covered as we build more on v2.0 and gain traction with the cycle

hope that clears things up?

chiffa commented 7 months ago

.E, i host a REST API and integrate some sort of open-source model which has been backdoored/poisioned - https://www.darkreading.com/application-security/hugging-face-ai-platform-100-malicious-code-execution-models

Once again - these are three different concerns for me, that have different risks and mitigation strategies:

attacks on software stack running and distributing models => easy to implement, classical cyber-sec vectors; classical cyber-sec mitigations
malicious model backdoors on the weights level => reserved to advanced attackers with full model access, hard to diagnose, mitigation possible through data augmentation and weight fuzzing that is already recommended for model generalization
malicious data injection into training dataset => easy to implement, reasonably easy to make undetectable, pretty much impossible to detect on SotA LLM-scale datasets, especially for smaller developers, due to the need for manual analysis, or knowing what you are looking for in advance.

To me, 1 should not even be in the LLMs - it is a classical cyber-security attack with classical cyber-security mitigation, 2 and 3 are separate concerns, with 2 being a significantly smaller concern than 3 and requiring mitigation.

we felt like beforehand, the entry was only pigeon-holed specifically on data (particularly pretraining)

IMHO, it was pigeonholed for a reason (and if it was not, thank god it got), because it is a specific concern with specific threats, specific attacker capabilities assumptions, and specific mitigations.

Bundling it up with other concerns will make things significantly more confusing and less useful to ML stack or model developers.