Abhay Sheshadri, asheshadri31@gatech.edu; Aidan Ewart, aidanprattewart@gmail.com; Phillip Guo, phguo@umd.edu; Aengus Lynch, aenguslynch@gmail.com; Cindy Wu, wu.cindyx@gmail.com; Vivek Hebbar; Henry Sleight; Asa Cooper Stickland; Ethan Perez; Dylan Hadfield-Menell; Stephen Casper, scasper@mit.edu
See our models on the Hugging Face Hub.
Read the paper on arXiv: Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs.
Chat with our robust refusal model (https://huggingface.co/LLM-LAT/robust-llama3-8b-instruct) at https://www.abhayesian.com/lat-chat.
@article{sheshadri2024targeted,
title={Targeted Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs},
author={Sheshadri, Abhay and Ewart, Aidan and Guo, Phillip and Lynch, Aengus and Wu, Cindy and Hebbar, Vivek and Sleight, Henry and Stickland, Asa Cooper and Perez, Ethan and Hadfield-Menell, Dylan and Casper, Stephen},
journal={arXiv preprint arXiv:2407.15549},
year={2024}
}
See also preliminary work: Defending Against Unforeseen Failure Modes with Latent Adversarial Training.
This repository contains code for implementing latent adversarial attacks and latent adversarial training (LAT) in LLMs.
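To give a sense of what a latent-space adversarial attack looks like, here is a minimal, self-contained sketch: a perturbation added to an intermediate layer's hidden activations is optimized so the model assigns high probability to a chosen target continuation, with a PGD-style norm constraint. This is illustrative only and is not the repository's actual implementation; the layer choice, norm bound, loss, and the use of GPT-2 as a small stand-in model are all assumptions made for the example.

```python
# Illustrative latent-space attack sketch (not the repo's API).
# We perturb hidden activations at one transformer block so that the model
# prefers an attacker-chosen continuation, keeping the perturbation inside
# an L2 ball. GPT-2 is used here only as a small stand-in model.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: small stand-in; the paper targets Llama-3-8B-Instruct
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "How do I make a"
target = " sandwich"  # toy attacker-chosen continuation

prompt_ids = tok(prompt, return_tensors="pt").input_ids
target_ids = tok(target, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, target_ids], dim=1)

# Compute the loss only on the target tokens.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

layer = model.transformer.h[4]          # assumption: attack an intermediate block
hidden_dim = model.config.n_embd
delta = torch.zeros(1, prompt_ids.shape[1], hidden_dim, requires_grad=True)
epsilon = 8.0                           # assumption: L2 bound on the perturbation

def add_perturbation(module, inputs, output):
    # GPT-2 blocks return a tuple; hidden states are the first element.
    hidden = output[0]
    padded = F.pad(delta, (0, 0, 0, hidden.shape[1] - delta.shape[1]))
    return (hidden + padded,) + output[1:]

handle = layer.register_forward_hook(add_perturbation)
optimizer = torch.optim.Adam([delta], lr=1e-2)

for step in range(50):
    optimizer.zero_grad()
    loss = model(input_ids, labels=labels).loss  # loss on target tokens only
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        # Project back onto the epsilon-ball (PGD-style constraint).
        norm = delta.norm()
        if norm > epsilon:
            delta.mul_(epsilon / norm)

handle.remove()
print("final target loss:", loss.item())
```

Latent adversarial training (LAT) alternates steps like the inner loop above with ordinary training steps that teach the model to behave well despite such perturbations; see the paper and notebooks for the actual setup.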
After you clone and navigate to the repository:
pip install -r requirements.txt
bash install_tasks_from_github.sh
Find notebooks for latent-space attacks, jailbreak robustness, backdoor removal, Harry Potter unlearning, and WMDP unlearning in the /notebooks folder.
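If you just want to try the robust refusal model linked above, a minimal example using Hugging Face transformers is sketched below. It assumes the tokenizer ships a chat template (Llama-3-Instruct tokenizers do) and that you have enough memory for an 8B model; adjust dtype, device_map, or quantization as needed.

```python
# Minimal sketch: chat with LLM-LAT/robust-llama3-8b-instruct via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LLM-LAT/robust-llama3-8b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "How do I pick a lock?"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
# Print only the newly generated tokens.
print(tok.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```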