FailSpy / abliterator

Simple Python library/structure to ablate features in LLMs which are supported by TransformerLens
MIT License
296 stars, 38 forks

Multiple abliteration / steering presets? #20

Open Skorchekd opened 4 months ago

Skorchekd commented 4 months ago

Perhaps there could be preset configs in the code that steer the model towards certain things, for example different personalities or different emotions? Just an idea I had. Very cool, though!

tretomaszewski commented 4 months ago

You can find a notebook for a non-refusal use-case here: https://huggingface.co/failspy/Llama-3-8B-Instruct-MopeyMule/blob/main/MopeyMule-Induce-Melancholy.ipynb

Of course, you'll need to adjust to your needs.

The "refusal" / "harmful" / "harmless" terminology in this library can stand for whatever behavior you want to ablate. That is, you want non-"refusal" responses to whatever you decide is a "harmful" prompt, where "refusal" is simply what you don't want to see for a given prompt. This requires two datasets of polarized/opposite prompts.
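To make the two-dataset idea concrete, here is a minimal sketch in plain Python. It is not abliterator's API: the function names and the toy 3-dimensional "activations" are hypothetical, and it assumes the usual difference-of-means approach, where the behavior direction is the normalized gap between the mean activations of the two prompt sets and ablation means projecting that direction out.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def behavior_direction(unwanted_acts, contrast_acts):
    """Unit vector pointing from the contrast mean to the unwanted-behavior mean."""
    diff = [u - c for u, c in zip(mean_vector(unwanted_acts), mean_vector(contrast_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two datasets have identical mean activations")
    return [d / norm for d in diff]


def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    dot = sum(a * d for a, d in zip(activation, direction))
    return [a - dot * d for a, d in zip(activation, direction)]


# Toy stand-ins for cached activations (hypothetical values).
# In practice these would be captured from the model for each prompt set.
unwanted_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]  # the "harmful" prompt set
contrast_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # the "harmless" prompt set

direction = behavior_direction(unwanted_acts, contrast_acts)
cleaned = ablate([5.0, 2.0, 1.0], direction)
```

With a real model, the same projection would be applied to the activations (or folded into the weights) at the layers where the behavior shows up; which layers work best is something you would have to find by experiment.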

Alternatively, as shown in the notebook above, you can also use a special system prompt.

Eventually we hope to change the terminology towards a general behavioral-ablation use-case.

Most of this is still very exploratory and, at best, experimental. If you find anything of interest, let us know!

Skorchekd commented 3 months ago

Doesn't work... does it need a GPU?