FailSpy / abliterator

Simple Python library/structure to ablate features in LLMs which are supported by TransformerLens
MIT License

Model requests #14

Open FailSpy opened 1 month ago

FailSpy commented 1 month ago

This is an issue to collect requests for model abliterations.

No one is required to abliterate your request, but this thread is a good place to check whether someone else has already applied the process to the model you're looking for, or whether you should do it yourself!

emassey0135 commented 1 month ago

Since you've uncensored Phi-3-vision using your abliteration technique, I think it would be useful to do the same thing to the new LLaVa 1.6 models. The new LLaVa models released on May 10th are an 8B model based on LLaMa3, and 70B and 110B models based on Qwen-1.5, and their benchmark results are comparable to, and sometimes higher than, those for Phi-3-vision. Also, I would expect LLaVa 1.6 and Phi-3-vision to have different strengths and weaknesses because of the different datasets they were trained on, and thus to be most suitable for different use cases.

Another reason this would be useful is that it would be interesting to see whether LLaVa 1.6 can give better descriptions of certain types of images for which the unabliterated models were trained to refuse, because of its larger training dataset and possibly the greater reasoning power of the larger text models it is based on. That is, I can imagine two things that could happen when you ask an abliterated vision model to describe an image containing things the unabliterated models were trained to refuse.

One possibility is that it would neither describe these elements of the image nor refuse, but simply ignore them, say it doesn't know, or interpret the image as something "harmless" that looks similar but may be obviously different to a human. This would mean the model has no visual knowledge of the themes it was trained to refuse, beyond knowing that they are considered harmful.

The second possibility is that it would correctly describe the elements it would normally refuse, which would mean it does have knowledge of these themes and how they are visually represented. If that is the case, it must be either because they were represented directly in its training data, or because its knowledge of the visual characteristics of "harmless" things, combined with its knowledge of when to refuse, is enough to describe those refused things correctly. That would be a very interesting finding, because it would mean this abliteration process may allow a model to use parts of itself that it couldn't before, letting it reach its full potential.

Either result might also demonstrate something useful about the limitations of the reasoning abilities of LLMs, and their ability to understand things beyond their training data by applying abstract concepts. Perhaps abliterated vision models will fall somewhere between these two extremes, landing closer to one or the other for different images.

FailSpy commented 1 month ago

Whilst TransformerLens doesn't have support for LLaVa directly, ultimately LLaVa is just LLaMa under the hood. The way I abliterated Phi-3-vision was by abliterating the language model and leaving the vision encoding layers alone.

There's nothing stopping one from doing the same with LLaVa, other than a bit of hacking to load the layers into TransformerLens correctly.

This is a pretty shallow implementation to start with, but I think it's the way to start. You would probably want the chat model to be "abliterated" first, so that you can properly probe the vision model layers and see whether there's any kind of "censoring" behaviour built into the vision encoding model.
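For concreteness, here is a rough, untested sketch of what that layer-loading hack might look like for the LLaMa3-based LLaVa 1.6 model. The Hugging Face repo id and the `language_model` attribute are assumptions based on the transformers LLaVa-NeXT classes, so adjust them for whichever checkpoint you actually use:

```python
# Rough sketch (untested): extract the LLaMa-3 language model from a LLaVa 1.6
# checkpoint and load it into TransformerLens, leaving the vision tower alone.
import torch
from transformers import AutoTokenizer, LlavaNextForConditionalGeneration
from transformer_lens import HookedTransformer

LLAVA_ID = "llava-hf/llama3-llava-next-8b-hf"  # hypothetical repo id for the 8B LLaVa 1.6 model

llava = LlavaNextForConditionalGeneration.from_pretrained(
    LLAVA_ID, torch_dtype=torch.bfloat16
)

# The text side of LLaVa is an ordinary LlamaForCausalLM, so TransformerLens can
# convert it by treating it as the base LLaMa-3 instruct architecture.
# Note: if the checkpoint extended the vocab with an <image> token, the embedding
# sizes may need trimming to match the base config before conversion succeeds.
hooked = HookedTransformer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",  # architecture TransformerLens already knows
    hf_model=llava.language_model,          # weights taken from the LLaVa checkpoint
    tokenizer=AutoTokenizer.from_pretrained(LLAVA_ID),
    dtype=torch.bfloat16,
)

# `hooked` can then go through the usual abliteration steps; the vision encoder
# and multimodal projector in `llava` stay untouched, and the modified language
# model weights can be copied back afterwards.
```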

java-batista commented 1 month ago

@FailSpy I would like to request an abliteration for aya-23-8B. I would greatly appreciate it if possible.

The model is particularly optimized for multilinguality and supports 23 languages; I believe it can be useful not only for me but also for other people for whom English is not their first language.

I know that GPU compute isn't free. If possible, I would like to know how long it takes to complete the process on an 8B model. I intend to try it myself in the future and share the result with the community.

Thank you in advance.