FailSpy / abliterator

Simple Python library/structure to ablate features in LLMs which are supported by TransformerLens
MIT License

Incorporate "directional_hook" into class as a method. #16

Closed: tretomaszewski closed this 1 month ago

tretomaszewski commented 1 month ago

Abstract the initial projection and the complementary decomposition into two methods for reuse.

This is based on Nora Belrose's commentary on the semantic and mathematical accuracy of "orthogonalization": https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction?commentId=3R4bpQzr8nEauimSA
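
For readers unfamiliar with the terms, the split described above can be sketched roughly as follows. This is a minimal illustration only, not the code in this PR: the function names `project_onto` and `reject_from` are hypothetical, and plain Python lists stand in for the tensors a TransformerLens hook would actually receive.

```python
def project_onto(activation, direction):
    """Return the component of `activation` that lies along `direction`.

    Hypothetical helper: computes (a . d / d . d) * d, so `direction`
    does not need to be unit length.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        raise ValueError("direction must be non-zero")
    scale = sum(a * d for a, d in zip(activation, direction)) / norm_sq
    return [scale * d for d in direction]


def reject_from(activation, direction):
    """Return the part of `activation` orthogonal to `direction`.

    Hypothetical helper: this is the complement of the projection,
    i.e. activation minus its component along `direction`.
    """
    projection = project_onto(activation, direction)
    return [a - p for a, p in zip(activation, projection)]


activation = [3.0, 4.0]
direction = [1.0, 0.0]
parallel = project_onto(activation, direction)   # [3.0, 0.0]
orthogonal = reject_from(activation, direction)  # [0.0, 4.0]
```

Keeping the two steps separate means the projection can be reused on its own, while ablating a direction amounts to keeping only the orthogonal part.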

tretomaszewski commented 1 month ago

This is just function decomposition, but I wasn't able to test it directly. If you decide to merge, please first confirm that it produces the same results as the current implementation.