Closed tretomaszewski closed 1 month ago
Abstract the initial projection and the complementary decomposition into two methods for reuse.
This is based on Nora Belrose's commentary on the semantic and mathematical accuracy of "orthogonalization": https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction?commentId=3R4bpQzr8nEauimSA
This is just function decomposition, but I wasn't able to test it directly. If you decide to merge, please first confirm that it produces the same results.
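For reviewers, here is a minimal sketch of the split described above, written in plain Python rather than against the repository's tensor code. The names `project_onto` and `orthogonal_complement` are hypothetical and are not the method names used in this PR; the sketch only assumes the usual definitions, where the projection of `v` onto a direction `d` is `(v·d / d·d) d` and the complementary part is `v` minus that projection.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_onto(v: Vector, direction: Vector) -> Vector:
    """Return the component of v that lies along direction.

    direction does not need to be unit length; the division by
    dot(direction, direction) handles the normalization.
    """
    denom = dot(direction, direction)
    if denom == 0.0:
        raise ValueError("direction must be non-zero")
    scale = dot(v, direction) / denom
    return [scale * d for d in direction]


def orthogonal_complement(v: Vector, direction: Vector) -> Vector:
    """Return v with its component along direction removed."""
    projection = project_onto(v, direction)
    return [x - p for x, p in zip(v, projection)]


if __name__ == "__main__":
    v = [3.0, 4.0]
    d = [1.0, 0.0]
    along = project_onto(v, d)          # part of v parallel to d
    rest = orthogonal_complement(v, d)  # part of v perpendicular to d
    print(along, rest, dot(rest, d))
```

The two pieces add back up to the original vector, and the second piece has zero dot product with the direction. Checking both properties on the real tensors would be one way to confirm the refactor produces the same results as before.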