FailSpy / abliterator

Simple Python library/structure to ablate features in LLMs which are supported by TransformerLens
MIT License

Understanding Current Project Workflow and Contribution Directions #7

Open tretomaszewski opened 4 months ago

tretomaszewski commented 4 months ago

Hello,

Thanks for creating this! I'm researching induction of behavioral change and was working on a generalized version for a project. I figured I might as well help out on an established repo.

I'm interested in contributing: documenting, and helping to refactor and generalize the code, potentially towards a proper package. Since this is a bit of a hybrid and preliminary repo, working towards a more consistent style would also be nice.

Before I start, I'd like to know a bit more about some things:

  1. You mention your personal workflow. Would you provide some more information regarding this? I understand that you are using it to create ablated model weights, but knowing the process will help with documentation, commenting, and restructuring.

  2. What are some of your future plans? I noticed generalization being a big one, which is a goal I share. Since this process can be framed as contrastive search and edit, this should be decently straightforward. Though it will require a few modifications which might disturb your current workflow...

  3. I notice there are more than a few apparently unused methods, and some variables that either need to be defined or need a reference to their instance (self). (See the method calculate_mean_dirs and the variable direction.) I understand some methods to be remnants of earlier stages of this process, which may or may not be needed depending on your current workflow.

  4. Are there any specific areas that you'd like to focus contributions on first?

  5. Do you have a preference for code or docstring styles?
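
To make the "contrastive search and edit" framing from point 2 concrete, here is a minimal sketch of the idea. It is a toy under stated assumptions, not code from this repo: plain Python lists stand in for model activations, and the names mean_direction and ablate are hypothetical. The search step takes the difference of the mean activations over two contrasting prompt sets, and the edit step projects that direction out of an activation.

```python
import math

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def mean_direction(acts_a, acts_b):
    """Search step (hypothetical helper): unit vector along the
    difference of the two sets' mean activations.
    Assumes the two means differ, so the norm is non-zero."""
    diff = [a - b for a, b in zip(mean(acts_a), mean(acts_b))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def ablate(activation, direction):
    """Edit step (hypothetical helper): remove the component of
    `activation` that lies along the unit vector `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy data standing in for activations from two contrasting prompt sets.
acts_a = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
acts_b = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = mean_direction(acts_a, acts_b)   # [1.0, 0.0, 0.0]
edited = ablate([5.0, 2.0, 1.0], direction)  # [0.0, 2.0, 1.0]
```

Nothing in the sketch is specific to refusal: any pair of contrasting prompt sets yields a direction, which is what would make the approach generalizable.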

Looking forward to contributing!

FailSpy commented 4 months ago

Oh wow, thanks for all this! I hadn't announced this anywhere so wasn't expecting much attention on it yet.

  1. Probably the best way to express this is for me to soon provide a few toy-scripts/notebooks that show an idea of isolated concepts in using this library.

  2. Don't worry about disturbing my workflow, as the goal of this project is to have a library that minimizes the "need" for a workflow and instead offers a nice, straightforward process one can run. Generalization is a huge part of the push to a library, from dealing with different models to potentially dealing with multidimensional features (if such a thing is possible).

  3. You'd be correct that most of that is remnants of past lives of my personal script.

  4a. Documentation, which is kind of on me.
  4b. Finding better techniques/methods that require less human intervention.
  4c. Improving compatibility with the transformers space.
  4d. Improving memory usage and optimizing.
  4e. HF model export process.

  5. Nope, open to suggestions.

tretomaszewski commented 4 months ago

Admittedly, I looked up your GitHub after seeing the models on Reddit/Hugging Face. I was happy to see this!

Good to know about the workflow, flexibility, and your ambitions. I can think of several uses besides simple refusal ablation, especially with inference-time interventions.

Would you clarify "transformers space" in point (4c)?