Closed: jbloomAus closed this 1 month ago
This all looks very reasonable to me. I agree with everything in possible future changes. Creating an inner run_eval() for each eval sounds like a convenient way to enable use by others. Maybe give it inputs of something like (SAE object with encode() and decode(), HookedTransformer, layer_num, hook_point, eval_config, device)?
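A minimal sketch of what that inner signature could look like, assuming the inputs listed above; the names run_eval, SAEModule, and EvalConfig here are illustrative, not an agreed interface:

```python
from typing import Protocol

import torch
from transformer_lens import HookedTransformer


class SAEModule(Protocol):
    """Anything exposing encode() / decode() on activations at the hook point."""
    def encode(self, acts: torch.Tensor) -> torch.Tensor: ...
    def decode(self, feature_acts: torch.Tensor) -> torch.Tensor: ...


def run_eval(
    sae: SAEModule,            # any object with encode() / decode()
    model: HookedTransformer,  # the model the SAE was trained on
    layer_num: int,            # layer the SAE reads from
    hook_point: str,           # e.g. "blocks.3.hook_resid_post"
    eval_config: "EvalConfig", # per-eval hyperparameters (illustrative name)
    device: str = "cuda",
) -> dict:
    """Run one eval for one SAE and return its results as a plain dict."""
    ...
```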
I saw a force-rerun flag, which seems like something that every eval should have. Do we want to add cli flags for every config variable?
If all evals are going to have cached artifacts, we should make sure they are set up in a somewhat consistent way from eval to eval.
I have been committing the default config values to each eval config.
Were you planning on incorporating SAE Bench into the SAE Lens library, or having it be a standalone repo?
I'm thinking that the inner "run_eval()" function should return the sae_eval_result, does that seem reasonable? https://github.com/adamkarvonen/SAE_Bench_Template/pull/12/files#diff-0e1ff039f6f94eb8d2be782161e4cf3bd1f446e40d0520e064f2d6b683cd213cR119
Creating an inner run_eval() for each eval sounds like a convenient way to enable use by others. Maybe give it inputs of something like (SAE object with encode() and decode(), HookedTransformer, layer_num, hook_point, eval_config, device)?
This sounds good :) I can explain what else I'd add in the future, but it's not a big deal for now. It's probably best informed by seeing someone else try to use that inner function and seeing what they want / whether that differs much.
I saw a force-rerun flag, which seems like something that every eval should have. Do we want to add cli flags for every config variable?
I think so.
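For instance, a hedged sketch of how those flags could be generated from a committed config dataclass, with force_rerun alongside; the EvalConfig fields and flag names here are made up for illustration:

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class EvalConfig:
    # Illustrative fields; the real defaults live in each eval's committed config.
    random_seed: int = 42
    batch_size: int = 32
    context_length: int = 128


def build_parser(defaults: EvalConfig) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--force_rerun", action="store_true",
                        help="Recompute results even if cached artifacts exist.")
    # Expose one flag per config field, defaulting to the committed value.
    for f in fields(defaults):
        value = getattr(defaults, f.name)
        parser.add_argument(f"--{f.name}", type=type(value), default=value)
    return parser


# e.g. args = build_parser(EvalConfig()).parse_args()
```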
If all evals are going to have cached artifacts, we should make sure they are set up in a somewhat consistent way from eval to eval.
This is a good suggestion. Most artefacts will be hook point specific, so what about all artefacts going to f"{artifact_dir}/{model}/{hook_point}/{artifact_id}"? Could also add {eval_type} somewhere in there. One thing to think about is whether you might generate two artefacts that would share a path in this scheme where you wouldn't want that to be the case, e.g. if you want to compare different probe hyperparameters and want both. Adding uuids to file names avoids overwriting, but then you start accumulating junk and can find it hard to find specific results you wanted.
I have been committing the default config values to each eval config.
Awesome. I will interpret defaults as what we should be using then.
Were you planning on incorporating SAE Bench into the SAE Lens library, or having it be a standalone repo?
@curt-tigges suggested we hold off on this, I think. It's reasonable but not required. Certainly if we decide we definitely won't, then we should add stuff like linters / formatters.
I'm thinking that the inner "run_eval()" function should return the sae_eval_result, does that seem reasonable? https://github.com/adamkarvonen/SAE_Bench_Template/pull/12/files#diff-0e1ff039f6f94eb8d2be782161e4cf3bd1f446e40d0520e064f2d6b683cd213cR119
I'd prefer it returned the "sae_results" object instead, since that object is more generic; the rest of sae_eval_result is tied to SAE Lens (which we want to avoid). Maybe this is what you meant?
I'm happy with the suggested changes. I'm only confused about the general structure of artifacts we should impose. What are the possible artifacts to store? Do we have to couple artifacts with the eval sweep itself? One alternative would be to precompute all artifacts before the eval sweep (e.g. uploading them to huggingface, just as we treat datasets) and consider artifacts fixed throughout evaluation.
If all evals are going to have cached artifacts, we should make sure they are set up in a somewhat consistent way from eval to eval.
This is a good suggestion. Most artefacts will be hook point specific, so what about all artefacts going to f"{artifact_dir}/{model}/{hook_point}/{artifact_id}"? Could also add {eval_type} somewhere in there. One thing to think about is whether you might generate two artefacts that would share a path in this scheme where you wouldn't want that to be the case, e.g. if you want to compare different probe hyperparameters and want both. Adding uuids to file names avoids overwriting, but then you start accumulating junk and can find it hard to find specific results you wanted.
The artifact folder structure looks good. Maybe add eval_type like this: f"{artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}". A possible way to store artifacts from different runs is to add an optional flag to set the artifact_dir, but otherwise it just defaults to something like "sae_bench_artifacts". Adding uuids sounds like it could quickly get very messy and use a lot of space if the artifacts are large.
But I think you had mentioned that you thought adding a flag for artifact dir was probably a bad idea? Not sure what the best path forward here is.
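For concreteness, a sketch of that layout under those assumptions; the helper name, the example values, and the default directory name are illustrative:

```python
import os


def artifact_path(
    eval_type: str,
    model_name: str,
    hook_point: str,
    artifact_id: str,
    artifact_dir: str = "sae_bench_artifacts",  # overridable via an optional CLI flag
) -> str:
    """Build the agreed {artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id} layout."""
    path = os.path.join(artifact_dir, eval_type, model_name, hook_point, artifact_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path


# e.g. artifact_path("absorption", "pythia-70m-deduped", "blocks.4.hook_resid_post", "probe_weights.pt")
```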
Were you planning on incorporating SAE Bench into the SAE Lens library, or having it be a standalone repo?
@curt-tigges suggested we hold off on this, I think. It's reasonable but not required. Certainly if we decide we definitely won't, then we should add stuff like linters / formatters.
Sounds good. Let me know if you have more updates on this, and agreed on adding linters / formatters if this is going to be a standalone repo.
I'd prefer it returned the "sae_results" object instead, since that object is more generic; the rest of sae_eval_result is tied to SAE Lens (which we want to avoid). Maybe this is what you meant?
sae_eval_result has some convenient metadata in it, but I agree that sae_results is a better object to return.
I'm happy with the suggested changes. I'm only confused about the general structure of artifacts we should impose. What are the possible artifacts to store? Do we have to couple artifacts with the eval sweep itself? One alternative would be to precompute all artifacts before the eval sweep (e.g. uploading them to huggingface, just as we treat datasets) and consider artifacts fixed throughout evaluation.
Most evals have some artifacts that could be generated. For absorption there's a bunch of linear probes, trained both on the model and individual SAEs. For unlearning there's the questions the model knows how to answer and sparsity info per SAE. Often these artifacts can differ based on eval hyperparameters or the model, so I think it's best to just let them be computed by users of the functions.
For sparse probing / SHIFT / TPP, I think that by default precomputed model activations should be saved artifacts as well. That way the inner run_eval() function just takes a single SAE object, and we only need to compute model activations once and then we can reuse them. I'm not a huge fan of caching files because it introduces state and the potential for bugs, but it seems like a relatively simple way to reuse computed results between different SAEs.
I think ideally if comparing a bunch of SAEs we would want to ensure that they are all being evaluated on the same artifacts.
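As a concrete illustration of that caching pattern, a minimal sketch with a force_rerun escape hatch; the function name and cache path handling are assumptions, not implemented code:

```python
import os

import torch
from transformer_lens import HookedTransformer


def get_model_activations(
    model: HookedTransformer,
    tokens: torch.Tensor,   # [batch, seq] token ids shared by every SAE in the sweep
    hook_point: str,
    cache_path: str,
    force_rerun: bool = False,
) -> torch.Tensor:
    """Load cached activations if present; otherwise compute once, save, and reuse across SAEs."""
    if os.path.exists(cache_path) and not force_rerun:
        return torch.load(cache_path)
    _, cache = model.run_with_cache(tokens, names_filter=hook_point)
    acts = cache[hook_point]
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    torch.save(acts, cache_path)
    return acts
```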
I'm happy with the suggested changes. I'm only confused about the general structure of artifacts we should impose. What are the possible artifacts to store? Do we have to couple artifacts with the eval sweep itself? One alternative would be to precompute all artifacts before the eval sweep (e.g. uploading them to huggingface, just as we treat datasets) and consider artifacts fixed throughout evaluation.
Reasonable question. From Decode's side, we don't want to start managing / storing misc artifacts. This would introduce complexities that would distract from the core evals. However, the idea of pre-computing required results, like probes, is generally quite reasonable. In practice, separating generation of required artifacts will increase the amount of orchestration (SWE speak for organisation / moving things around) which we need to implement. I predict this may be worth it in the future, but suggest that the "generate if needed, or use from local if there" protocol is fine for the time being.
In an ideal world, we'd store these results for full reproducibility. One partial solution is to add anything small and important (eg: probe weights) to the eval_results json. We don't have to render everything we store, but don't want those jsons to get too big.
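For instance, a minimal sketch of that partial solution, assuming the results object is a plain dict and the probe weights are small enough to inline; names here are illustrative:

```python
import json

import numpy as np


def save_eval_results(results: dict, probe_weights: np.ndarray, out_path: str) -> None:
    """Attach small, important artefacts (e.g. probe weights) to the eval_results json."""
    results = dict(results)  # avoid mutating the caller's dict
    results["probe_weights"] = probe_weights.tolist()  # keep this small so the json stays readable
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```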
The artifact folder structure looks good. Maybe add eval_type like this: f"{artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}". A possible way to store artifacts from different runs is to add an optional flag to set the artifact_dir, but otherwise it just defaults to something like "sae_bench_artifacts". Adding uuids sounds like it could quickly get very messy and use a lot of space if the artifacts are large.
But I think you had mentioned that you thought adding a flag for artifact dir was probably a bad idea? Not sure what the best path forward here is.
@adamkarvonen I don't remember saying that... it's possible I'm forgetting a good reason not to add it. At present, I don't see a strong reason not to make an artifact dir arg.
sae_eval_result has some convenient metadata in it, but I agree that sae_results is a better object to return.
Agreed. If we get feedback around this from someone not using the CLI / SAE Lens route, we could discuss how to address it.
For sparse probing / SHIFT / TPP, I think that by default precomputed model activations should be saved artifacts as well. That way the inner run_eval() function just takes a single SAE object, and we only need to compute model activations once and then we can reuse them. I'm not a huge fan of caching files because it introduces state and the potential for bugs, but it seems like a relatively simple way to reuse computed results between different SAEs.
I think that sounds reasonable and not too hard to implement. Will summarise next steps and include this. Agree about state. Tracking versions of packages helps a bit.
I think ideally if comparing a bunch of SAEs we would want to ensure that they are all being evaluated on the same artifacts.
Agreed! We could also get a hash of any files that we want to ensure are common between runs but it might not be worth prioritising yet.
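If we do prioritise it later, the hashing side could be as simple as this sketch (helper name is illustrative):

```python
import hashlib


def file_sha256(path: str) -> str:
    """Hash a cached artefact so separate runs can confirm they used identical inputs."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

Recording the hash alongside each SAE's results would let a comparison check that every SAE was evaluated against the same artefacts.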
The artifact folder structure looks good. Maybe add eval_type like this: f"{artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}". A possible way to store artifacts from different runs is to add an optional flag to set the artifact_dir, but otherwise it just defaults to something like "sae_bench_artifacts". Adding uuids sounds like it could quickly get very messy and use a lot of space if the artifacts are large.
But I think you had mentioned that you thought adding a flag for artifact dir was probably a bad idea? Not sure what the best path forward here is.
@adamkarvonen I don't remember saying that... it's possible I'm forgetting a good reason not to add it. At present, I don't see a strong reason not to make an artifact dir arg.
It's possible I misinterpreted this from your original post: "I was thinking about having a field for paths to artefacts to store but Johnny and I discussed this and decided this was a bad idea."
It's possible I misinterpreted this from your original post: "I was thinking about having a field for paths to artefacts to store but Johnny and I discussed this and decided this was a bad idea."
Oh, that was a field in the output json, designed to be used by orchestration code to collect / store artefacts besides the core json, which we think might not be a good idea. Maybe we can avoid confusion by distinguishing "artefacts" (any result that comes out of an eval, e.g. images or probes) from cached results (like probes or activations we plan to reuse). Having a path in the CLI input to say where to get the cached results seems good.
Claude Summary:
Summary:
- The PR proposes changes to the SAE Bench template to improve scripting and standardization of evals
- Key proposed changes include:
- New regex-based SAE selection protocol
- Standardized CLI interface for running evals
- Consistent JSON output format with 1:1 relationship to evaluated SAEs
- Structured artifact storage system
- Inner run_eval() function for each eval that's decoupled from SAE Lens
Main agreements from discussion:
- Team agrees with proposed changes and future improvements
- Artifact storage structure agreed: {artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}
- Precomputed model activations should be cached as artifacts for certain evals
- Default config values are being committed to each eval
- SAE Bench will remain a standalone repo for now (not integrated into SAE Lens)
Here are the suggested next steps I'd recommend:
Implementation Tasks:
- Create generic utils/tests to verify eval criteria compliance
- Update each eval to use common input/output format
- Implement consistent artifact caching system across evals
- Add CLI flags for all config variables
- Add linters/formatters to the repo since it's staying standalone (for now)
Documentation Updates:
- Document the standardized artifact storage structure
- Create guidelines for artifact caching and management
- Document the inner run_eval() function interface
Testing:
- Separate unit and integration tests
- Remove dependency on cached results in tests
- Add tests for artifact management system
Future Considerations:
- Monitor feedback on inner run_eval() function from non-SAE Lens users
- Consider implementing artifact version tracking system
- Evaluate need for hash verification of common artifacts between runs
@adamkarvonen Feel free to accept this PR. I'll start work on the tests which verify whether an eval meets the above criteria and then make PRs for each eval.
@curt-tigges Depending on the tasks in your queue, aiding with this could speed things up.
This PR demos a number of changes which I'd like to make to the template for SAE Bench.
Motivation
A number of nuances of the current SAE Bench template make it difficult to script running of the evals and use of their outputs.
Solution
A number of more or less superficial changes that we should perpetuate across all evals. I wanted to "do one myself" first to be sure that the changes made sense, and this is the result of that.
Requests from SAE Bench collaborators
Step 1. @adamkarvonen, @canrager, @curt-tigges Please read the rest of this since I'd like to confirm this makes sense / we can proceed. Comment on this thread and/or accept the PR. If there's anything controversial, I'm happy to get on a call. @chanind may also find it interesting.
Step 2. Once we're agreed, I can write a few utils / tests that are more generic which we can use to check these criteria are met by each eval.
Step 3. It won't take long, but happy to share the work to make each eval have the common input / output format. Most evals are pretty close already. They are mostly running fairly fast so I'm not concerned about run time for each.
Misc ask: If we haven't already got the preferred eval configs for each eval then we should commit those (maybe as the default values for each).
Details
Calling the Evals:
For example, we can call absorption with the following command
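A hypothetical invocation is sketched below; the script path, flag names, and regex values are illustrative assumptions rather than the finalised interface:

```shell
# Illustrative only: script path, flags, and regex patterns are assumptions.
python evals/absorption/main.py \
    --sae_regex_pattern "gemma-scope-2b-pt-res" \
    --sae_block_pattern "layer_3/width_16k/.*" \
    --force_rerun \
    --artifact_dir sae_bench_artifacts \
    --output_folder eval_results/absorption
```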
Output:
FAQ
How does this affect providing support for Non-SAE Lens SAEs?
To support non-SAE Lens SAEs, we need to get users to provide all of the info we usually get from SAE Lens which is a bit complicated. For example, if probes are hook point specific then we need to get that out of the SAE Config or pass it in. We also need to name inputs, ensure we're using the correct corresponding model. In general, SAE Lens config provides details needed to ensure you run the SAE correctly (eg: was it not trained on some token types or positions).
Mostly I think it's fine to make an inner "run_eval()" for each eval that doesn't care about all that, and assumes a class is instantiated / works correctly with encode / decode methods, but I did think it was worth explaining that, for the purposes of SAE Bench on Neuronpedia, it's highly preferable to assume we're using SAE Lens and leverage the config we've set up in that way.
If we've already made a few changes along these lines, my guess is they won't cause any big issues.
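To make the encode / decode assumption concrete, here is a hedged sketch of a wrapper a non-SAE Lens user might write; the class name, fields, and the standard ReLU SAE forms are illustrative, and the metadata fields stand in for what we would otherwise read from the SAE Lens config:

```python
from dataclasses import dataclass

import torch


@dataclass
class CustomSAE:
    """Illustrative non-SAE Lens SAE: supplies encode/decode plus the metadata
    (model name, hook point) that an SAE Lens config would normally provide."""
    W_enc: torch.Tensor  # [d_model, d_sae]
    W_dec: torch.Tensor  # [d_sae, d_model]
    b_enc: torch.Tensor  # [d_sae]
    b_dec: torch.Tensor  # [d_model]
    model_name: str
    hook_point: str

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        return torch.relu((acts - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, feature_acts: torch.Tensor) -> torch.Tensor:
        return feature_acts @ self.W_dec + self.b_dec
```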
Unfinished / Possible future changes:
Suggested tasks: