Closed: jbloomAus closed this 1 month ago
This all looks very reasonable to me. I agree with everything in possible future changes. Creating an inner run_eval() for each eval sounds like a convenient way to enable use by others. Maybe give it inputs of something like (SAE object with encode() and decode(), HookedTransformer, layer_num, hook_point, eval_config, device)?
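A minimal sketch of what that inner signature could look like, assuming the inputs listed above; the names run_eval, SAEModule, and EvalConfig here are illustrative, not an agreed interface:

```python
from typing import Protocol

import torch
from transformer_lens import HookedTransformer


class SAEModule(Protocol):
    """Anything exposing encode() / decode() on activations at the hook point."""
    def encode(self, acts: torch.Tensor) -> torch.Tensor: ...
    def decode(self, feature_acts: torch.Tensor) -> torch.Tensor: ...


def run_eval(
    sae: SAEModule,            # any object with encode() / decode()
    model: HookedTransformer,  # the model the SAE was trained on
    layer_num: int,            # layer the SAE reads from
    hook_point: str,           # e.g. "blocks.3.hook_resid_post"
    eval_config: "EvalConfig", # per-eval hyperparameters (illustrative name)
    device: str = "cuda",
) -> dict:
    """Run one eval for one SAE and return its results as a plain dict."""
    ...
```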
I saw a force-rerun flag, which seems like something that every eval should have. Do we want to add cli flags for every config variable?
If all evals are going to have cached artifacts, we should make sure they are set up in a somewhat consistent way from eval to eval.
I have been committing the default config values to each eval config.
Were you planning on incorporating SAE Bench into the SAE Lens library, or having it be a standalone repo?
I'm thinking that the inner "run_eval()" function should return the sae_eval_result, does that seem reasonable? https://github.com/adamkarvonen/SAE_Bench_Template/pull/12/files#diff-0e1ff039f6f94eb8d2be782161e4cf3bd1f446e40d0520e064f2d6b683cd213cR119
Creating an inner run_eval() for each eval sounds like a convenient way to enable use by others. Maybe give it inputs of something like (SAE object with encode() and decode(), HookedTransformer, layer_num, hook_point, eval_config, device)?
This sounds good :) I can explain what else I'd add in the future, but it's not a big deal for now. It's probably best informed by seeing someone else try to use that inner function and seeing what they want / whether that differs much.
I saw a force-rerun flag, which seems like something that every eval should have. Do we want to add cli flags for every config variable?
I think so.
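For instance, a hedged sketch of how those flags could be generated from a committed config dataclass, with force_rerun alongside; the EvalConfig fields and flag names here are made up for illustration:

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class EvalConfig:
    # Illustrative fields; the real defaults live in each eval's committed config.
    random_seed: int = 42
    batch_size: int = 32
    context_length: int = 128


def build_parser(defaults: EvalConfig) -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--force_rerun", action="store_true",
                        help="Recompute results even if cached artifacts exist.")
    # Expose one flag per config field, defaulting to the committed value.
    for f in fields(defaults):
        value = getattr(defaults, f.name)
        parser.add_argument(f"--{f.name}", type=type(value), default=value)
    return parser


# e.g. args = build_parser(EvalConfig()).parse_args()
```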
If all evals are going to have cached artifacts, we should make sure they are set up in a somewhat consistent way from eval to eval.
This is a good suggestion. Most artefacts will be hook point specific, so what about all artefacts going to f"{artifact_dir}/{model}/{hook_point}/{artifact_id}"? Could also add {eval_type} somewhere in there. One thing to think about is whether you might generate two artefacts that would share a path in this scheme where you wouldn't want that to be the case, e.g. if you want to compare different probe hyperparameters and want both. Adding uuids to file names avoids overwriting, but then you start accumulating junk and can find it hard to find specific results you wanted.
I have been committing the default config values to each eval config.
Awesome. I will interpret defaults as what we should be using then.
Were you planning on incorporating SAE Bench into the SAE Lens library, or having it be a standalone repo?
@curt-tigges suggested we hold off on this, I think. It's reasonable but not required. Certainly if we decide we definitely won't, then we should add stuff like linters / formatters.
I'm thinking that the inner "run_eval()" function should return the sae_eval_result, does that seem reasonable? https://github.com/adamkarvonen/SAE_Bench_Template/pull/12/files#diff-0e1ff039f6f94eb8d2be782161e4cf3bd1f446e40d0520e064f2d6b683cd213cR119
I'd prefer it returned the "sae_results" object instead, since that object is more generic; the rest of sae_eval_result is tied to SAE Lens (which we want to avoid). Maybe this is what you meant?
I'm happy with the suggested changes. I'm only confused about the general structure of artifacts we should impose. What are the possible artifacts to store? Do we have to couple artifacts with the eval sweep itself? One alternative would be to precompute all artifacts before the eval sweep (e.g. uploading them to huggingface, just as we treat datasets) and consider artifacts fixed throughout evaluation.
If all evals are going to have cached artifacts, we should make sure they are set up in a somewhat consistent way from eval to eval.
This is a good suggestion. Most artefacts will be hook point specific, so what about all artefacts going to f"{artifact_dir}/{model}/{hook_point}/{artifact_id}"? Could also add {eval_type} somewhere in there. One thing to think about is whether you might generate two artefacts that would share a path in this scheme where you wouldn't want that to be the case, e.g. if you want to compare different probe hyperparameters and want both. Adding uuids to file names avoids overwriting, but then you start accumulating junk and can find it hard to find specific results you wanted.
The artifact folder structure looks good. Maybe add eval_type like this: f"{artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}". A possible way to store artifacts from different runs is to add an optional flag to set the artifact_dir, but otherwise it just defaults to something like "sae_bench_artifacts". Adding uuids sounds like it could quickly get very messy and use a lot of space if the artifacts are large.
But I think you had mentioned that you thought adding a flag for artifact dir was probably a bad idea? Not sure what the best path forward here is.
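For concreteness, a sketch of that layout under those assumptions; the helper name, the example values, and the default directory name are illustrative:

```python
import os


def artifact_path(
    eval_type: str,
    model_name: str,
    hook_point: str,
    artifact_id: str,
    artifact_dir: str = "sae_bench_artifacts",  # overridable via an optional CLI flag
) -> str:
    """Build the agreed {artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id} layout."""
    path = os.path.join(artifact_dir, eval_type, model_name, hook_point, artifact_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    return path


# e.g. artifact_path("absorption", "pythia-70m-deduped", "blocks.4.hook_resid_post", "probe_weights.pt")
```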
Were you planning on incorporating SAE Bench into the SAE Lens library, or having it be a standalone repo?
@curt-tigges suggested we hold off on this, I think. It's reasonable but not required. Certainly if we decide we definitely won't, then we should add stuff like linters / formatters.
Sounds good. Let me know if you have more updates on this, and agreed on adding linters / formatters if this is going to be a standalone repo.
I'd prefer it returned the "sae_results" object instead, since that object is more generic; the rest of sae_eval_result is tied to SAE Lens (which we want to avoid). Maybe this is what you meant?
sae_eval_result has some convenient metadata in it, but I agree that sae_results is a better object to return.
I'm happy with the suggested changes. I'm only confused about the general structure of artifacts we should impose. What are the possible artifacts to store? Do we have to couple artifacts with the eval sweep itself? One alternative would be to precompute all artifacts before the eval sweep (e.g. uploading them to huggingface, just as we treat datasets) and consider artifacts fixed throughout evaluation.
Most evals have some artifacts that could be generated. For absorption there's a bunch of linear probes, trained both on the model and individual SAEs. For unlearning there's the questions the model knows how to answer and sparsity info per SAE. Often these artifacts can differ based on eval hyperparameters or the model, so I think it's best to just let them be computed by users of the functions.
For sparse probing / SHIFT / TPP, I think that by default precomputed model activations should be saved artifacts as well. That way the inner run_eval() function just takes a single SAE object, and we only need to compute model activations once and then we can reuse them. I'm not a huge fan of caching files because it introduces state and the potential for bugs, but it seems like a relatively simple way to reuse computed results between different SAEs.
I think ideally if comparing a bunch of SAEs we would want to ensure that they are all being evaluated on the same artifacts.
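As a concrete illustration of that caching pattern, a minimal sketch with a force_rerun escape hatch; the function name and cache path handling are assumptions, not implemented code:

```python
import os

import torch
from transformer_lens import HookedTransformer


def get_model_activations(
    model: HookedTransformer,
    tokens: torch.Tensor,   # [batch, seq] token ids shared by every SAE in the sweep
    hook_point: str,
    cache_path: str,
    force_rerun: bool = False,
) -> torch.Tensor:
    """Load cached activations if present; otherwise compute once, save, and reuse across SAEs."""
    if os.path.exists(cache_path) and not force_rerun:
        return torch.load(cache_path)
    _, cache = model.run_with_cache(tokens, names_filter=hook_point)
    acts = cache[hook_point]
    os.makedirs(os.path.dirname(cache_path) or ".", exist_ok=True)
    torch.save(acts, cache_path)
    return acts
```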
I'm happy with the suggested changes. I'm only confused about the general structure of artifacts we should impose. What are the possible artifacts to store? Do we have to couple artifacts with the eval sweep itself? One alternative would be to precompute all artifacts before the eval sweep (e.g. uploading them to huggingface, just as we treat datasets) and consider artifacts fixed throughout evaluation.
Reasonable question. From Decode's side, we don't want to start managing / storing misc artifacts. This would introduce complexities that would distract from the core evals. However, the idea of pre-computing required results, like probes, is generally quite reasonable. In practice, separating generation of required artifacts will increase the amount of orchestration (SWE speak for organisation / moving things around) which we need to implement. I predict this may be worth it in the future, but suggest that the "generate if needed, or use from local if there" protocol is fine for the time being.
In an ideal world, we'd store these results for full reproducibility. One partial solution is to add anything small and important (eg: probe weights) to the eval_results json. We don't have to render everything we store, but don't want those jsons to get too big.
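For instance, a minimal sketch of that partial solution, assuming the results object is a plain dict and the probe weights are small enough to inline; names here are illustrative:

```python
import json

import numpy as np


def save_eval_results(results: dict, probe_weights: np.ndarray, out_path: str) -> None:
    """Attach small, important artefacts (e.g. probe weights) to the eval_results json."""
    results = dict(results)  # avoid mutating the caller's dict
    results["probe_weights"] = probe_weights.tolist()  # keep this small so the json stays readable
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```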
The artifact folder structure looks good. Maybe add eval_type like this: f"{artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}". A possible way to store artifacts from different runs is to add an optional flag to set the artifact_dir, but otherwise it just defaults to something like "sae_bench_artifacts". Adding uuids sounds like it could quickly get very messy and use a lot of space if the artifacts are large.
But I think you had mentioned that you thought adding a flag for artifact dir was probably a bad idea? Not sure what the best path forward here is.
@adamkarvonen I don't remember saying that... it's possible I'm forgetting a good reason not to add it. At present, I don't see a strong reason not to make an artifact dir arg.
sae_eval_result has some convenient metadata in it, but I agree that sae_results is a better object to return.
Agreed. If we get feedback around this from someone not using the CLI / SAE Lens route, we could discuss how to address it.
For sparse probing / SHIFT / TPP, I think that by default precomputed model activations should be saved artifacts as well. That way the inner run_eval() function just takes a single SAE object, and we only need to compute model activations once and then we can reuse them. I'm not a huge fan of caching files because it introduces state and the potential for bugs, but it seems like a relatively simple way to reuse computed results between different SAEs.
I think that sounds reasonable and not too hard to implement. Will summarise next steps and include this. Agree about state. Tracking versions of packages helps a bit.
I think ideally if comparing a bunch of SAEs we would want to ensure that they are all being evaluated on the same artifacts.
Agreed! We could also get a hash of any files that we want to ensure are common between runs but it might not be worth prioritising yet.
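If we do prioritise it later, the hashing side could be as simple as this sketch (helper name is illustrative):

```python
import hashlib


def file_sha256(path: str) -> str:
    """Hash a cached artefact so separate runs can confirm they used identical inputs."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```

Recording the hash alongside each SAE's results would let a comparison check that every SAE was evaluated against the same artefacts.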
The artifact folder structure looks good. Maybe add eval_type like this: f"{artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}". A possible way to store artifacts from different runs is to add an optional flag to set the artifact_dir, but otherwise it just defaults to something like "sae_bench_artifacts". Adding uuids sounds like it could quickly get very messy and use a lot of space if the artifacts are large.
But I think you had mentioned that you thought adding a flag for artifact dir was probably a bad idea? Not sure what the best path forward here is.
@adamkarvonen I don't remember saying that... it's possible I'm forgetting a good reason not to add it. At present, I don't see a strong reason not to make an artifact dir arg.
It's possible I misinterpreted this from your original post: "I was thinking about having a field for paths to artefacts to store but Johnny and I discussed this and decided this was a bad idea."
It's possible I misinterpreted this from your original post: "I was thinking about having a field for paths to artefacts to store but Johnny and I discussed this and decided this was a bad idea."
Oh, that was a field in the output json, designed to be used by orchestration code to collect / store artefacts besides the core json, which we think might not be a good idea. Maybe we can avoid confusion by distinguishing "artefacts" (any result that comes out of an eval, e.g. images or probes) from cached results (like probes or activations we plan to reuse). Having a path in the CLI input to say where to get the cached results seems good.
Claude Summary:
Summary:
- The PR proposes changes to the SAE Bench template to improve scripting and standardization of evals
- Key proposed changes include:
- New regex-based SAE selection protocol
- Standardized CLI interface for running evals
- Consistent JSON output format with 1:1 relationship to evaluated SAEs
- Structured artifact storage system
- Inner run_eval() function for each eval that's decoupled from SAE Lens
Main agreements from discussion:
- Team agrees with proposed changes and future improvements
- Artifact storage structure agreed: {artifact_dir}/{eval_type}/{model}/{hook_point}/{artifact_id}
- Precomputed model activations should be cached as artifacts for certain evals
- Default config values are being committed to each eval
- SAE Bench will remain a standalone repo for now (not integrated into SAE Lens)
Here are the suggested next steps I'd recommend:
Implementation Tasks:
- Create generic utils/tests to verify eval criteria compliance
- Update each eval to use common input/output format
- Implement consistent artifact caching system across evals
- Add CLI flags for all config variables
- Add linters/formatters to the repo since it's staying standalone (for now)
Documentation Updates:
- Document the standardized artifact storage structure
- Create guidelines for artifact caching and management
- Document the inner run_eval() function interface
Testing:
- Separate unit and integration tests
- Remove dependency on cached results in tests
- Add tests for artifact management system
Future Considerations:
- Monitor feedback on inner run_eval() function from non-SAE Lens users
- Consider implementing artifact version tracking system
- Evaluate need for hash verification of common artifacts between runs
@adamkarvonen Feel free to accept this PR. I'll start work on the tests which verify whether an eval meets the above criteria and then make PRs for each eval.
@curt-tigges Depending on the tasks in your queue, aiding with this could speed things up.
This PR demos a number of changes which I'd like to make to the template for SAE Bench.
Motivation
A number of nuances of the current SAE Bench template make it difficult to script running of the evals and use of their outputs.
Solution
A number of more or less superficial changes that we should perpetuate across all evals. I wanted to "do one myself" first to be sure that the changes made sense, and this is the result of that.
Requests from SAE Bench collaborators
Step 1. @adamkarvonen, @canrager, @curt-tigges Please read the rest of this since I'd like to confirm this makes sense / we can proceed. Comment on this thread and/or accept the PR. If there's anything controversial, I'm happy to get on a call. @chanind may also find it interesting.
Step 2. Once we're agreed, I can write a few utils / tests that are more generic which we can use to check these criteria are met by each eval.
Step 3. It won't take long, but happy to share the work to make each eval have the common input / output format. Most evals are pretty close already. They are mostly running fairly fast so I'm not concerned about run time for each.
Misc ask: If we haven't already got the preferred eval configs for each eval then we should commit those (maybe as the default values for each).
Details
Calling the Evals:
For example, we can call absorption with the following command
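A hypothetical invocation is sketched below; the script path, flag names, and regex values are illustrative assumptions rather than the finalised interface:

```shell
# Illustrative only: script path, flags, and regex patterns are assumptions.
python evals/absorption/main.py \
    --sae_regex_pattern "gemma-scope-2b-pt-res" \
    --sae_block_pattern "layer_3/width_16k/.*" \
    --force_rerun \
    --artifact_dir sae_bench_artifacts \
    --output_folder eval_results/absorption
```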
Output:
FAQ
How does this affect providing support for Non-SAE Lens SAEs?
To support non-SAE Lens SAEs, we need to get users to provide all of the info we usually get from SAE Lens which is a bit complicated. For example, if probes are hook point specific then we need to get that out of the SAE Config or pass it in. We also need to name inputs, ensure we're using the correct corresponding model. In general, SAE Lens config provides details needed to ensure you run the SAE correctly (eg: was it not trained on some token types or positions).
Mostly I think it's fine to make an inner "run_eval()" for each eval that doesn't care about all that, and assumes a class is instantiated / works correctly with encode / decode methods, but I did think it was worth explaining that, for the purposes of SAE Bench on Neuronpedia, it's highly preferable to assume we're using SAE Lens and leverage the config we've set up in that way.
If we've already made a few changes along these lines, my guess is they won't cause any big issues.
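To make the encode / decode assumption concrete, here is a hedged sketch of a wrapper a non-SAE Lens user might write; the class name, fields, and the standard ReLU SAE forms are illustrative, and the metadata fields stand in for what we would otherwise read from the SAE Lens config:

```python
from dataclasses import dataclass

import torch


@dataclass
class CustomSAE:
    """Illustrative non-SAE Lens SAE: supplies encode/decode plus the metadata
    (model name, hook point) that an SAE Lens config would normally provide."""
    W_enc: torch.Tensor  # [d_model, d_sae]
    W_dec: torch.Tensor  # [d_sae, d_model]
    b_enc: torch.Tensor  # [d_sae]
    b_dec: torch.Tensor  # [d_model]
    model_name: str
    hook_point: str

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        return torch.relu((acts - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, feature_acts: torch.Tensor) -> torch.Tensor:
        return feature_acts @ self.W_dec + self.b_dec
```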
Unfinished / Possible future changes:
Suggested tasks: