[Proposal] Support multiple nodes per hook point

Currently we implicitly assume one node per hook point; however, this assumption does not always hold.

For attention src_nodes, we have n_heads nodes per layer, but their activations are all stored at the same hook point: blocks.{layer}.attn.hook_result

Proposed implementation:

SrcNodes should be able to define custom logic for how they extract their activation from their specified hook point(s).
- Define a SrcNode.get_act method, which accepts the tensor from the node's input hook point and returns a tensor of shape (..., d_model)
- For attention nodes, this will just amount to slicing the original tensor
- For other kinds of nodes, there may be nontrivial computation happening here (e.g. for SAE nodes)
The cache should store one activation and one gradient per hook point, instead of one per node.
- We will need to lightly rewrite the indexing and expected cache tensor shapes to support this.
Add functionality to convert between the per-hook cache and the per-node cache.
- We can start off with a for-loop
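
A minimal sketch of the get_act idea and the for-loop conversion, under stated assumptions: the class names SrcNode and AttentionSrcNode, the fields name, hook, and head_index, and the helper per_hook_to_per_node are all hypothetical and not existing sae-eap API. NumPy arrays stand in for torch tensors here; the same indexing works on torch tensors. The hook activation is assumed to have shape (batch, pos, n_heads, d_model).

```python
from dataclasses import dataclass

import numpy as np  # stand-in for torch; the indexing below also works on torch tensors


@dataclass(frozen=True)
class SrcNode:
    """Hypothetical base class: a node that reads from a single hook point."""

    name: str
    hook: str

    def get_act(self, hook_act):
        # Default behaviour: the node owns the whole hook activation.
        return hook_act


@dataclass(frozen=True)
class AttentionSrcNode(SrcNode):
    """Hypothetical attention-head node: one of n_heads nodes sharing a hook."""

    head_index: int = 0

    def get_act(self, hook_act):
        # Assumed input shape: (batch, pos, n_heads, d_model).
        # Slicing out one head leaves (batch, pos, d_model).
        return hook_act[..., self.head_index, :]


def per_hook_to_per_node(nodes, cache_per_hook):
    """Convert a per-hook cache into a per-node cache with a plain for-loop."""
    cache_per_node = {}
    for node in nodes:
        cache_per_node[node.name] = node.get_act(cache_per_hook[node.hook])
    return cache_per_node


# Toy example: four attention heads sharing one hook point.
batch, pos, n_heads, d_model = 2, 3, 4, 8
hook = "blocks.0.attn.hook_result"
acts_per_hook = {hook: np.random.rand(batch, pos, n_heads, d_model)}
nodes = [
    AttentionSrcNode(name=f"head.0.{h}", hook=hook, head_index=h)
    for h in range(n_heads)
]
node_acts = per_hook_to_per_node(nodes, acts_per_hook)
```

The same helper could be applied to the per-hook gradients, since get_act only slices. For nodes whose get_act does nontrivial computation (e.g. SAE nodes), gradients may need separate handling.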

Example code:

# Compute per-hook activations, gradients.  
acts_per_hook, grads_per_hook = compute_activations_and_gradients_simple(
    model, handler
)

# Convert this to per-node acts, grads. 
graph_acts, graph_grads = ... 

scores = compute_attribution_scores(
    graph_acts, graph_grads, model.cfg, aggregation=aggregation
)

