dohlee / chromoformer

The official PyTorch implementation of Chromoformer (Lee et al., Nature Communications, 2022).
GNU General Public License v3.0

Replicating self-attention maps from Fig. 4a of publication #10

Open Al-Murphy opened 2 months ago

Al-Murphy commented 2 months ago

Hi,

I'm trying to replicate the Chromoformer self-attention maps analysed in Fig. 4a of your publication. The relevant description in the Results is:

the attention weights produced by the Embedding transformer of Chromoformer-clf during the prediction were visualized to analyze the internal behavior of the model.

Two attention heads are used for this. However, only a single attention module is visible, with no obvious per-head split, in the printed layers of a trained clf model from the GitHub demo's embedding transformers (the 2000-resolution transformer is shown below):

# assumes ChromoformerClassifier has already been imported from the chromoformer package, as in the demo
seed = 123
bsz = 32
i_max = 8
w_prom = 40000
w_max = 40000
n_feats = 7
d_emb = 128
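# embedding transformer settings -- note n_heads=2, the two heads visualized in Fig. 4a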
embed_kws = {
    "n_layers": 1,
    "n_heads": 2,
    "d_model": 128,
    "d_ff": 128,
}
pairwise_interaction_kws = {
    "n_layers": 2,
    "n_heads": 2,
    "d_model": 128,
    "d_ff": 256,
}
regulation_kws = {
    "n_layers": 6,
    "n_heads": 8,
    "d_model": 256,
    "d_ff": 256,
}
d_head = 128
model_clf = ChromoformerClassifier(
    n_feats, d_emb, d_head, embed_kws, pairwise_interaction_kws, regulation_kws, seed=seed
)
model_clf

Output (partial, showing just the 2000-resolution embedding transformer):

ChromoformerBase(
  (embed): ModuleDict(
    (2000): EmbeddingTransformer(
      (lin_proj): Linear(in_features=7, out_features=128, bias=False)
      (transformer): Transformer(
        (layers): ModuleList(
          (0): AttentionBlock(
            (self_att): MultiHeadAttention(
              (w_bias): Linear(in_features=2, out_features=2, bias=False)
              (att): Linear(in_features=128, out_features=384, bias=False)
              (ff): Linear(in_features=128, out_features=128, bias=True)
              (ln): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
            )
            (ff): FeedForward(
              (l1): Linear(in_features=128, out_features=128, bias=True)
              (l2): Linear(in_features=128, out_features=128, bias=True)
              (ln): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
            )
          )
        )
      )
    )
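
If I'm reading the printed sizes right, the two heads are folded into that single att projection rather than appearing as separate submodules. A quick sanity check on the shapes (the variable names below are my own; the numbers come from the printed tree and the config above, and the per-head split is just my guess at what happens inside forward()):

d_model = 128
n_heads = embed_kws["n_heads"]   # 2, from the config above
att_out = 384                    # printed att Linear: 128 -> 384
assert att_out == 3 * d_model    # stacked Q, K, V projections for all heads together
d_k = d_model // n_heads         # presumably 64 dims per head, split inside forward()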

I've tried an approach like this using register_forward_hook() (see the sketch below), but given that only one attention module appears in the printed model layers, I can only capture the output of model.embed2000.transformer.layers[0].self_att or of model.embed2000.transformer.layers[0].self_att.att. How did you obtain the two attention matrices shown in the publication from this? Did you use self_att, specifically the self_att.att matrix, or something else?
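
For reference, here is roughly the hook I've been registering (a minimal sketch; the module path follows the printed tree above and my model_clf object, and captured / save_output are just my own names, so attribute names may need adjusting):

captured = {}

def save_output(name):
    def hook(module, inputs, output):
        # a plain forward hook stores whatever self_att returns (its output
        # features, as far as I can tell) -- not the per-head attention weights
        captured[name] = output
    return hook

att_module = model_clf.embed2000.transformer.layers[0].self_att
handle = att_module.register_forward_hook(save_output("embed2000_layer0_self_att"))

# ...run a forward pass over a demo batch here, then inspect `captured`...
# handle.remove()  # detach the hook when finished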

Thanks!