No MLP block in THP's transformer

kevindoran commented 1 month ago

The shared EncoderLayer used by a few models, although I've only looked at TorchTHP, has a use_residual flag that defaults to False, and I don't think it is set to True on installation of TorchTHP, which should have both attention and MLP in each transformer layer.

When False the use_residual flag causes the feed-forward block to be skipped (torch_baselayer.py#L85).

A little test:

from easy_tpp.model import TorchTHP

class Struct:
    def __init__(self, **entries):
        self.__dict__.update(entries)

model_config = {
    "hidden_size": 512,
    "time_emb_size": 512,
    "use_norm": True,
    "use_ln": True,
    "num_layers": 4,
    "num_heads": 4,
    "dropout_rate": 0.1,
    # Errors if we don't define these:
    "loss_integral_num_sample_per_step": 10,
    "num_event_types": 1,
    "num_event_types_pad": 1,
    "thinning": False,
    "gpu": 0,
    "pad_token_id": 0}
TorchTHP(Struct(**model_config))

Outputs:

THP(
  (layer_type_emb): Embedding(1, 512, padding_idx=0)
  (layer_temporal_encoding): TimePositionalEncoding()
  (layer_intensity_hidden): Linear(in_features=512, out_features=1, bias=True)
  (softplus): Softplus(beta=1.0, threshold=20.0)
  (stack_layers): ModuleList(
    (0-3): 4 x EncoderLayer(
      (self_attn): MultiHeadAttention(
        (linears): ModuleList(
          (0-2): 3 x Linear(in_features=512, out_features=512, bias=True)
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
)

As can be seen, the EncoderLayer contains self_attn and nothing else.

iLampard commented 1 month ago

hi, let me have a look

iLampard commented 1 month ago

Hi,

i have add a feedforward layer into the Encoder block

self.feed_forward = nn.Sequential(
    nn.Linear(self.d_model, self.d_model * 2),
    nn.ReLU(),
    nn.Linear(self.d_model * 2, self.d_model)
)

self.stack_layers = nn.ModuleList(
    [EncoderLayer(
        self.d_model,
        MultiHeadAttention(self.n_head, self.d_model, self.d_model, self.dropout,
                           output_linear=False),

        use_residual=False,
        feed_forward=self.feed_forward,
        dropout=self.dropout
    ) for _ in range(self.n_layers)])

https://github.com/ant-research/EasyTemporalPointProcess/blob/main/easy_tpp/model/torch_model/torch_thp.py#L41

kevindoran commented 1 month ago

I see the MLP present by default now:

THP(
  (layer_type_emb): Embedding(1, 512, padding_idx=0)
  (layer_temporal_encoding): TimePositionalEncoding()
  (layer_intensity_hidden): Linear(in_features=512, out_features=1, bias=True)
  (softplus): Softplus(beta=1.0, threshold=20.0)
  (feed_forward): Sequential(
    (0): Linear(in_features=512, out_features=1024, bias=True)
    (1): ReLU()
    (2): Linear(in_features=1024, out_features=512, bias=True)
  )
  (stack_layers): ModuleList(
    (0-3): 4 x EncoderLayer(
      (self_attn): MultiHeadAttention(
        (linears): ModuleList(
          (0-2): 3 x Linear(in_features=512, out_features=512, bias=True)
        )
        (dropout): Dropout(p=0.1, inplace=False)
      )
      (feed_forward): Sequential(
        (0): Linear(in_features=512, out_features=1024, bias=True)
        (1): ReLU()
        (2): Linear(in_features=1024, out_features=512, bias=True)
      )
    )
  )
)

By the way, if you want to be able to reproduce the paper results, it is noteworthy that the THP paper had the inner dimension as being configurable: this is clearest by looking in the Supplemental where there are 3 parameter sets, and if I understand correctly, the inner dimension is not a hard-coded multiple of the model dimension. If the repo doesn't intend to reproduce the paper design, it's probably worth making a note somewhere in the documentation about this.

iLampard commented 1 month ago

Hi,

We are trying to reproduce the design and welcome issues to help us achieve it. We will try to make the configuration more flexible to accommodate various scenarios.

ant-research / EasyTemporalPointProcess

No MLP block in THP's transformer #39