SimiaoZuo / Transformer-Hawkes-Process

Code for Transformer Hawkes Process, ICML 2020.
MIT License

Should event likelihood be computed using current or last hidden state? #10

Open mistycheney opened 2 years ago

mistycheney commented 2 years ago

Suppose the transformer hidden state at event i is h_i, should the likelihood of this event be computed using h_i or h_{i-1}?

Using h_{i-1} makes more sense to me, because this encourages the model to assign high intensity to the true next event and therefore to learn to forecast.

But the implementation and the paper seem to use h_i. The problem is that, since the transformer is given the true event i as part of the input, it can simply learn to output an infinitely high intensity for the correct event type in order to maximize the likelihood, yet the learned model will have no predictive power.

I feel I must have missed something. Any clarification is appreciated. Thanks.

AnthonyChouGit commented 2 years ago

I have the same question as @mistycheney. In this piece of code, the likelihood is calculated using h_i, which has already encoded the i-th event. This leads the model to maximize the likelihood of the i-th event's type and to minimize the likelihood of all other event types at that point. Does this explain the dramatic decrease in negative log-likelihood presented in the paper (Table 4)? I think this part of the code may not be written correctly.
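To make the alternative concrete, here is a minimal sketch of computing each event's intensity from the previous hidden state, assuming the usual linear-plus-softplus intensity head; `enc_output`, `linear`, and `type_mask` are illustrative names, not necessarily the repo's:

```python
import torch
import torch.nn.functional as F

def event_intensity_prev_state(enc_output, linear, type_mask):
    """Intensity of event i computed from h_{i-1} instead of h_i.

    enc_output: [BATCH, SEQ_LEN, D_MODEL]   transformer hidden states h_1 .. h_L
    linear:     nn.Linear mapping D_MODEL -> NUM_TYPES
    type_mask:  [BATCH, SEQ_LEN, NUM_TYPES] one-hot ground-truth types
    """
    # position i now sees h_{i-1}; the first event has no history and is dropped here
    prev_hidden = enc_output[:, :-1, :]                       # h_1 .. h_{L-1}
    all_lambda = F.softplus(linear(prev_hidden))              # [BATCH, SEQ_LEN-1, NUM_TYPES]
    # keep the intensity of the type that actually occurred at events 2 .. L
    type_lambda = torch.sum(all_lambda * type_mask[:, 1:, :], dim=2)
    return type_lambda                                        # [BATCH, SEQ_LEN-1]
```

The first event has no history under this shift, so it is simply dropped here; one could also prepend a learned initial state.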

waystogetthere commented 1 year ago

Exactly, I think this is an error. And there are many details in the code that differ from the paper.

This is the function to calculate the log-likelihood: https://github.com/SimiaoZuo/Transformer-Hawkes-Process/blob/e1fd7ac0a62f2cb674ba64faae889327b931e62c/Utils.py#L58 There are several inputs:

model: the Transformer.
data: the raw output of the model, which needs to go through a linear layer to get the hidden state.
time: the event occurrence times, shape [BATCH, SEQ_LEN].
types: the event types, shape [BATCH, SEQ_LEN].

Preliminary: Two Masks

Please refer to lines 61~65: two masks are built.

non_pad_mask, shape [BATCH, SEQ_LEN]: indicates the padding positions in the batch. This is batch training, and sequences of different lengths in one batch are quite common.

type_mask, shape [BATCH, SEQ_LEN, NUM_TYPES]: a one-hot encoding indicating which event type occurs at each position.
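For reference, a sketch of how such masks can be built; the PAD value of 0 and the 1-indexed event types are assumptions and may differ from the repo's constants:

```python
import torch

PAD = 0  # assumed padding value for event types (may differ from the repo's constant)

def build_masks(types, num_types):
    """types: [BATCH, SEQ_LEN] integer event types, PAD where padded."""
    non_pad_mask = types.ne(PAD).float()                      # 1.0 at real events, 0.0 at padding
    # one-hot over event types; padded positions stay all-zero
    type_mask = torch.zeros(*types.size(), num_types, device=types.device)
    for k in range(num_types):
        type_mask[:, :, k] = (types == k + 1).float()         # types assumed 1-indexed
    return non_pad_mask, type_mask                            # [B, L], [B, L, NUM_TYPES]
```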

Event Likelihood

Get the hidden state, calculate the intensity of every event type at every position, and extract only the intensities of the types that truly occurred. Please refer to lines 67~69:

all_lambda.shape = [BATCH, SEQ_LEN, NUM_TYPES]: each type has its own intensity.
type_lambda.shape = [BATCH, SEQ_LEN]: only the ground-truth type is extracted.

Then apply the log function and sum over the sequence. Please refer to lines 72~73:

event_ll.shape=[BATCH]
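Put together, the steps above look roughly like this sketch; `model.linear`, the softplus, and the mask names are assumptions about the repo's naming rather than its exact code:

```python
import torch
import torch.nn.functional as F

def event_log_likelihood(model, data, type_mask, non_pad_mask):
    all_hid = model.linear(data)                              # [BATCH, SEQ_LEN, NUM_TYPES]
    all_lambda = F.softplus(all_hid)                          # intensity of every type at every position
    type_lambda = torch.sum(all_lambda * type_mask, dim=2)    # keep only the ground-truth type
    # log-intensity at real (non-padded) events, summed over the sequence
    event_ll = torch.sum(torch.log(type_lambda + 1e-9) * non_pad_mask, dim=-1)
    return event_ll                                           # [BATCH]
```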

HERE COMES THE FIRST ERROR: note that the i-th event's intensity is $f_k(h_i)$, where $f_k$ is the softplus function. This is totally different from the paper:

[image: the conditional intensity function defined in the paper]

where for the event $t_i$ its intensity should be $\lambda(t_i) = f_k\!\left(\alpha \frac{t_i - t_{i-1}}{t_i} + \mathbf{w}\,\mathbf{h}_{i-1} + b\right)$. The code's version does not include the 'current' (elapsed-time) term and uses the current hidden state $\mathbf{h}_i$ instead of the last hidden state $\mathbf{h}_{i-1}$.
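If one followed the formula above literally, the per-event intensity would come from the previous hidden state plus the elapsed-time term. A sketch with illustrative names (`alpha`, `w`, `b` stand for the per-type parameters in the equation, and padded positions still need to be masked out afterwards):

```python
import torch
import torch.nn.functional as F

def paper_style_intensity(hidden, time, alpha, w, b):
    """lambda(t_i) = softplus(alpha * (t_i - t_{i-1}) / t_i + w . h_{i-1} + b)

    hidden: [BATCH, SEQ_LEN, D_MODEL]   h_1 .. h_L
    time:   [BATCH, SEQ_LEN]            t_1 .. t_L
    alpha:  [NUM_TYPES], w: [D_MODEL, NUM_TYPES], b: [NUM_TYPES]
    """
    dt = (time[:, 1:] - time[:, :-1]) / time[:, 1:]           # (t_i - t_{i-1}) / t_i
    hist = torch.matmul(hidden[:, :-1, :], w) + b             # w . h_{i-1} + b, per type
    return F.softplus(alpha * dt.unsqueeze(-1) + hist)        # [BATCH, SEQ_LEN-1, NUM_TYPES]
```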

Non-Event Likelihood

The code uses the Monte Carlo method by default to calculate the integral of the intensity function:

[image: the Monte Carlo approximation of the non-event integral from the paper]

The essential idea is that, within every inter-event interval $[t_{j-1}, t_j]$, we uniformly sample N points and calculate their intensities, then use their mean as the representative intensity over the interval. However, when calculating the intensity, the code still uses the current hidden state $\mathbf{h}_j$ instead of the last hidden state $\mathbf{h}_{j-1}$.
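For comparison, here is a sketch of the Monte Carlo estimate that conditions each interval on the last hidden state $\mathbf{h}_{j-1}$; `linear`, the simplified time term, and the sampling details are assumptions, not the repo's exact implementation:

```python
import torch
import torch.nn.functional as F

def mc_non_event_ll(hidden, time, linear, non_pad_mask, num_samples=100):
    """Monte Carlo estimate of the integral of the total intensity,
    conditioning every interval (t_{j-1}, t_j] on h_{j-1} rather than h_j."""
    diff_time = (time[:, 1:] - time[:, :-1]) * non_pad_mask[:, 1:]    # interval lengths
    # num_samples uniform offsets inside every interval
    u = torch.rand(*diff_time.size(), num_samples, device=hidden.device)
    sample_dt = diff_time.unsqueeze(-1) * u                           # [B, L-1, N]

    hist = linear(hidden[:, :-1, :])                                  # history term from h_{j-1}, [B, L-1, K]
    # per-type intensity at the sampled points (time term kept deliberately simple)
    lam = F.softplus(hist.unsqueeze(2) + sample_dt.unsqueeze(-1))     # [B, L-1, N, K]
    mean_total = lam.sum(dim=-1).mean(dim=-1)                         # MC average of the total intensity
    return torch.sum(mean_total * diff_time, dim=-1)                  # [BATCH]
```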