Closed ivan-chai closed 2 months ago
Hi,
In the `forward` function, we use the next timestamp to compute how the state decays in the future; the decayed state is then the input for computing the updated state when the next event happens.
In prediction, we sample inter-event time points (based on `time_delta_seq`) and then compute the expectation of the intensities over these sampled time points. Theoretically we should sample inter-event times from time 0 to time +inf, but for computational convenience the upper bound is taken from `time_delta_seq`, i.e., the inter-event time to the next future event.
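The expectation over sampled inter-event times can be sketched as a simple Monte Carlo average. This is an illustrative sketch, not the repo's code: `intensity_fn`, the uniform grid, and the `[0, dt_next]` bound are assumptions mirroring the description above.

```python
import numpy as np

def expected_intensity(intensity_fn, dt_next, n_samples=1000):
    # Hypothetical sketch: average the intensity over times sampled in
    # [0, dt_next]. The upper bound dt_next plays the role of the value
    # taken from time_delta_seq (time to the next future event).
    ts = np.linspace(0.0, dt_next, n_samples)
    return intensity_fn(ts).mean()

# Example with an exponentially decaying intensity lambda(t) = exp(-t);
# the true average over [0, 1] is 1 - exp(-1) ~ 0.632.
approx = expected_intensity(lambda t: np.exp(-t), dt_next=1.0)
```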
The `time_delta_seq` sequence stores time deltas from the last event. When we take `time_delta_seq[:, i + 1]`, we actually take the delta to the next event. Then we apply `rnn_cell`, which computes `c_i` and `c_bar_i` without `dt`. After that we use `dt`, `c_i`, and `c_bar_i` in the `decay` function to compute `h_t`, which is the model's output at step `i`. During likelihood computation, we use the model's output to compute `event_ll`. This way `event_ll` at step `i` depends on the time of the future event during training.
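For concreteness, the decay step described above can be sketched as follows. The formula is the NHP continuous-time LSTM decay (cell state relaxes from `c_i` toward a target `c_bar_i` at rate `delta_i`); the scalar shapes and variable names here are illustrative assumptions, not the repo's exact signatures.

```python
import numpy as np

def decay(c_i, c_bar_i, delta_i, output_gate, dt):
    # Cell state decays from c_i toward its target c_bar_i over elapsed time dt.
    c_t = c_bar_i + (c_i - c_bar_i) * np.exp(-delta_i * dt)
    # Hidden state h_t (the model's output at step i) is read off the decayed cell.
    h_t = output_gate * np.tanh(c_t)
    return h_t, c_t

# As dt grows, c_t approaches the target c_bar_i.
h, c = decay(c_i=1.0, c_bar_i=0.2, delta_i=0.5, output_gate=1.0, dt=100.0)
```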
During inference, we use the `predict_one_step_at_every_event` function from `TorchBaseModel`. Inside this function we see `time_delta_seq = time_delta_seq[:, 1:]`, so `time_delta_seq` stores the time to the next event. Then we use `compute_intensities_at_sample_times` from NHP. The latter doesn't use outputs but states. The states don't depend on the future event, but they are computed with the modified `time_delta_seq` from `predict_one_step_at_every_event`. So inference also seems to look one step forward.
Hi, Ivan, let me think about it and get back to you soon.
Hi Ivan,
Please see my response below.
Training
> The `time_delta_seq` sequence stores time deltas from the last event. When we take `time_delta_seq[:, i + 1]`, we actually take the delta to the next event. Then we apply `rnn_cell`, which computes `c_i` and `c_bar_i` without `dt`. After that we use `dt`, `c_i`, and `c_bar_i` in the `decay` function to compute `h_t`, which is the model's output at step `i`. During likelihood computation, we use the model's output to compute `event_ll`. This way `event_ll` at step `i` depends on the time of the future event during training.
In the log-likelihood, we try to maximize the probability of the events that happen. `event_ll` at step `i` corresponds to the future event computed from time `i`: we check whether, if we decay the state over the delta time to the next event, we get a prediction that matches the real event that happens. We use the `type_mask` to do this (in this line, I pass the future event as the `type_mask`). If it matches, we get 1, otherwise 0.
So the training code seems fine.
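The `type_mask` selection described above can be sketched as follows; the shapes and the helper name are assumptions for illustration, not the repo's exact API.

```python
import numpy as np

def event_loglik(lambdas, type_mask):
    # lambdas: [batch, steps, num_types] intensities evaluated at event times.
    # type_mask: one-hot encoding of the observed (next) event type, so the
    # product keeps the intensity of the true type and zeros out the rest.
    lambda_at_event = (lambdas * type_mask).sum(axis=-1)
    return np.log(lambda_at_event)

lams = np.array([[[0.2, 0.8], [0.5, 0.5]]])
mask = np.array([[[0.0, 1.0], [1.0, 0.0]]])  # observed types: 1, then 0
ll = event_loglik(lams, mask)
```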
Inference
> During inference, we use the `predict_one_step_at_every_event` function from `TorchBaseModel`. Inside this function we see `time_delta_seq = time_delta_seq[:, 1:]`, so `time_delta_seq` stores the time to the next event. Then we use `compute_intensities_at_sample_times` from NHP. The latter doesn't use outputs but states. The states don't depend on the future event, but they are computed with the modified `time_delta_seq` from `predict_one_step_at_every_event`. So inference also seems to look one step forward.
I understand you mean we use the future inter-event time to bound the sampling, right? I agree that a better way is to sample the inter-event time from [0, \infty), independent of the true inter-event time. I did it that way in the first version, but it made little difference in practice, so I just followed AttNHP. If you find this produces very different results in the experiments, I will correct it.
Thank you for the detailed response! Yes, training is OK. For inference we can use a predefined `dtime_max` as a time-delta bound to prevent the inference data leak.
Yes, this is better. I will improve it in the next version. Or you can make a PR.
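The proposed fix can be sketched as follows: bound the sampled inter-event times by a predefined `dtime_max` instead of the true future delta, so inference never depends on the next event's time. The function name, grid, and default value are assumptions for illustration.

```python
import numpy as np

def sample_dtimes(batch, steps, dtime_max=5.0, n_samples=4):
    # Same fixed grid [0, dtime_max] at every position: no look-ahead
    # to the true inter-event time, unlike slicing time_delta_seq[:, 1:].
    ratios = np.linspace(0.0, 1.0, n_samples)
    return np.broadcast_to(ratios * dtime_max, (batch, steps, n_samples))

dtimes = sample_dtimes(batch=2, steps=3)
```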
Hi! Thank you for sharing implementations of popular TPP methods.
I'm trying to understand the NHP implementation and I'm confused by the following line:
An input sequence contains time deltas, and the first delta is zero. When we take index `i + 1`, we actually take the time to the next event and use it for predicting the output at position `i`. This way we use the next timestamp at the current step. Is it a data leak, or am I missing something? The same is possibly true for the original Theano NHP implementation.
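A toy illustration of the indexing concern (array values assumed, not taken from the repo):

```python
import numpy as np

# time_delta_seq[:, i + 1] pairs the state at step i with the delta to
# the *next* event, which is only known in the future at inference time.
time_delta_seq = np.array([[0.0, 0.4, 0.9]])  # first delta is zero
i = 0
dt = time_delta_seq[:, i + 1]  # 0.4: time from event 0 to event 1
```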