ant-research / EasyTemporalPointProcess

EasyTPP: Towards Open Benchmarking Temporal Point Processes
https://ant-research.github.io/EasyTemporalPointProcess/
Apache License 2.0

A data leak in NHP? #32

Closed ivan-chai closed 2 months ago

ivan-chai commented 3 months ago

Hi! Thank you for sharing implementations of popular TPP methods.

I'm trying to understand the NHP implementation, and I'm confused by the following line:

`dt = time_delta_seq[:, i + 1]`

An input sequence contains time deltas, and the first delta is zero. When we take `i + 1`, we actually take the time to the next event and use it to predict the output at position i. This way we use the next timestamp at the current step. Is this a data leak, or am I missing something? The same may be true of the original Theano NHP implementation.
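To make the concern concrete, here is a toy, self-contained sketch of the pattern (the shapes and the decay are placeholders of mine, not the actual EasyTPP code):

```python
import torch

batch, seq_len, hidden = 2, 5, 4
time_delta_seq = torch.rand(batch, seq_len)  # delta[:, 0] is 0 by convention
time_delta_seq[:, 0] = 0.0
c = torch.zeros(batch, hidden)               # stand-in for the LSTM cell state

for i in range(seq_len - 1):
    dt = time_delta_seq[:, i + 1]            # gap to the *next* event
    # ... rnn_cell(...) would update the cell state here ...
    # the hidden state emitted "at step i" is decayed over dt,
    # i.e. over the future gap -- this is the suspected look-ahead:
    h_t = torch.tanh(c) * torch.exp(-dt).unsqueeze(-1)
```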

iLampard commented 3 months ago

Hi,

In the `forward` function, we use the next timestamp to compute how the state decays into the future; the decayed state then becomes the input for computing the updated state when the next event happens.

In prediction, we compute sampled inter-event time points (based on `time_delta_seq`) and then compute the expectation of the intensities over these sampled time points. Theoretically we should sample inter-event times from time 0 to +inf, but for computational convenience the upper bound is taken from `time_delta_seq`, i.e., the inter-event time to the next future event.
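For reference, the decay being described is the continuous-time LSTM decay from the NHP paper (Mei & Eisner, 2017); a rough sketch with my own variable names, not the exact EasyTPP function:

```python
import torch

def decay(c_i, c_bar_i, delta_i, o_i, dt):
    # c(t) = c_bar + (c - c_bar) * exp(-delta * (t - t_i)), evaluated at t = t_i + dt
    c_t = c_bar_i + (c_i - c_bar_i) * torch.exp(-delta_i * dt.unsqueeze(-1))
    h_t = o_i * torch.tanh(c_t)  # decayed hidden state, read out at t_i + dt
    return c_t, h_t
```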

ivan-chai commented 2 months ago

Training

The `time_delta_seq` sequence stores the time delta from the last event. When we take `time_delta_seq[:, i + 1]`, we actually take the delta to the next event. Then we apply `rnn_cell`, which computes `c_i` and `c_bar_i` without `dt`. After that we use `dt`, `c_i`, and `c_bar_i` in the decay function to compute `h_t`, which is the model's output at step i. During likelihood computation, we use the model's output to compute `event_ll`. This way, `event_ll` at step i depends on the time of the future event during training.

Inference

During inference, we use the `predict_one_step_at_every_event` function from `TorchBaseModel`. Inside this function we see `time_delta_seq = time_delta_seq[:, 1:]`, so `time_delta_seq` now stores the time to the next event. Then we use `compute_intensities_at_sample_times` from NHP. The latter doesn't use the outputs, but the states. The states don't depend on the future event, but they are computed with the modified `time_delta_seq` from `predict_one_step_at_every_event`. So inference also seems to look one step forward.
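A toy illustration of the slicing (the data and the sampling grid are mine; treat this as a sketch of the mechanism, not the exact code):

```python
import torch

time_seq = torch.tensor([[0.0, 1.2, 2.0, 3.5]])
time_delta_seq = time_seq.diff(prepend=torch.zeros(1, 1))  # [0.0, 1.2, 0.8, 1.5]

time_delta_seq = time_delta_seq[:, 1:]        # delta[:, i] = gap to the *next* event
ratios = torch.linspace(0.0, 1.0, steps=10)   # sample points in [0, delta]
sample_dtimes = time_delta_seq.unsqueeze(-1) * ratios  # upper bound = the true future gap
```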

iLampard commented 2 months ago

Hi, Ivan, let me think about it and get back to you soon.

iLampard commented 2 months ago

Hi Ivan,

Please see my response below,

Training

> The `time_delta_seq` sequence stores the time delta from the last event. When we take `time_delta_seq[:, i + 1]`, we actually take the delta to the next event. Then we apply `rnn_cell`, which computes `c_i` and `c_bar_i` without `dt`. After that we use `dt`, `c_i`, and `c_bar_i` in the decay function to compute `h_t`, which is the model's output at step i. During likelihood computation, we use the model's output to compute `event_ll`. This way, `event_ll` at step i depends on the time of the future event during training.

In the log-likelihood, we try to maximize the probability of the events that actually happen. `event_ll` at step i corresponds to the future event, computed from time i: we check whether, when we decay the state over the delta time to the next event, the model assigns intensity to the event that really happens. We use the `type_mask` to do this (in that line, I pass the future event type as the `type_mask`); if the type matches, the mask is 1, otherwise 0.
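Put differently, this is the standard TPP log-likelihood term log lambda_{k_{i+1}}(t_{i+1}): the state decayed over dt is scored against the event that truly occurs there, which is supervision rather than leakage. A toy sketch of the masking (all names are mine):

```python
import torch
import torch.nn.functional as F

num_types = 3
lambdas = torch.rand(2, 4, num_types) + 1e-6     # intensities of each type at each event time
next_types = torch.randint(num_types, (2, 4))    # type of the event that actually occurs
type_mask = F.one_hot(next_types, num_types).float()  # 1 for the true type, 0 otherwise
event_ll = torch.log((lambdas * type_mask).sum(-1))   # log-intensity of the true next type
```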

So the training code seems fine.

Inference

> During inference, we use the `predict_one_step_at_every_event` function from `TorchBaseModel`. Inside this function we see `time_delta_seq = time_delta_seq[:, 1:]`, so `time_delta_seq` now stores the time to the next event. Then we use `compute_intensities_at_sample_times` from NHP. The latter doesn't use the outputs, but the states. The states don't depend on the future event, but they are computed with the modified `time_delta_seq` from `predict_one_step_at_every_event`. So inference also seems to look one step forward.

I understand you to mean that we use the future inter-event time to bound the sampling, right? I agree that a better way is to sample the inter-event time from [0, +inf), independent of the true inter-event time. I did it that way in the first version, but it made little difference in practice, so I just followed AttNHP. If you find that this produces very different results in your experiments, I will correct it.

ivan-chai commented 2 months ago

Thank you for the detailed response! Yes, training is OK. For inference, we can use a predefined `dtime_max` as the time-delta upper bound to prevent the inference data leak.
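Something like the following, assuming a predefined `dtime_max` is available, as suggested above (a sketch of the proposal, not existing EasyTPP code):

```python
import torch

dtime_max = 5.0                                  # predefined horizon, independent of the data
ratios = torch.linspace(0.0, 1.0, steps=10)
sample_dtimes = (dtime_max * ratios).expand(2, 3, -1)  # (batch, seq, num_samples), no future info
```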

iLampard commented 2 months ago

> Thank you for the detailed response! Yes, training is OK. For inference, we can use a predefined `dtime_max` as the time-delta upper bound to prevent the inference data leak.

Yes, this is better. I will improve it in the next version. Or you can make a PR.