babylonhealth / neuralTPPs


A few ambiguities for replicating results #9

Open hojjatkarami opened 1 year ago

hojjatkarami commented 1 year ago

Hello,

First, thank you for your very well-written code, which made it easy for me to get started. I managed to replicate one of your results: Table 3, Synthea (full), GRU encoder + CP decoder, AUROC ≈ 0.85.

  1. However, I wanted to test whether the point-process (PP) loss function is actually beneficial. Hence, I removed the integral term in enc_dec.neg_log_likelihood:

intensity_integral = intensity_integral[:, :-1] * 0  # zero out the integral term, shape [B, L]

As a result, the loss function reduces to a simple cross entropy (for the multi-class case) or binary cross entropy (for the multi-label case); see the sketch below. Surprisingly, I saw no performance degradation, which might indicate that the integral term (and hence the point-process loss) has no effect.

What is more interesting is that on the Retweets dataset I could achieve AUROC = 0.68 (0.61 in the paper) when omitting the integral term!
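To make this concrete, here is a minimal sketch of the kind of objective I mean (my own illustration with made-up shapes and names, not the repo's actual enc_dec.neg_log_likelihood):

```python
# Minimal sketch of a standard marked-TPP NLL; dropping the integral (compensator)
# leaves only an (unnormalised) cross-entropy-style term over the marks.
import torch

def marked_tpp_nll(log_intensity, intensity_integral, mark_onehot, use_integral=True):
    """log_intensity:      [B, L, K] log lambda_k(t_j) at each observed event time
       intensity_integral: [B, L]    integral of the total intensity between events
       mark_onehot:        [B, L, K] one-hot (or multi-hot) observed marks"""
    # Log-intensity of the observed mark(s) at each event time.
    event_term = (mark_onehot * log_intensity).sum(-1)            # [B, L]
    if not use_integral:
        # Zeroing the compensator: only the event term survives,
        # i.e. an unnormalised cross-entropy over the marks.
        intensity_integral = torch.zeros_like(intensity_integral)
    nll = -(event_term.sum(-1) - intensity_integral.sum(-1))      # [B]
    return nll.mean()
```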

  2. Another issue for me is the way you report AUROC for label prediction. In the literature, researchers tend to report metrics (accuracy, F1, AUROC, ...) for next-event prediction, but in your code it seems that you use information including $t_j$ when predicting the $j$-th mark itself.
josephenguehard commented 1 year ago

Hi!

First, sorry for the very late reply; I left Babylon a few months ago and didn't get a notification for this issue. And thanks for your interest in our work!

It's surprising that the integral term has no effect on the result. If you look at Figure 2 of our paper, you'll see that the model is able to pick up regularly spaced events, which is not possible with a simple conditional Poisson decoder. So at least the NLL should be better.
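To spell that out (a rough argument rather than an exact statement about our models): with a constant intensity $\lambda$ between events, the implied inter-event time density is

$$p(\tau) = \lambda e^{-\lambda \tau},$$

which is monotonically decreasing, so it can never place a peak at a characteristic spacing. A decoder whose intensity varies within the interval can, and the integral term is what penalises intensity placed away from the observed events.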

As for the Retweets dataset, the AUROC was already better using CP rather than the full TPP model, so I'm not surprised that omitting the integral term works even better. This dataset is probably not best modelled with TPPs.

About your second point: we model the joint distribution to predict both the time and the type of an event. But when computing the AUROC, we use the true time of the next event, $t_j$, to check whether the model predicts the correct $j$-th mark. This metric is therefore limited: it only tells you whether the correct mark is predicted given the true time. As a result, the NLL should be preferred when comparing models.
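Schematically, the evaluation looks like this (an illustrative sketch, not the exact evaluation code in this repo):

```python
# Sketch of mark AUROC given the true event times: score each event by the
# per-mark intensities evaluated at the true time t_j, then compare to the true marks.
import torch
from sklearn.metrics import roc_auc_score

def mark_auroc(intensities_at_true_times: torch.Tensor, true_marks: torch.Tensor) -> float:
    """intensities_at_true_times: [N, K] lambda_k(t_j | history) evaluated at the true t_j
       true_marks:                [N]    integer labels of the observed marks"""
    # Normalise the per-mark intensities into a categorical score for each event.
    probs = intensities_at_true_times / intensities_at_true_times.sum(-1, keepdim=True)
    # One-vs-rest AUROC over all events.
    return roc_auc_score(
        true_marks.detach().cpu().numpy(),
        probs.detach().cpu().numpy(),
        multi_class="ovr",
        labels=list(range(probs.shape[1])),
    )
```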

Best, Joseph