Computing ADE/FDE when compared with other methods

pedro-mgb commented 3 years ago

This topic has come up before (#14 #27 #30), but I felt it would be best to open a new issue rather than comment on closed ones.

I made some changes to the Social GAN code to compute the ADE and FDE metrics the same way Social-STGCNN does (see this issue on the sgan repo): picking the smallest error among all the samples for each trajectory, instead of the smallest overall error for the entire scene/sequence.
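To make the difference between the two aggregations concrete, here is a minimal sketch in plain Python. The function names and the toy data are hypothetical and are not taken from either repository; the sketch only illustrates the per-trajectory versus per-scene minimum as I understand it.

```python
import math

def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all time steps."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)

def fde(pred, truth):
    """Final displacement error: Euclidean distance at the last time step."""
    return math.dist(pred[-1], truth[-1])

def best_of_n_per_trajectory(samples, truths, metric):
    """For each pedestrian, keep the best sample; then average over pedestrians.

    samples[k][i] is the path predicted by sample k for pedestrian i.
    """
    n_peds = len(truths)
    return sum(
        min(metric(sample[i], truths[i]) for sample in samples)
        for i in range(n_peds)
    ) / n_peds

def best_of_n_per_scene(samples, truths, metric):
    """Keep the single sample with the lowest total error over the whole scene."""
    n_peds = len(truths)
    return min(
        sum(metric(sample[i], truths[i]) for i in range(n_peds))
        for sample in samples
    ) / n_peds

# Toy scene (made-up numbers): two pedestrians, two future time steps.
truths = [[(1, 0), (2, 0)], [(0, 1), (0, 2)]]
samples = [
    # Sample A: pedestrian 0 is exact, pedestrian 1 is off by 1 m.
    [[(1, 0), (2, 0)], [(1, 1), (1, 2)]],
    # Sample B: pedestrian 0 is off by 1 m, pedestrian 1 is exact.
    [[(1, 1), (2, 1)], [(0, 1), (0, 2)]],
]

per_trajectory = best_of_n_per_trajectory(samples, truths, ade)
per_scene = best_of_n_per_scene(samples, truths, ade)
print(per_trajectory, per_scene)
```

The per-trajectory value can never be larger than the per-scene value, because each pedestrian is free to pick a different sample. That is why numbers computed the two ways are not directly comparable.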

Below is a table comparing Social-STGCNN (results from the paper) with SGAN-P-20 (as in the paper) and a simpler baseline, a 'multimodal' constant velocity model. I can explain it in more detail if you want, but basically the constant velocity model outputs 20 sample trajectories with constant velocity, where for each sample the magnitude of the velocity is weighted using a normal distribution based on the velocities of the observed trajectory.

Each cell shows ADE / FDE (lower is better).

Model	ETH	HOTEL	UNIV	ZARA1	ZARA2	AVG
Const vel	0.46 / 0.70	0.14 / 0.23	0.31 / 0.59	0.28 / 0.54	0.20 / 0.40	0.28 / 0.49
SGAN-P	0.59 / 0.92	0.34 / 0.66	0.33 / 0.60	0.23 / 0.42	0.22 / 0.39	0.34 / 0.60
Social-STGCNN	0.64 / 1.11	0.49 / 0.85	0.44 / 0.79	0.34 / 0.53	0.30 / 0.48	0.44 / 0.75
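For readers who want a feel for the constant velocity baseline, here is a rough sketch of one possible reading of it. It is not the exact code behind the numbers above: the choice of heading (last observed step), the way the speed is drawn (a normal distribution fitted to the observed per-step speeds), the default `pred_len=12`, and every name in the block are assumptions made for illustration.

```python
import math
import random
import statistics

def multimodal_constant_velocity(observed, pred_len=12, n_samples=20, rng=None):
    """Return n_samples straight-line futures for one pedestrian.

    Assumptions (not confirmed by the experiment above): the heading is the
    direction of the last observed step, and each sample's speed is drawn from
    a normal distribution fitted to the observed per-step speeds.
    """
    rng = rng or random.Random()
    steps = [(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(observed, observed[1:])]
    speeds = [math.hypot(dx, dy) for dx, dy in steps]
    mean_speed = statistics.fmean(speeds)
    std_speed = statistics.pstdev(speeds)

    # Unit heading vector from the last observed step (zero if standing still).
    dx, dy = steps[-1]
    last_speed = math.hypot(dx, dy)
    heading = (dx / last_speed, dy / last_speed) if last_speed > 0 else (0.0, 0.0)

    x_last, y_last = observed[-1]
    futures = []
    for _ in range(n_samples):
        # Clamp at zero so a sample never walks backwards along the heading.
        speed = max(0.0, rng.gauss(mean_speed, std_speed))
        futures.append([
            (x_last + heading[0] * speed * t, y_last + heading[1] * speed * t)
            for t in range(1, pred_len + 1)
        ])
    return futures

# A pedestrian walking along the x axis at exactly one unit per step.
straight_walk = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
futures = multimodal_constant_velocity(straight_walk, rng=random.Random(0))
print(len(futures), futures[0][0], futures[0][-1])
```

With a perfectly steady observed speed the fitted standard deviation is zero, so all 20 samples coincide. On real tracks the samples spread out along the heading direction.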

According to this, not only does SGAN-P outperform Social-STGCNN, but a multimodal constant velocity model seems to outperform both on average. This was also touched on in another issue in the sgan repository, prompted by the paper What the Constant Velocity Model Can Teach Us About Pedestrian Motion Prediction (https://arxiv.org/abs/1903.07933). Although the multimodal constant velocity model they employ is different from mine, it also outperforms Social GAN.

I'd like to get someone's opinion on this matter, because as of right now a multimodal version of constant velocity is achieving results competitive with the state of the art. This raises many questions, many of which have been discussed, but I fear no consensus has been reached. I'll leave a few here:

Are the datasets on which these models are based representative of the huge complexity of human motion and human interactions?
Are the models actually learning meaningful information about interactions between humans, or are they just "making things worse"?
Is this evaluation process enough to compare the different models? For instance some models and benchmarks have been using metrics that take into account collisions between pedestrians. I assume (or hope) that the social models will have better performance in such metrics than the constant velocity method, but I have not done enough experiments in that regard.

Thank you for reading this. Have a good day!

abduallahmohamed commented 3 years ago

Hi @pedro-mgb

Thanks for this.

First, Social GAN cannot use our evaluation method because of the way it generates its results. Social GAN is a generative model in which the generated samples are correlated, so judging it by the best sample for the whole scene is suitable. In our case, we generate the parameters of a distribution and then sample from it.

The constant velocity (CV) model may well be valid for these datasets (I have not looked into it closely), because the datasets are old and not complex enough. I'd prefer any upcoming work to use https://www.aicrowd.com/challenges/trajnet-a-trajectory-forecasting-challenge which is rich enough, with more complex situations and better annotations. I think this answers your first bullet point.

For the second point, I think the only way to evaluate this is by qualitative analysis. Also, if you want to use these models in a real-life application, you will need a lot of conditions around them. I don't believe they make things worse; all of them are approaches to a complex problem, and each method has its own shortcomings.

For the third point, the best-of-N metrics (FDE-20, ADE-20) are not suitable for judging performance. Why 20? ...etc? This article http://ai.stanford.edu/blog/trajectory-forecasting/ discusses this point extensively.

Let me know if you have more questions

pedro-mgb commented 3 years ago

Thank you for the response. I agree with what you said.

Multimodal CV may have the "best" prediction among 20 samples, but if we look at the errors of the other predictions (e.g. the average over the top-X samples), or use an NLL loss, as discussed in that article, we will see that CV looks much worse than your model, Social GAN, or an LSTM.
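A small sketch of the top-X idea, assuming we already have one error value per sample for a single trajectory. The helper name `top_k_mean_error` and the error values are made up purely for illustration; they are not measurements from any model.

```python
def top_k_mean_error(sample_errors, k):
    """Mean of the k smallest per-sample errors; k=1 is the usual best-of-N."""
    if not 1 <= k <= len(sample_errors):
        raise ValueError("k must be between 1 and the number of samples")
    return sum(sorted(sample_errors)[:k]) / k

# Hypothetical per-sample errors for one trajectory (illustrative only).
spread_out = [0.25, 1.0, 1.5, 2.25]   # one lucky sample, the rest are poor
consistent = [0.5, 0.5, 0.75, 0.75]   # no lucky sample, but all are decent

best_of_n = (top_k_mean_error(spread_out, 1), top_k_mean_error(consistent, 1))
top_4 = (top_k_mean_error(spread_out, 4), top_k_mean_error(consistent, 4))
print(best_of_n, top_4)
```

With k=1 the spread-out predictor looks better, while averaging over all four samples reverses the ranking. A widely scattered set of samples can win under best-of-N and still be the weaker predictor overall.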

Regarding Trajnet++, I think it's a step in the right direction toward having some form of standard. But I believe trajectory forecasting with data-driven models is still just taking its first steps.

I don't really have any other questions. Thank you once again. Feel free to close the issue if you want.

abduallahmohamed / Social-STGCNN

Computing ADE/FDE when compared with other methods #47