Hi, sorry for the late reply. Are the mean metric values close to those reported in the paper? The confidence interval was calculated as $1.96 \sigma / \sqrt{n}$, where $\sigma$ is the standard deviation over the 20 trials and $n = 20$ is the number of replications.
Yes, all actions are uniformly sampled in our experiments. If we sampled them according to the action distribution of the datasets, the metric values would be much lower.
Here is a reference link on confidence intervals: https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals_print.html
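For concreteness, here is a minimal sketch of that interval computation (my own illustration, not the repository's evaluation code; `confidence_interval` and the placeholder FID values are made up for the example):

```python
import numpy as np

def confidence_interval(trial_values, z=1.96):
    """Return (mean, half-width) of the 95% CI: z * sigma / sqrt(n)."""
    trial_values = np.asarray(trial_values, dtype=float)
    n = len(trial_values)         # number of replications (n = 20 in the paper)
    sigma = trial_values.std()    # standard deviation across the trials
    return trial_values.mean(), z * sigma / np.sqrt(n)

# Placeholder values standing in for 20 independent evaluation runs of one metric.
rng = np.random.default_rng(0)
fid_per_trial = rng.normal(loc=0.12, scale=0.01, size=20)
mean, half_width = confidence_interval(fid_per_trial)
print(f"FID = {mean:.4f} +/- {half_width:.4f}")
```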
When evaluating the provided models for HumanAct12 and CMU Mocap (vanilla_vae_lie_mse_kld01), I get a significantly larger confidence interval. For the FID and accuracy scores, I can increase the number of generated motions (more than 30000 are needed) to approximately reach the reported width of the 95% confidence interval. However, this does not affect the confidence interval of the diversity or multimodality measure. With the settings described in the paper, I get confidence intervals up to 5-6 times larger than reported. When I increase the set sizes for these measures beyond the numbers given in the paper (200 and 20, respectively), the confidence interval shrinks somewhat; see the sketch below for how I understand the diversity computation. Could you explain how the confidence intervals for the evaluation metrics were computed?
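For reference, this is my understanding of the diversity computation, following the Action2Motion-style protocol (the function name and feature handling are my assumptions, not the repository's actual code). Diversity averages distances between two randomly drawn sets of generated-motion features, so the set size directly controls the variance of the estimate:

```python
import numpy as np

def diversity(features, set_size=200, rng=None):
    """Mean distance between two random subsets of classifier features.

    `features`: (N, D) array of motion features from the pretrained
    recognition network. Averaging over more pairs (larger `set_size`)
    shrinks the trial-to-trial variance, and hence the CI width.
    """
    if rng is None:
        rng = np.random.default_rng()
    idx_a = rng.choice(len(features), set_size, replace=False)
    idx_b = rng.choice(len(features), set_size, replace=False)
    return np.linalg.norm(features[idx_a] - features[idx_b], axis=1).mean()
```

Multimodality is, as I understand it, computed analogously per action class with the smaller set size of 20, which would make its estimate noisier still.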
Also, regarding the FID and diversity scores: as far as I can tell, the evaluation script samples all actions uniformly, but the datasets may have some class imbalance. Doesn't this mismatch inflate the FID and diversity scores?
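To make the question concrete, here is a toy comparison of the two sampling schemes (the class counts below are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions = 12                                    # e.g. HumanAct12
# Invented class counts illustrating imbalance in a real dataset.
class_counts = np.array([400, 300, 200, 150, 120, 100, 80, 60, 50, 40, 30, 20])
empirical = class_counts / class_counts.sum()

uniform_labels = rng.integers(0, num_actions, size=3000)          # evaluation script
dataset_labels = rng.choice(num_actions, size=3000, p=empirical)  # dataset frequencies

# Uniform sampling over-represents rare actions relative to the data, so the
# generated and real feature distributions differ by construction (affecting
# FID), and more cross-class pairs end up in the diversity sets.
print(np.bincount(uniform_labels, minlength=num_actions))
print(np.bincount(dataset_labels, minlength=num_actions))
```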