Hi, sorry for the late reply. Are the mean metric values close to those reported in the paper? The confidence interval was calculated as $1.96 \sigma / \sqrt{n}$, where $\sigma$ is the standard deviation over the 20 trials and $n = 20$ is the number of replications.
Yes, all actions are uniformly sampled in our experiments. If we sampled them according to the action distribution of the datasets, the metric values would be much lower.
Here is a reference link on confidence intervals: https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals_print.html
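For concreteness, here is a minimal sketch of that interval computation (my own illustration, not the repository's evaluation code; `confidence_interval` and the placeholder FID values are made up for the example):

```python
import numpy as np

def confidence_interval(trial_values, z=1.96):
    """Return (mean, half-width) of the 95% CI: z * sigma / sqrt(n)."""
    trial_values = np.asarray(trial_values, dtype=float)
    n = len(trial_values)         # number of replications (n = 20 in the paper)
    sigma = trial_values.std()    # standard deviation across the trials
    return trial_values.mean(), z * sigma / np.sqrt(n)

# Placeholder values standing in for 20 independent evaluation runs of one metric.
rng = np.random.default_rng(0)
fid_per_trial = rng.normal(loc=0.12, scale=0.01, size=20)
mean, half_width = confidence_interval(fid_per_trial)
print(f"FID = {mean:.4f} +/- {half_width:.4f}")
```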
When evaluating the provided models for HumanAct12 and CMU Mocap (vanilla_vae_lie_mse_kld01), I get a significantly larger confidence interval. For the FID and accuracy scores, I can increase the number of generated motions (more than 30000 are needed) to approximately reach the reported width of the 95% confidence interval. However, this does not affect the confidence interval of the diversity or multimodality measure. With the settings described in the paper, I get confidence intervals up to 5-6 times larger than reported. When I increase the set sizes for these measures beyond the numbers given in the paper (200 and 20, respectively), the confidence interval shrinks somewhat; see the sketch below for how I understand the diversity computation. Could you explain how the confidence intervals for the evaluation metrics were computed?
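For reference, this is my understanding of the diversity computation, following the Action2Motion-style protocol (the function name and feature handling are my assumptions, not the repository's actual code). Diversity averages distances between two randomly drawn sets of generated-motion features, so the set size directly controls the variance of the estimate:

```python
import numpy as np

def diversity(features, set_size=200, rng=None):
    """Mean distance between two random subsets of classifier features.

    `features`: (N, D) array of motion features from the pretrained
    recognition network. Averaging over more pairs (larger `set_size`)
    shrinks the trial-to-trial variance, and hence the CI width.
    """
    if rng is None:
        rng = np.random.default_rng()
    idx_a = rng.choice(len(features), set_size, replace=False)
    idx_b = rng.choice(len(features), set_size, replace=False)
    return np.linalg.norm(features[idx_a] - features[idx_b], axis=1).mean()
```

Multimodality is, as I understand it, computed analogously per action class with the smaller set size of 20, which would make its estimate noisier still.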
Also, regarding the FID and diversity scores: as far as I can tell, the evaluation script samples all actions uniformly, but the datasets may have some class imbalance. Doesn't this mismatch inflate the FID and diversity scores?
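To make the question concrete, here is a toy comparison of the two sampling schemes (the class counts below are invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
num_actions = 12                                    # e.g. HumanAct12
# Invented class counts illustrating imbalance in a real dataset.
class_counts = np.array([400, 300, 200, 150, 120, 100, 80, 60, 50, 40, 30, 20])
empirical = class_counts / class_counts.sum()

uniform_labels = rng.integers(0, num_actions, size=3000)          # evaluation script
dataset_labels = rng.choice(num_actions, size=3000, p=empirical)  # dataset frequencies

# Uniform sampling over-represents rare actions relative to the data, so the
# generated and real feature distributions differ by construction (affecting
# FID), and more cross-class pairs end up in the diversity sets.
print(np.bincount(uniform_labels, minlength=num_actions))
print(np.bincount(dataset_labels, minlength=num_actions))
```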