Why using reparameterizaiton-trick for classification loss computation

Hi author, Thanks for your great work. After reading the paper, I have a question regarding the computation of the classification loss.

It's clear that, for the localization loss, you regress the mean of GMM w.r.t the offset of the anchor to GT boxes, and use the variance term to predict the un-certainty of the offset prediction, and optimize this procedure with Log-likelihood Maximizaiton.

However, for classification, as far as I am concerned, you do the optimization in another way. You treat the input data as random variable, and by using the re-parameterization trick, you get the sampled class-specific random variable from the learned GMM, and finally compute the BCE loss between GT and re-parameterized random variable.

My question is that, why do this in this way? Can we just compute the classification loss in a similar way as the localization? Doing something like Maximize the likelihood of pos, and neg samples given the predicted mean and variance of GMM, like N(GT_pos | mu_p, Sigma_p), N(GT_neg | mu_p, Sigma_p), where mu_p and Sigma_p are computed by the network.

I hope my puzzel could be considered,

Best regards

NVlabs / AL-MDN

Why using reparameterizaiton-trick for classification loss computation #12