Closed · MengyuanChen21 closed this issue 1 year ago
Dear Mengyuan,
Thank you very much for your attention and appreciation of our work. Here are my responses to your questions:
The key point of Section 3.1 is in the first sentence of the second paragraph: "when trained with MSE loss, the evidential network proposed by EDL can be understood as a new probabilistic graphical model."
Re-interpreting EDL from the perspective of probabilistic graphical models is a new viewpoint proposed in our work (see Figure 2; removing the orange line yields the probabilistic graphical model of EDL).
The reason we assume the observed label $\boldsymbol{y}$ follows an isotropic Gaussian distribution is mainly that EDL optimized by minimizing the expected MSE is equivalent to maximizing the expected likelihood of the observed labels $\boldsymbol{y}$, as shown below: $\underset{\boldsymbol{\theta}}{\arg \max} \mathbb{E}_{(\boldsymbol{x}, \boldsymbol{y}) \sim \mathcal{P}}\left[\log \mathbb{E}_{\boldsymbol{p} \sim \text{Dir}(\boldsymbol{\alpha})}\left[\mathcal{N}\left(\boldsymbol{y} \mid \boldsymbol{p}, \sigma^2 \boldsymbol{I}\right)\right]\right]=\underset{\boldsymbol{\theta}}{\arg \min } \mathbb{E}_{(\boldsymbol{x}, \boldsymbol{y}) \sim \mathcal{P}} \mathbb{E}_{\boldsymbol{p} \sim \text{Dir}(\boldsymbol{\alpha})}\left[(\boldsymbol{y}-\boldsymbol{p})^T(\boldsymbol{y}-\boldsymbol{p})\right]$
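As a quick sanity check on the intuition behind this equivalence, the sketch below (with hypothetical values for $\boldsymbol{\alpha}$ and $\sigma$, not the paper's code) verifies numerically that the per-sample Gaussian log-density $\log \mathcal{N}(\boldsymbol{y} \mid \boldsymbol{p}, \sigma^2 \boldsymbol{I})$ is an affine, strictly decreasing function of the squared error $(\boldsymbol{y}-\boldsymbol{p})^T(\boldsymbol{y}-\boldsymbol{p})$, which is why a larger likelihood corresponds to a smaller MSE:

```python
import numpy as np

rng = np.random.default_rng(0)
K, sigma = 3, 0.5

# Hypothetical concentration parameters alpha = f_theta(x) + 1 for one input x.
alpha = np.array([5.0, 1.5, 1.5])
y = np.array([1.0, 0.0, 0.0])            # one-hot observed label

p = rng.dirichlet(alpha, size=10000)      # p ~ Dir(alpha)
sq_err = np.sum((y - p) ** 2, axis=1)     # (y - p)^T (y - p) per sample

# Per-sample Gaussian log-density: log N(y | p, sigma^2 I)
log_pdf = -sq_err / (2 * sigma**2) - (K / 2) * np.log(2 * np.pi * sigma**2)

# The log-density is a constant minus the scaled squared error, so maximizing
# the Gaussian likelihood and minimizing the squared error pull the same way.
const = -(K / 2) * np.log(2 * np.pi * sigma**2)
assert np.allclose(log_pdf, const - sq_err / (2 * sigma**2))
```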
If $\boldsymbol{y}$ follows a Gaussian distribution, how can it be a one-hot vector? By the equivalence above, minimizing the expected MSE also constrains the model-generated $\boldsymbol{p}$ to concentrate on one-hot outputs. From the graphical-model perspective, $\boldsymbol{y}$ can be seen as the expected value of repeated sampling from a Gaussian with mean $\boldsymbol{p}$ and covariance $\sigma^2 \boldsymbol{I}$, where $\boldsymbol{p}$ is itself repeatedly sampled from a Dirichlet distribution. Training constrains this expected value to be consistent with the ground-truth label.
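The two-stage sampling view can be simulated directly. The sketch below (with hypothetical concentration values, not the paper's code) draws $\boldsymbol{p} \sim \text{Dir}(\boldsymbol{\alpha})$ many times, then draws $\boldsymbol{y} \sim \mathcal{N}(\boldsymbol{p}, \sigma^2 \boldsymbol{I})$, and checks that the empirical mean of the samples recovers the Dirichlet mean $\boldsymbol{\alpha} / \sum_k \alpha_k$, which training would push toward the one-hot label:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1

# Hypothetical concentration parameters alpha = f_theta(x) + 1; a well-trained
# network would place most evidence on the ground-truth class.
alpha = np.array([20.0, 1.0, 1.0])

# Two-stage sampling of the graphical model: p ~ Dir(alpha), then
# y_sample ~ N(p, sigma^2 I) for each drawn p.
p = rng.dirichlet(alpha, size=100_000)
y_samples = p + sigma * rng.standard_normal(p.shape)

# Averaging over many samples, the Gaussian noise cancels and the mean of p
# remains, so the expectation approaches alpha / alpha.sum().
empirical_mean = y_samples.mean(axis=0)
print(empirical_mean)   # ≈ alpha / alpha.sum() = [0.909, 0.045, 0.045]
```

With a large concentration on the true class, this expected value is close to the one-hot vector, which is how a Gaussian observation model and one-hot labels coexist in expectation.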
Hope the above content can answer your questions. Thank you again for your interest in our work. Please feel free to discuss if you have any other questions.
Thanks so much for the reply! Now I have a much clearer understanding of Section 3.1.
However, I still have a question about the application of the PAC-Bayesian bound in Section 3.3. It seems that the last term in Theorem 3.1, $\Psi_{\mathcal{P},\pi}(\lambda,n)$, is omitted in Eq.(4). Could you shed some light on the rationale behind this particular omission? I have carefully read the supplementary material, but I am still confused about it.
Thanks again for your time and consideration!
The PAC-Bayesian bound (Theorem 3.1) is derived from the theorems of Germain et al. (2009), Alquier et al. (2016), and Masegosa (2020). The omission of $\Psi_{\mathcal{P},\pi}(\lambda,n)$ also stems from the conclusions of these references. More specifically, Section 4 of Alquier et al. (2016) and Section 3.2 of Masegosa (2020) both state that the term $\Psi_{\mathcal{P},\pi}(\lambda,n)$ is constant w.r.t. $\rho$. In fact, this conclusion follows directly from the definition of $\Psi_{\mathcal{P},\pi}(\lambda,n)$.
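For intuition, the term in question takes (up to notational differences; this is a sketch in the standard form used by Alquier et al. (2016), with $L_{\mathcal{P}}$ the true risk and $\hat{L}_S$ the empirical risk on a sample $S$ of size $n$) the shape

$$\Psi_{\mathcal{P},\pi}(\lambda, n) = \frac{1}{\lambda} \log \mathbb{E}_{h \sim \pi}\, \mathbb{E}_{S \sim \mathcal{P}^n}\!\left[ e^{\lambda \left( L_{\mathcal{P}}(h) - \hat{L}_S(h) \right)} \right].$$

The expectations here are taken over the prior $\pi$ and the data distribution $\mathcal{P}$ only; the posterior $\rho$ does not appear, so the term is a constant when optimizing the bound over $\rho$ and can be dropped from the objective in Eq.(4).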
OK, I get it now. Thanks so much for the detailed reply! Thanks for your time and consideration!
It's my pleasure~
Dear Authors, I would like to express my admiration for the extraordinary work you have done, which has contributed significantly to the field. Your detailed research and insightful analysis are greatly appreciated.
Upon careful reading of your esteemed publication, I came across a section that elicited a few questions regarding its underpinnings. In Section 3.1, there is a reference to the work of Sensoy et al. (2018), which suggests that EDL assumes the observed labels, denoted by $y$, are drawn independently and identically from an isotropic Gaussian distribution, i.e., $y\sim\mathcal{N}(p,\sigma^2 I)$, where $p\sim \text{Dir}(f_\theta(x)+1)$.
The aspect that I found perplexing relates to the encoding of $y$ as a one-hot vector, as stated in your paper. It is not entirely clear to me how a one-hot vector can adhere to a Gaussian distribution.
Furthermore, I noticed an apparent departure from the original work of Sensoy et al. As far as I understand, their work does not appear to make any connection to an isotropic Gaussian distribution.
Could you kindly shed some light on this apparent discrepancy? I would greatly appreciate any insights you could provide on how the Gaussian distribution is tied to Sensoy et al.'s work, or if the assumption is a modification or extension of the original model that better suits your research objectives.
Thank you in advance for your time and consideration. I look forward to your valued response.