Extending `BCEWithLogitsLoss` to non-binary labels

BackPACK's extensions that rely on the probabilistic interpretation of a loss function as a negative log-likelihood (quantities based on the Fisher, i.e. `BatchDiagGGNMC`, `DiagGGNMC`, `SqrtGGNMC`, and `KFAC`) currently support only binary labels for `BCEWithLogitsLoss`.

This issue documents the steps required to support continuous-valued labels, and the problems involved.

Description: Currently, we assume binary labels $y_n \in \{0; 1\}$. In this case, `BCEWithLogitsLoss` corresponds to the negative log-likelihood of a Bernoulli distribution $p(y \mid f_n)$ with $f_n \in (0; 1)$ the sigmoid probability.

But `BCEWithLogitsLoss` also supports continuous labels $y_n \in [0; 1]$. In this case, `BCEWithLogitsLoss` corresponds to the negative log-likelihood of a continuous Bernoulli distribution $p(y \mid f_n) \propto f_n^{y} (1 - f_n)^{1 - y}$, such that, up to the additive log-normalization term, $- \log p(y = y_n \mid f_n) = -y_n \log(f_n) - (1 - y_n) \log(1 - f_n)$.
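The relation between the loss and the two likelihoods can be checked numerically. The following is a minimal plain-Python sketch for a single scalar logit `z` with $f_n = \mathrm{sigmoid}(z)$; all function names are hypothetical and not part of BackPACK or PyTorch. It assumes the standard closed form of the continuous Bernoulli normalizer, $C(f_n) = z / \tanh(z/2)$ (with $C = 2$ at $z = 0$), which enters the negative log-likelihood as an additive term that does not depend on the label.

```python
import math


def bce_with_logits(z: float, y: float) -> float:
    """Per-sample loss -y*log(f) - (1-y)*log(1-f) with f = sigmoid(z)."""
    # Numerically stable form: softplus(z) - y*z.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z


def continuous_bernoulli_log_norm(z: float) -> float:
    """log C(f) for f = sigmoid(z), with C = z / tanh(z/2) and C = 2 at z = 0."""
    if abs(z) < 1e-6:
        return math.log(2.0)
    return math.log(z / math.tanh(z / 2.0))


def continuous_bernoulli_nll(z: float, y: float) -> float:
    """Negative log density of the continuous Bernoulli at label y in [0, 1]."""
    return bce_with_logits(z, y) - continuous_bernoulli_log_norm(z)


def integrate_over_unit_interval(fn, steps: int = 10_000) -> float:
    """Midpoint rule on [0, 1]."""
    h = 1.0 / steps
    return h * sum(fn((i + 0.5) * h) for i in range(steps))


z = 0.7
f = 1.0 / (1.0 + math.exp(-z))

# Binary label y = 1: the loss is exactly the Bernoulli negative log-likelihood.
assert math.isclose(bce_with_logits(z, 1.0), -math.log(f))

# Continuous labels: exp(-loss) alone is not a normalized density over y,
# but it is after subtracting the label-independent log-normalizer.
mass_loss_only = integrate_over_unit_interval(
    lambda y: math.exp(-bce_with_logits(z, y))
)
mass_normalized = integrate_over_unit_interval(
    lambda y: math.exp(-continuous_bernoulli_nll(z, y))
)
print(f"unnormalized mass: {mass_loss_only:.4f}, normalized mass: {mass_normalized:.4f}")
```

In PyTorch, `torch.distributions.Bernoulli` and `torch.distributions.ContinuousBernoulli` provide these two distributions directly; the scalar version above only serves to make the formulas explicit.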

Implementation: Depending on the nature of the labels (binary or continuous), a different distribution (Bernoulli or continuous Bernoulli) must be used to compute sampled gradients. However, the `_make_distribution` function currently does not take the labels into account; it only receives the subsampled inputs. Hence, its interface must be adapted to support continuous labels in `BCEWithLogitsLoss`.
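To illustrate what such an adapted interface would have to decide, here is a rough plain-Python sketch that samples a label from either distribution for a scalar logit. The function `sample_label` and its `binary_labels` flag are hypothetical and do not reflect BackPACK's actual `_make_distribution` signature. The continuous case uses inverse-CDF sampling, and the code is not hardened for extreme logits.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sample_label(z: float, binary_labels: bool, rng: random.Random) -> float:
    """Draw one label for logit z from the distribution matching the label type.

    `binary_labels` is a hypothetical flag; it stands in for whatever extra
    information the interface would need in addition to the inputs.
    """
    f = sigmoid(z)
    u = rng.random()
    if binary_labels:
        # Bernoulli(f): label 1 with probability f, else 0.
        return 1.0 if u < f else 0.0
    if abs(z) < 1e-6:
        # f = 1/2: the continuous Bernoulli reduces to the uniform distribution.
        return u
    # Inverse CDF of the continuous Bernoulli; log(f) - log(1 - f) equals z.
    y = (math.log(u * (2.0 * f - 1.0) + 1.0 - f) - math.log(1.0 - f)) / z
    return min(max(y, 0.0), 1.0)  # guard against round-off outside [0, 1]


rng = random.Random(0)
binary_samples = [sample_label(2.0, True, rng) for _ in range(20_000)]
continuous_samples = [sample_label(2.0, False, rng) for _ in range(20_000)]
```

The same logit yields samples with different means under the two distributions, which is why sampled gradients computed with the wrong distribution would estimate a different quantity.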

Problems:

One problem is that this approach would determine the properties of the labels at run time. If we use a data set with non-binary labels but happen to feed a batch (or a single sample) whose labels are all binary, the approach will pick the wrong distribution. Not sure how to fix this, other than asking the user for the nature of their data.
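The pitfall, and the option of asking the user, can be sketched as follows. This is only an illustration in plain Python: `LabelKind`, `looks_binary`, and `check_labels` are hypothetical names, not an existing BackPACK API.

```python
from enum import Enum
from typing import Sequence


class LabelKind(Enum):
    """Hypothetical user-supplied description of the data set's labels."""

    BINARY = "binary"
    CONTINUOUS = "continuous"


def looks_binary(labels: Sequence[float]) -> bool:
    """Run-time check on one batch: are all labels exactly 0 or 1?"""
    return all(y in (0.0, 1.0) for y in labels)


def check_labels(labels: Sequence[float], kind: LabelKind) -> None:
    """Validate a batch against the label kind declared by the user."""
    if any(not 0.0 <= y <= 1.0 for y in labels):
        raise ValueError("Labels must lie in [0, 1].")
    if kind is LabelKind.BINARY and not looks_binary(labels):
        raise ValueError("Declared binary labels, but found values other than 0 and 1.")
    # A CONTINUOUS data set may contain batches that look binary; that is fine.


# The data set has continuous labels, but this particular batch does not show it.
declared = LabelKind.CONTINUOUS
batch = [0.0, 1.0, 1.0]

# Inferring the kind from the batch alone gives the wrong answer here.
inferred = LabelKind.BINARY if looks_binary(batch) else LabelKind.CONTINUOUS
print(f"declared: {declared.value}, inferred from batch: {inferred.value}")

# With an explicit declaration, the batch is accepted and the kind is unambiguous.
check_labels(batch, declared)
```

With an explicit declaration, the run-time check is reduced to validation: it can reject batches that contradict the declared kind, but it never has to guess the distribution.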

f-dangel / backpack

Extending `BCEWithLogitsLoss` to non-binary labels #281