JuliaGaussianProcesses / GPLikelihoods.jl

Provides likelihood functions for Gaussian Processes.
https://juliagaussianprocesses.github.io/GPLikelihoods.jl/
MIT License

Categorical takes C-1 inputs for C classes #55

Closed rossviljoen closed 2 years ago

rossviljoen commented 2 years ago

Currently, the CategoricalLikelihood is defined to take a vector of C-1 inputs to produce a distribution with C classes by appending a 0 to the input vector before going through the softmax.
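
For concreteness, here is a minimal sketch of that mapping in plain Julia (not the package's actual implementation; `softmax` is defined inline just for illustration):

```julia
# Naive softmax, defined here only for illustration.
softmax(x) = exp.(x) ./ sum(exp.(x))

# Mapping as described above: C - 1 latent inputs are extended with a
# fixed 0 and pushed through softmax to give C class probabilities.
f = [0.3, -1.2]            # C - 1 = 2 latent inputs
p = softmax(vcat(f, 0.0))  # 3 class probabilities, sums to 1
```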

This is fine for simple cases, but I think there are situations where you'd want to supply all C inputs - e.g. if you're doing multi-class classification with a different kernel for each class, I don't think that's possible with the current version?

Should I make a PR to change it, or is there a reason to keep it as is? (It would still always be possible to recover the current behaviour by appending a zero yourself.)

@willtebbutt

devmotion commented 2 years ago

The motivation for fixing one input was to ensure that the mapping is invertible: we map C-1 inputs to the C-1 dimensional simplex. It is the natural generalization of the logistic function, as used e.g. in multinomial logistic regression. I can see though that it can be a bit inconvenient sometimes.
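
A small numerical illustration of the invertibility (a sketch, again with an inline `softmax` rather than the package's code): the inputs can be read back off the simplex point as log odds against the reference class, just as in multinomial logistic regression.

```julia
softmax(x) = exp.(x) ./ sum(exp.(x))

f = [0.3, -1.2]                           # C - 1 inputs
p = softmax(vcat(f, 0.0))                 # point on the C - 1 dimensional simplex

# Inverse mapping: log odds against the reference class recover the inputs.
f_back = log.(p[1:end-1]) .- log(p[end])
f_back ≈ f                                # true
```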

rossviljoen commented 2 years ago

What do you think then - change it or leave as is?

theogf commented 2 years ago

I have been dealing with categorical likelihoods again recently, and I think both are equally valid (interestingly, having C inputs adds an unnecessary degree of freedom, and I am not sure what the effect on inference is). I also want to add that the choice of the appended input (0) should be configurable by the user.

I will make a PR to allow for all these options, maybe I can find an elegant formulation.

Related to this is #58

theogf commented 2 years ago

Solved by #61, I believe.

theogf commented 2 years ago

@devmotion Could you comment on the exchangeability of the classes when using the C-1 inputs? Would it still be valid?

devmotion commented 2 years ago

I'm not sure, what exactly do you mean?

theogf commented 2 years ago

In the C-inputs, C-classes case I can interchange any two classes by interchanging their inputs, right? But is that also true for C - 1 inputs? In other words, is the simplex invariant under permutations?

devmotion commented 2 years ago

If you interchange two of the C-1 inputs, then the probabilities of the corresponding two classes are interchanged as well. And if you want to interchange some class with the reference class, you can either change the reference class, or replace that class's input by its additive inverse and subtract the original value from all other inputs. Is that what you're after?

E.g., if C = 3, then with C - 1 = 2 inputs the vector of class probabilities is computed as `softmax([input1, input2, 0])` (by our convention the last class is the reference class). So if you swap `input1` and `input2`, the probabilities of the first and second class are swapped. If you want to swap e.g. the first and the third class, you could simply use the first class as the reference class instead of the third one. Alternatively, since softmax is shift-invariant we have `softmax([0, input2, input1]) = softmax([-input1, input2 - input1, 0])`, i.e., you can multiply the first input by -1 and subtract it from all other inputs to interchange the probabilities of classes 1 and 3, without changing the reference class or the other class probabilities.
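
A quick numerical check of both claims (a plain-Julia sketch with an inline `softmax`, using the C = 3 example above):

```julia
softmax(x) = exp.(x) ./ sum(exp.(x))

input1, input2 = 0.7, -0.4

# Swapping the two inputs swaps the probabilities of classes 1 and 2;
# the reference class (class 3) is unaffected.
softmax([input1, input2, 0.0])
softmax([input2, input1, 0.0])

# Shift invariance: subtracting input1 from every entry leaves the
# probabilities unchanged, so classes 1 and 3 can be interchanged
# without moving the reference class.
softmax([0.0, input2, input1]) ≈ softmax([-input1, input2 - input1, 0.0])  # true
```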

theogf commented 2 years ago

Thanks, that is really insightful. My PI had doubts about this version and raised the exchangeability question, but I could not find proper arguments.

Interestingly, I ran a few experiments with my logistic-softmax link. On a simple 1-D example I generated data with C-1 inputs and fitted it with both C-1 and C GPs. The C-1 parametrisation consistently gives a better estimate of the true categorical probabilities, but the log-likelihood is worse than with C inputs!

devmotion commented 2 years ago

The C-1 parametrisation is common in multinomial logistic regression (and, of course, logistic regression): https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_set_of_independent_binary_regressions. With C-1 inputs one also has the nice interpretation of the inputs as log odds, which is lost in the case of C inputs.

theogf commented 2 years ago

Sure! I think he had in mind processes where the order matters, like the stick-breaking process https://en.wikipedia.org/wiki/Dirichlet_process#The_stick-breaking_process, but probably got confused.