chengsoonong / mclass-sky

Multiclass methods for astronomical data
BSD 3-Clause "New" or "Revised" License

use feature uncertainty #167

Open chengsoonong opened 7 years ago

chengsoonong commented 7 years ago

From Chris Wolf:

Assuming error-free models for K classes, each comprising N_k members with M features x_{i,j,k} where i in [1, N_k], j in [1, M] and k in [1, K], and an individual query object with features y_j and feature-value errors s_j:

$$p_k = \sum_{i=1}^{N_k} \exp\!\left( -\frac{1}{2} \sum_{j=1}^{M} \frac{(y_j - x_{i,j,k})^2}{s_j^2} \right)$$
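A minimal NumPy sketch of this score, assuming each class is stored as an (N_k, M) array of member features (the function name and array layout are illustrative, not from the codebase):

```python
import numpy as np

def class_score(y, s, X_k):
    """Unnormalised score p_k for one class.

    y   : (M,) query object features y_j
    s   : (M,) query feature errors s_j
    X_k : (N_k, M) member features x_{i,j,k} of class k
    """
    # Per-member chi-squared: sum over features of (y_j - x_{i,j,k})^2 / s_j^2
    chi2 = np.sum((y - X_k) ** 2 / s ** 2, axis=1)
    # Sum of Gaussian kernels over the class members
    return np.sum(np.exp(-0.5 * chi2))
```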

There could be a switch to normalise the p_k or not, by dividing them by N_k. You would normalise if the number of members bore no relation to the relative prior probability of the class and was just a matter of representing it through a richer or coarser grid of members. If you normalise, you may provide priors P_k separately that could represent a known relative richness of the class.
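With both options enabled, the score would become (my reading of the proposal):

$$p_k \;\rightarrow\; P_k \, \frac{p_k}{N_k}$$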

In fact, you could supply with the class of N_k members an array of priors P_{i,k} that allows you to weight each discrete member of a class in different ways. This is sometimes needed when the class is represented in a non-random way (e.g. by providing better resolution of discrete members in some parts of the covered feature space than in others; often this weight is known and should be taken explicitly into account in an estimation/classification problem). It is then a matter of good/intuitive software design whether you normalise, p_k,new = p_k / N_k, or not. This can be done in different ways (let the user set the weights, or… we need to choose an intuitive default behaviour in the case where no weights are provided).
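A hedged extension of the sketch above with optional per-member weights P_{i,k}; the uniform default 1/N_k reproduces the p_k / N_k normalisation (names are illustrative):

```python
import numpy as np

def weighted_class_score(y, s, X_k, w_k=None):
    """Score p_k with optional per-member weights P_{i,k}.

    w_k : (N_k,) weights; defaults to uniform 1/N_k, which is
          equivalent to normalising p_k by N_k.
    """
    chi2 = np.sum((y - X_k) ** 2 / s ** 2, axis=1)
    kernels = np.exp(-0.5 * chi2)
    if w_k is None:
        # Intuitive default when no weights are provided: uniform weighting
        w_k = np.full(len(X_k), 1.0 / len(X_k))
    return np.sum(w_k * kernels)
```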

The K different p_k should be normalised such that

$$\sum_{k=1}^{K} p_k = 1$$

(This assumes intrinsically that the provided classes form a complete set for the physics of the problem, i.e. there are no unknowns.) Real unknowns can be dealt with in another way. Some known unknowns, i.e. objects with known features but unknown labels that have escaped the labelling process in a reproducible way, can be handled by designating them a separate “unlabelled class”; this class then carries the fraction of the total p that corresponds to the residual random risk that an object belongs to either one or none of the labelled classes.
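Normalising the raw per-class scores into probabilities is then straightforward; a minimal sketch (an “unlabelled class” would simply enter as one more entry in `scores`):

```python
import numpy as np

def posteriors(scores):
    """Normalise raw per-class scores p_k so that they sum to 1."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.sum()
```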

The problem of classes whose members have measurement errors on the features as well (so not just the query objects have measurement errors on their features) can be dealt with by reducing the measured errors of the query objects to

$$s_{i,j,k}^2 = s_j^2 - sx_{i,j,k}^2$$

but that is an extension we can talk about later. If the class has homogeneous errors, i.e. sx_{j,k} = f(j, k) but not a function of i, then it is easy to implement on the user side by adjusting the errors on the query objects.
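For the homogeneous case the user-side adjustment is a small preprocessing step; a sketch (assuming s_j^2 > sx_{j,k}^2 everywhere, which this deconvolution requires):

```python
import numpy as np

def adjust_query_errors(s, sx_k):
    """Deconvolve homogeneous member errors sx_{j,k} from query errors s_j.

    s    : (M,) query feature errors
    sx_k : (M,) per-feature member errors of class k (same for all members i)
    """
    var = s ** 2 - sx_k ** 2
    if np.any(var <= 0):
        raise ValueError("member errors exceed query errors; "
                         "deconvolution is undefined")
    return np.sqrt(var)
```

The adjusted errors are then passed to the scoring functions above in place of `s`.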

nbgl commented 7 years ago

I’m not going to have time to do this, but it might eventually be worth it to compare methods.