chengsoonong / crowdastro

Cross-identification of radio objects and host galaxies by applying machine learning on crowdsourced training labels.
MIT License

Implement Yan et al. (2011) #82

Closed: MatthewJA closed this issue 8 years ago

MatthewJA commented 8 years ago

Yan et al. (2010) is the passive-learning, multiple-annotator model, and I've implemented it in 43b1d281976cccfe19700abc5c146c1262edfc96. I still need to generalise it a little so it works on arbitrary data sets, write a prediction function and an evaluation function, and test that it actually works.

(Update: It doesn't work)

MatthewJA commented 8 years ago

Still working on this: the model only ever outputs zero predictions, so something is wrong with either the update equations or my code. I found an updated version of the update equations (doi:10.1007/s10994-013-5412-1) which fixes a few mistakes, so I'll try those. I might also try a different optimisation library; last time I got all zeros out of an optimisation problem, I was hitting a bug in scipy.

MatthewJA commented 8 years ago

I have Yan et al. (2010) working, with some, uh, interesting quirks.

[screenshot: model output, 2016-07-23]

(The quirks are that sometimes the output is negative or zero, and the scipy optimisation sometimes doesn't converge due to p(y | x, z) -> 0. This is really weird.)
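
One guard worth trying for the p(y | x, z) -> 0 problem is clipping probabilities before taking logs in the objective, so the optimiser never sees -inf. A minimal sketch — safe_log and the epsilon value are mine, not from the paper or the repo:

```python
import numpy as np

EPS = 1e-10  # floor is a guess, not a value from the paper


def safe_log(p):
    # Clip probabilities away from exactly 0 and 1 so the EM objective
    # never evaluates log(0); otherwise the scipy optimiser sees -inf or
    # nan gradients and reports non-convergence.
    return np.log(np.clip(np.asarray(p, dtype=float), EPS, 1 - EPS))
```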

MatthewJA commented 8 years ago

Okay, so. It seems that this is massively sensitive to initial conditions. This is problematic because Yan et al. don't really address initialisation; as far as I can tell they set the parameters to zero, but in general that's a (not very good) local minimum. A poor choice of initial parameters will cause the EM algorithm to diverge. I'm considering emailing the authors to see if they have a reference implementation, so I can see how they handle this.

If annotators are right more than half the time, then a good way to initialise seems to be to take the majority vote and fit a logistic regression model σ(αx + β). Then set w = 0 and γ = 0 for all annotators (i.e. no noise). This still diverges sometimes but seems much better than any other initialisation I've experimented with.
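
Roughly, the initialisation I mean looks like this (a sketch; the function and variable names are illustrative, not what's in the repo):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def initialise_from_majority_vote(X, Y):
    # X: (n_samples, n_features) design matrix.
    # Y: (n_annotators, n_samples) binary labels.
    # Fit sigma(alpha . x + beta) to the majority vote, and start every
    # annotator with w = 0, gamma = 0 (i.e. no noise).
    majority = (Y.mean(axis=0) > 0.5).astype(int)
    lr = LogisticRegression().fit(X, majority)
    alpha = lr.coef_.ravel()
    beta = lr.intercept_[0]
    w = np.zeros((Y.shape[0], X.shape[1]))
    gamma = np.zeros(Y.shape[0])
    return alpha, beta, w, gamma
```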

Another possibility might be to use an optimisation algorithm other than LBFGS. I tried basinhopping, but it was really slow. Depending on time constraints this week, I may try something from the bio-inspired computing course, since it covered some broadly applicable global optimisation algorithms.
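
For reference, basinhopping is a global wrapper around a local optimiser; this is the shape of what I tried (the objective here is just a stand-in for the real M-step objective):

```python
import numpy as np
from scipy.optimize import basinhopping


def objective(params):
    # Stand-in for the negative expected complete-data log-likelihood.
    return np.sum(params ** 2)


result = basinhopping(
    objective,
    x0=np.zeros(10),
    niter=50,  # each hop runs a full local minimisation -- the slow part
    minimizer_kwargs={'method': 'L-BFGS-B'})
print(result.x, result.fun)
```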

[image: four simulated annotators]

In the above, annotator 1 is not very good at this (or is actively malicious!). Annotator 2 is wrong 10% of the time. Annotator 3 is good for x > 0 and completely random for all other points. Annotator 4 is wrong 30% of the time.
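
For reproducibility, the simulated annotators look roughly like this (a sketch: the ground-truth rule and annotator 1's exact error rate are placeholders, not necessarily what produced the plot):

```python
import numpy as np

rng = np.random.RandomState(0)
n = 200
x = rng.uniform(-1, 1, size=n)
z = (x > 0).astype(int)  # placeholder ground-truth rule


def flip(labels, p):
    # Flip each binary label independently with probability p.
    mask = rng.uniform(size=labels.shape) < p
    return np.where(mask, 1 - labels, labels)


y1 = flip(z, 0.9)                              # annotator 1: bad/adversarial (rate assumed)
y2 = flip(z, 0.1)                              # annotator 2: wrong 10% of the time
y3 = np.where(x > 0, z, rng.randint(0, 2, n))  # annotator 3: random for x <= 0
y4 = flip(z, 0.3)                              # annotator 4: wrong 30% of the time
Y = np.stack([y1, y2, y3, y4])                 # (4, n) label matrix
```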

MatthewJA commented 8 years ago

[image]

Increasing the dimensionality reduces the effectiveness of logistic regression for finding a starting position, presumably by increasing the number of local minima.

chengsoonong commented 8 years ago

It is difficult to find a global minimum for EM in general, so I suggest you raise a new issue about doing EM better, and move on with this. A quick hack is to initialise 10 times randomly, and use the best result. LBFGS is pretty good already, and you should put all your ideas for optimising better into the issue description.
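
Something like this sketch (fit_em stands in for your current EM entry point; the names are illustrative):

```python
import numpy as np


def fit_with_restarts(fit_em, X, Y, n_params, n_restarts=10, seed=0):
    # fit_em(X, Y, init) -> (params, log_likelihood) is a hypothetical
    # handle on the existing EM fit; keep whichever restart scores best.
    rng = np.random.RandomState(seed)
    best_params, best_ll = None, -np.inf
    for _ in range(n_restarts):
        init = rng.normal(scale=0.1, size=n_params)
        params, ll = fit_em(X, Y, init)
        if ll > best_ll:
            best_params, best_ll = params, ll
    return best_params, best_ll
```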

MatthewJA commented 8 years ago

Okay, sounds good. I'll make the modification that takes 2010 -> 2011, tidy up the code, try it on RGZ, and then move on.

MatthewJA commented 8 years ago

See #122 for some discussion of convergence. I'm now working on 2010 -> 2011.

MatthewJA commented 8 years ago

I've implemented the partially-observed y case. I think it works...? I'll now move on to the "active" part (querying for more labels).

[image]
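
If "partially observed" here means each annotator labels only a subset of items (the situation in Radio Galaxy Zoo), the usual treatment is to drop missing labels from the likelihood sum. A minimal sketch with hypothetical names:

```python
import numpy as np


def masked_annotator_loglik(log_p_y, Y):
    # Y: (n_annotators, n_samples) labels, with np.nan where annotator t
    # never labelled item i. log_p_y: matching per-label log-probabilities
    # under the current parameters. Unobserved labels contribute nothing.
    observed = ~np.isnan(Y)
    return np.where(observed, log_p_y, 0.0).sum()
```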

MatthewJA commented 8 years ago

Currently trying to speed up the code so it scales to ATLAS. I've mostly written some regression tests so far, just to make sure I don't break anything.

MatthewJA commented 8 years ago

While speeding up the code (which all works great now, incidentally) I've realised that the optimisation problem may be a little too hard, since the dimensionality scales as O(features × annotators) and there are 1193 annotators. I'm going to look at the paper to figure out whether it's possible to make this work for large values of T.

Possibly relevant paper with sparse models.

> This model is similar to the one specified in ref. 6 [Yan et al. 2010] with the exception that the γj coefficients do not depend on the expert. [...] One implicit assumption is that the influence of each feature is the same for all experts.
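
Back-of-the-envelope on why that matters here: with a per-annotator w_t, the annotator-noise block alone costs features × annotators parameters, while sharing the feature coefficients across annotators (as in the quoted model, plus an assumed per-annotator offset) collapses it:

```python
D, T = 10, 1193  # D (features) is illustrative; T is the RGZ annotator count

per_annotator = D * T  # a separate w_t in R^D for every annotator
shared = D + T         # one shared coefficient vector + per-annotator offset (assumed)
print(per_annotator, shared)  # 11930 vs 1203
```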

MatthewJA commented 8 years ago

Implemented in 2f54a42dddb6074fdbd14679933bfd1ccf6452c8 and run in a notebook in df21a33. Seems to work better without balancing classes.

Good enough for now and I'm out of ideas on how to make it any better. I need to think about whether it's tractable to run the algorithm with large T and sparse labels without the assumption that η_t = η for all t. I'd like to talk about that on Monday but I'll also think about it over the weekend.

MatthewJA commented 8 years ago

Here are two ideas I had.

Idea 1

Assume that the w_t actually lie in an F-dimensional subspace of R^D with F < D. Then there should be some matrix W that maps w'_t in R^F to w_t in R^D, the latter of which is compatible with the Yan model. Instead of finding all the w_t in R^D (D × T parameters), we need only find the lower-dimensional w'_t in R^F and the matrix W (F × (T + D) parameters). This is fewer parameters provided that F < (T × D)/(D + T).

I'm not sure if that assumption is even close to true, though, and I'm also not sure how to check it.
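
As a sanity check on the arithmetic (all names and sizes illustrative):

```python
import numpy as np

D, F, T = 50, 5, 1193  # illustrative feature, subspace, annotator counts

rng = np.random.RandomState(0)
W = rng.normal(size=(D, F))      # shared basis: D * F parameters
w_low = rng.normal(size=(F, T))  # per-annotator coordinates: F * T parameters
w_full = W @ w_low               # the (D, T) weights the Yan model actually uses

# Saving holds exactly when F < (T * D) / (D + T).
assert F * (T + D) < D * T
```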

Idea 2

Assume that people tend to classify similarly to other people. Somehow cluster the labellers, then fit just one weights vector w_t per cluster t. T is now the number of clusters, which is a large reduction in problem size when there are far more labellers than clusters.

I'm also not sure if that assumption is true, nor how to cluster labellers.
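
One cheap way to try the clustering, assuming labels are stored densely with NaNs for unseen items (entirely a sketch; an agreement-based distance might be more sensible):

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_labellers(Y, n_clusters=10, seed=0):
    # Y: (n_annotators, n_samples) 0/1 labels with np.nan where a labeller
    # never saw an item; impute missing entries with 0.5 and cluster the
    # label vectors. Returns one cluster index per labeller.
    Y_imputed = np.where(np.isnan(Y), 0.5, Y)
    km = KMeans(n_clusters=n_clusters, random_state=seed)
    return km.fit_predict(Y_imputed)
```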