chengsoonong / mclass-sky

Multiclass methods for astronomical data
BSD 3-Clause "New" or "Revised" License

Investigate differences between features #145

Closed chengsoonong closed 7 years ago

chengsoonong commented 7 years ago

Create new features that are pairwise differences between existing features.

OR prove that the same predictor can be found using the original features.

chengsoonong commented 7 years ago

I like your proof in b66878d. Could you humour me and also do the empirical experiment?

SDSS has u, g, r, i, z features. Compute an additional 4 features, u-g, g-r, r-i, and i-z, for both PSF and Petrosian values.
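For illustration, the difference features described above could be built along these lines. This is only a sketch: the function name `add_adjacent_differences`, the column order, and the magnitude values are made up for the example and are not taken from the repository.

```python
import numpy as np

def add_adjacent_differences(mags):
    """Append u-g, g-r, r-i, i-z to an (n_samples, 5) array.

    Assumes the columns are ordered u, g, r, i, z (an assumption of this
    sketch, not something the repository guarantees).
    """
    mags = np.asarray(mags, dtype=float)
    # Each column minus its right-hand neighbour gives the four differences.
    diffs = mags[:, :-1] - mags[:, 1:]
    return np.hstack([mags, diffs])

# Hypothetical PSF magnitudes for two objects (illustrative values only).
psf = np.array([[19.5, 18.2, 17.6, 17.3, 17.1],
                [21.0, 20.1, 19.8, 19.7, 19.6]])
psf_with_diffs = add_adjacent_differences(psf)
```

The same function would be applied separately to the Petrosian magnitudes, giving 9 columns per magnitude type.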

Based on your proof, a linear regressor with these features will give exactly the same performance as a linear regressor with the original u,g,r,i,z features.
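A quick numerical check of that claim, separate from the proof in b66878d, might look like the sketch below. It uses synthetic data and an exact least-squares solve (`np.linalg.lstsq`) rather than SGD; the array sizes, seed, and helper name `fit_predict` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-ins for u, g, r, i, z and for the regression target.
X = rng.normal(loc=20.0, scale=1.0, size=(n, 5))
y = rng.normal(size=n)

# Augment with the four adjacent differences u-g, g-r, r-i, i-z.
X_aug = np.hstack([X, X[:, :-1] - X[:, 1:]])

def fit_predict(features, target):
    """Unregularised least squares with an intercept; returns fitted values."""
    A = np.hstack([features, np.ones((len(features), 1))])
    # lstsq copes with the rank-deficient augmented matrix (minimum-norm solution).
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    return A @ coef

pred_orig = fit_predict(X, y)
pred_aug = fit_predict(X_aug, y)

# The difference columns are linear combinations of the original ones, so the
# column space (and hence the least-squares fit) is unchanged.
max_gap = np.max(np.abs(pred_orig - pred_aug))
rank_orig = np.linalg.matrix_rank(np.hstack([X, np.ones((n, 1))]))
rank_aug = np.linalg.matrix_rank(np.hstack([X_aug, np.ones((n, 1))]))
```

Here `max_gap` should be at floating-point noise level and both ranks should be equal. The equivalence is exact only for an unregularised least-squares fit; with a penalty term, or with an iterative solver such as SGD that has not fully converged, the two fits need not coincide.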

nbgl commented 7 years ago

This is done in 2891244. SGD regression with linear features and no differences yields an R^2 of -5.46e25, whereas SGD regression with linear features and with differences gives an R^2 of -3.57e25. While this may look like an improvement, both values are awful. This is because linear features don’t work at all for this problem.

Adding the differences as inputs to SGD with nonlinear features lowers the R^2 from 0.570 to 0.546, which is well within the margin of error.

nbgl commented 7 years ago

On SGD regression with nonlinear features and 1000 training and testing points each, the R^2 value without differences is 0.842, whereas the R^2 value with differences is 0.850.

On 1M training points and 500K testing points, the R^2 without differences is 0.901, whereas with differences it is 0.907.