chengsoonong / crowdastro

Cross-identification of radio objects and host galaxies by applying machine learning on crowdsourced training labels.
MIT License

Feature extraction and active learning #99

Closed MatthewJA closed 8 years ago

MatthewJA commented 8 years ago

From messing around with the feature extraction step of the pipeline, I've found that the CNN training massively affects the final accuracy. This raises two points:

MatthewJA commented 8 years ago

One other idea could be to train the CNN in an unsupervised way, e.g. a CNN autoencoder. This would allow us to train on all the training data without biasing the features.
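A convolutional autoencoder along these lines might look like the following Keras sketch. This is purely illustrative: the patch size (32×32) and layer widths are assumptions, not the project's actual architecture.

```python
# Sketch of a convolutional autoencoder (hypothetical layer sizes).
# Trained unsupervised: the target is the input patch itself.
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from keras.models import Model

inp = Input(shape=(32, 32, 1))  # one-channel radio patch (assumed size)

# Encoder: two conv + pool stages, compressing 32x32x1 down to 8x8x8.
x = Conv2D(16, (3, 3), activation='relu', padding='same')(inp)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((2, 2), padding='same')(x)

# Decoder: mirror of the encoder, upsampling back to the input size.
x = Conv2D(8, (3, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
decoded = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
# autoencoder.fit(patches, patches, ...)  # reconstruct, no labels needed
```

The features would then come from the `encoded` layer, so no labels are consumed in training them.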

chengsoonong commented 8 years ago

I suggest training the CNN on ALL the data for now. Document this peeking in your report.

If there is time later in the project, we can consider the following (in order):

MatthewJA commented 8 years ago

Sounds good. Warm start CNN sounds like it could be a really good approach to take.

The CNN autoencoder you linked looks straightforward, too. I'll add this to milestone C to reconsider then.

MatthewJA commented 8 years ago

Let's revisit this some time, possibly tomorrow?

MatthewJA commented 8 years ago

Radio patches (left) and convolutional autoencoder reconstructions of the patches (right).


chengsoonong commented 8 years ago

The reconstruction is a smidgen too smooth, but for our purposes it looks great.

MatthewJA commented 8 years ago

Great! I'll rerun it a few times to try to nail down a decent network topology. I'd prefer fewer features than this provides, so I'll probably add another convolutional layer and maybe a dense layer.

MatthewJA commented 8 years ago

I think my boundary conditions break with more convolutional layers, so I'm going to see if I can find another implementation and use my newfound convolutional autoencoder knowledge to get it working on the data.
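For what it's worth, one common way boundary handling bites when stacking layers: with `'valid'` borders every convolution shrinks the patch, so a deep stack can shrink it past zero. A quick sanity check (patch and kernel sizes here are illustrative):

```python
def conv_output_size(n, kernel, stride=1, padding='valid'):
    """Spatial size after one conv layer (square inputs and kernels)."""
    if padding == 'same':
        return -(-n // stride)           # ceil(n / stride), size preserved at stride 1
    return (n - kernel) // stride + 1    # 'valid': no zero padding at the borders

# Stacking 5x5 'valid' convolutions on a 32-pixel patch:
n, sizes = 32, []
for _ in range(4):
    n = conv_output_size(n, 5)
    sizes.append(n)
print(sizes)  # [28, 24, 20, 16] -- each layer eats 4 pixels of border
```

With `'same'` padding the spatial size is preserved, which sidesteps the shrinkage entirely.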

chengsoonong commented 8 years ago

Before you go down the route of finding features, visualise the IR and radio images of the positive examples that are classified negative by your predictor. 5-10 image patches from:

And for comparison, look at 5-10 patches where the score is >5.

At the same time, show the flux values (all other non-image features).
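Selecting those patches is a one-liner once you have labels and scores side by side. A minimal sketch with made-up arrays (`y` for true labels, `p` for predicted probabilities, threshold 0.5 assumed):

```python
import numpy as np

# Hypothetical data: y is the true label for each candidate host,
# p is the predictor's probability of being the host.
y = np.array([1, 0, 1, 1, 0, 1])
p = np.array([0.9, 0.2, 0.1, 0.4, 0.8, 0.95])

# Positive examples the predictor scores as negative (false negatives).
false_neg = np.flatnonzero((y == 1) & (p < 0.5))
print(false_neg)  # [2 3] -- indices to pull IR/radio patches and fluxes for
```

The same mask can then index the image arrays and the non-image feature columns for plotting.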

MatthewJA commented 8 years ago

Alright, I'll get that done. #140

MatthewJA commented 8 years ago

If you train logistic regression on the expert labels (100% accurate), you recover 85% balanced accuracy. If you train it on the crowd majority labels (85% accurate), you also recover 85% balanced accuracy. This seems interesting! Maybe we're hitting a ceiling.
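For reference, balanced accuracy is just the mean of per-class recall, so it isn't inflated by the class imbalance in the host-galaxy candidates. A small self-contained check:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; insensitive to class imbalance."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# With 90/10 imbalance, predicting all-negative scores 90% plain
# accuracy but only 50% balanced accuracy:
y_true = [0] * 90 + [1] * 10
print(balanced_accuracy(y_true, [0] * 100))  # 0.5
```

So the shared 85% figure reflects genuine per-class performance, which is what makes the expert-vs-crowd match suggestive of a ceiling.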

I wonder if nonlinear and/or convolutional features would help.