SimonFisher92 / Scottish_Snow


Optimise scikit-learn classifier #34

Open murraycutforth opened 9 months ago

murraycutforth commented 9 months ago

I think our simple per-pixel classifier, trained automatically on SCL data, looks promising enough to be worth trying to optimise a bit further.

Some ideas to look into:

- Balanced sampling between snow, cloud, and land pixels
- Investigate which features are meaningful (Ian's experiment of just giving date, elevation, and slope aspect data would be interesting!) (Also try giving additional derived features, such as normalised difference snow index (NDSI))
- Automatically throw out images with grossly incorrect SCL masks, such as this one:

An_Riabhachan_2018-08-16

Any ideas? The NDSI is high, but it's uniformly high rather than "spotty", as it is in a correct example like the following. Maybe we could threshold on some kind of image-uniformity metric to find these problem images. Ideally, we could even correct the label from snow to cloud, giving our classifier the chance to outperform the SCL labels.

Coire_Domhain_2020-05-27
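A minimal sketch of the uniformity idea above, assuming the green and SWIR bands and the SCL snow mask are available as NumPy arrays (the `std_threshold` value is illustrative and would need tuning):

```python
import numpy as np

def ndsi(green, swir):
    """Normalised Difference Snow Index: (green - swir) / (green + swir)."""
    return (green - swir) / (green + swir + 1e-9)  # epsilon avoids divide-by-zero

def is_suspiciously_uniform(green, swir, snow_mask, std_threshold=0.05):
    """Flag an image whose snow-labelled pixels have near-uniform NDSI.

    Uniformly high NDSI across the whole snow mask may indicate cloud
    mislabelled as snow; genuine snow tends to be more 'spotty'.
    """
    values = ndsi(green, swir)[snow_mask]
    if values.size == 0:
        return False
    return bool(np.std(values) < std_threshold)
```

Flagged images could then be held out, or (more ambitiously) have their snow labels flipped to cloud before training.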

ipoole commented 9 months ago

Thanks for your thoughts Murray. I'm rather lukewarm on putting further effort into improving the reverse engineering by ML of the SCL labels (amusing though that would be) without independent 'gold' GT to validate it against. Simon has commented that in his opinion some of the classifier's 'errors' against SCL are in fact correct. So what do we gain by creating a classifier which mirrors the SCL labelling to, say, 99.9%, SCL errors and all? I take the point about improved 10m resolution, but without 10m GT how do we know we are getting the 'interpolation' right? Am I missing something? Is it that the SCL labels are not universally available for locations we wish to analyse?

The GT collection task, via SageMaker, is currently nominally assigned to me, and I'm embarrassed not to have progressed it much. (Building a numeric simulation of a fermion field via Dirac is more fun than wrestling with AWS ;-). Do we agree that GT collection is in fact important? If so, I'd appreciate some help through discussion on how to proceed. Detailed discussion should happen on an appropriate thread. Hmm, I now realise there is no issue for the GT/SageMaker work - I'll correct that shortly.

Are we due a call to discuss such matters?

ipoole commented 9 months ago

Addressing some of Murray's points more directly (sorry for rather ignoring these):

> Balanced sampling between snow, cloud, and land pixels

Yes this is important, particularly since the choice of proportions in the test set will influence apparent performance, regardless of the ML model used.
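One way to sketch that balancing step, assuming per-pixel feature rows `X` and integer class labels `y` (the snow/cloud/land encoding is up to us): downsample every class to the size of the rarest one, so class proportions cannot flatter any particular model.

```python
import numpy as np

def balance_classes(X, y, rng=None):
    """Downsample every class to the size of the rarest class.

    Returns a shuffled (X, y) subset with equal counts per class, so that
    apparent performance is not driven by the choice of class proportions.
    """
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]
```

The same helper could be applied separately to the train and test splits, so the test set's balance is fixed independently of how the training pixels were drawn.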

> Investigate which features are meaningful (Ian's experiment of just giving date, elevation, and slope aspect data would be interesting!) (Also try giving additional derived features, such as normalised difference snow index (NDSI)).

Sorry Murray, please explain NDSI.

> Automatically throw out images with grossly incorrect SCL masks....

That does seem important! It seems we need a simple tool to a) rank patches by likelihood of such gross error, and b) display them appropriately for review.
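The ranking half of such a tool is small enough to sketch, assuming each patch record can be scored by some scalar "gross-error likelihood" (the scoring function itself — e.g. the NDSI-uniformity idea discussed earlier — is deliberately left as a plug-in):

```python
def rank_patches_for_review(patches, score_fn, top_k=20):
    """Return the top_k patches most likely to have a grossly wrong SCL mask.

    `patches` is any sequence of patch records; `score_fn` maps a patch to a
    scalar where higher means more suspicious. The returned list would then
    be fed to a display loop (e.g. matplotlib) for manual review.
    """
    return sorted(patches, key=score_fn, reverse=True)[:top_k]
```

Keeping the score function separate means the review tool doesn't change as we iterate on what counts as "suspicious".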