IQuOD / AutoQC

A testing suite for automatic quality control checks of subsurface ocean temperature observations
MIT License

Machine learning strategy (was 'Combinatorics intelligence') #60

Open bkatiemills opened 9 years ago

bkatiemills commented 9 years ago

The brute-force combinatorics examinations implemented in #38 are acceptable for small numbers of tests, but compute time grows exponentially as the number of tests increases. We need a stronger strategy.
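To make the scaling concrete, here is a rough count of the brute-force search space (illustrative only; the exact combinations enumerated in #38 may differ):

```python
# Rough size of the brute-force search space over n boolean test results.
# (Illustrative sketch; the exact space searched in #38 may differ.)
for n in (3, 5, 10):
    subsets = 2 ** n            # ways to pick a subset of tests to combine
    functions = 2 ** (2 ** n)   # distinct boolean combination rules
    print(f"n={n}: {subsets} subsets, ~10^{len(str(functions)) - 1} boolean functions")
```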

The numbers reported in #59 make the individual tests seem much too permissive on their own; one idea could be to look for tests that flag disjoint sets of profiles, and OR them all together.
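A minimal sketch of that idea, assuming each test yields one boolean flag per profile (the array below is made-up data, not real AutoQC output):

```python
import numpy as np

# Hypothetical per-profile results: rows are profiles, columns are tests.
flags = np.array([
    [True,  False, False],   # flagged only by test 0
    [False, True,  False],   # flagged only by test 1
    [False, False, False],   # flagged by nothing
    [False, False, True],    # flagged only by test 2
])

def disjoint(a, b):
    """True if tests a and b never flag the same profile."""
    return not np.any(flags[:, a] & flags[:, b])

# OR a chosen set of disjoint tests into one combined decision per profile.
combined = np.any(flags[:, [0, 1, 2]], axis=1)

print(disjoint(0, 1))   # True: tests 0 and 1 flag disjoint profiles
print(combined)         # [ True  True False  True]
```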

bkatiemills commented 9 years ago

After looking into machine learning strategies in greater depth, this paper by Sadohara suggests that a support vector machine (SVM) might be the best choice for this problem. Sadohara's paper examines learning boolean functions, which is exactly our situation: whether a profile passed or failed each individual test forms the boolean inputs, and whether the profile should ultimately pass or fail is the boolean output. The approach works by considering all possible conjunctions of the boolean inputs, which is what we were naively doing in #38, albeit in a much cruder fashion. Sadohara presents two SVM implementations that:

Furthermore, scikit-learn supports SVM with custom kernels, meaning this ought to be not very difficult to implement.
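As a concrete sketch (not tied to AutoQC's actual code), scikit-learn's `SVC` accepts a callable kernel, so something in the spirit of Sadohara's DNF kernel can be plugged in directly. Here I take the kernel to count agreeing boolean coordinates, K(x, y) = 2**agree(x, y) - 1; check the paper for the exact form before relying on this:

```python
import numpy as np
from sklearn.svm import SVC

def dnf_kernel(X, Y):
    """Kernel in the spirit of Sadohara's DNF kernel: counts conjunctions of
    literals/negations satisfied by both boolean vectors, K = 2**agree - 1.
    (Sketch; verify against the paper before use.)"""
    agree = X @ Y.T + (1 - X) @ (1 - Y).T   # coordinates where x_i == y_i
    return 2.0 ** agree - 1.0

# Toy boolean target: a profile fails overall iff both tests flag it (AND).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

clf = SVC(kernel=dnf_kernel, C=100.0).fit(X, y)
print(clf.predict(X))   # expected to recover the AND function on the training set
```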

bkatiemills commented 8 years ago

cc @s-good @BecCowley @BoyerWOD

#101 presents a detailed report on a first pass at using machine learning techniques to perform final classification. Executive summary:

These results are preliminary and much more remains to be done to optimize them; comments and suggestions very welcome.

bkatiemills commented 8 years ago

One factor that will produce systematic biases in any final decision strategy is the makeup of flagged profiles in the training dataset. For example, the machine learning techniques explored above offer only a small improvement over using EN_background alone; if the Quota sample overrepresents profiles flagged by one test (like EN_background), any strategy trained on it will inherit that bias.

Put another way, absolutely no profile in Quota fails the impossible_location test; we therefore can't learn anything about that test's performance from this dataset.
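A quick audit along these lines would count, for each test, how many training profiles it flags, and single out tests that never fire and so can't be evaluated (the data and counts below are hypothetical, not the real Quota contents):

```python
import numpy as np

# Hypothetical training-set flags: rows are profiles, columns are tests.
test_names = ["EN_background", "impossible_location", "spike_check"]
flags = np.array([
    [True,  False, False],
    [True,  False, True],
    [False, False, False],
    [True,  False, False],
])

counts = flags.sum(axis=0)
for name, n in zip(test_names, counts):
    print(f"{name}: flags {n} of {len(flags)} training profiles")

# Tests that never fire in the training set tell us nothing about themselves.
uninformative = [name for name, n in zip(test_names, counts) if n == 0]
print(uninformative)   # ['impossible_location']
```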

Is there a way to know exactly why a given profile in the quota dataset has been marked as bad?

s-good commented 8 years ago

Hi Bill, the information about why a profile has been rejected is available, but probably not in the ASCII file that you have. There are some other datasets available that we can run on to try to uncover biases in the training dataset. It might also be useful to split up the Quota dataset and see whether different parts give similar answers. We should include all these things in the discussions at the upcoming workshop.