euroargodev / publicQCforum

A public forum to talk about Quality Control of Argo measurements
GNU General Public License v3.0

Status of Machine Learning for Argo QC #6

Open · gmaze opened this issue 4 years ago

gmaze commented 4 years ago

I'd like to open a discussion thread to get the status of developments with regard to the use of Machine Learning techniques in Argo QC procedures.

Different groups may have started to explore this possibility, and it would be constructive to gather the status of these efforts here, to avoid duplicated work and to exchange feedback.

This could include a description of:

- Target variables
- Features
- ML method
- Dataset used
- Overall performance or difficulties encountered

gmaze commented 4 years ago

At Ifremer/LOPS, we've tried the following:

Target variables:

Alarm status (True, False) of the ISAS13 test against climatology for one PSAL measurement

Features:

A "patch" of variables from the same profile as the target as well as from profiles before and after (+/- 2). Variables used: TEMP, PSAL, SIG0 and PRES.

ML method:

Random forest

Dataset used:

Argo snapshot from 2016/02 and ISAS team QC logs.
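For concreteness, here is a minimal sketch of this setup with scikit-learn. Everything is a stand-in: the data are synthetic, the patch is built per profile rather than per measurement, and the real feature extraction has to deal with missing levels, varying profile lengths, and the Argo file formats. Sizes and names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_profiles, n_levels = 1000, 50

# Synthetic stand-in for real profiles: TEMP, PSAL, SIG0, PRES on fixed levels.
profiles = rng.normal(size=(n_profiles, 4, n_levels))
alarms = rng.random(n_profiles) < 0.05  # ISAS13-like labels, heavily imbalanced

def build_patch(i, half_width=2):
    # Concatenate all four variables from profiles i-2 .. i+2 into one vector.
    return profiles[i - half_width : i + half_width + 1].ravel()

idx = np.arange(2, n_profiles - 2)
X = np.stack([build_patch(i) for i in idx])
y = alarms[idx]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```

Note that with labels this imbalanced, plain accuracy is misleading; see the difficulty reported below.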

Overall performance or difficulties encountered:

- The True/False alarms training set is highly imbalanced, simply because the ISAS13 test against climatology is not an effective test and raises too many false alarms.

gaelforget commented 4 years ago

Not sure if that helps, or if I totally understand, but would it make sense to use several climatology products, count the # of alarms (e.g. 0/6 vs 6/6), and set a threshold? I used to do something like that in the MITprof QC for ECCO (I was using the min of the cost functions, if I recall).
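Something like this, say. Everything below is illustrative: the reference values, the tolerances, and the 4-out-of-6 threshold are all made up.

```python
import numpy as np

def climatology_votes(value, clim_means, clim_stds, n_sigma=3.0):
    # Count how many climatologies place `value` further than n_sigma
    # standard deviations from their climatological mean.
    z = np.abs(value - np.asarray(clim_means)) / np.asarray(clim_stds)
    return int(np.sum(z > n_sigma))

# Example: one PSAL value tested against six reference products.
psal = 35.9
means = [35.1, 35.2, 35.15, 35.0, 35.3, 35.2]  # hypothetical climatological means
stds = [0.2, 0.25, 0.2, 0.3, 0.2, 0.25]        # hypothetical spreads

votes = climatology_votes(psal, means, stds)
flagged = votes >= 4  # threshold on the alarm count, e.g. 4 of 6
print(f"{votes}/6 alarms -> flagged={flagged}")
```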

gmaze commented 4 years ago

@gaelforget this is a good suggestion, and one we have started to experiment with as well: taking the final decision on the basis of several QC test outcomes. But the choice of the acceptable distance to the climatology is as important as the climatology value itself. One would need an "optimization" approach where, based on the historical dataset, we determine the best combination of distance/reference to detect bad data.

This, however, points to another problem: the distance beyond which a datum is declared "bad" is in practice dependent on the user application. This is particularly true for data assimilation, where the data need to be somehow compatible with the ocean simulated by the numerical model.

This finally led us to conclude that the best we could do is to compute a goodness probability for each measurement; it would then be up to the user to define a threshold.
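In scikit-learn terms, this amounts to exposing the classifier's predicted probability instead of its hard label. A minimal sketch on synthetic data (the feature dimensions, labels, and both thresholds are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
y = rng.random(2000) < 0.05  # True = alarm (bad data), heavily imbalanced

# class_weight="balanced" is one simple way to counter the imbalance.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
clf.fit(X[:1500], y[:1500])

# P(good) = probability of the "no alarm" class (classes_ are sorted: False, True).
proba_good = clf.predict_proba(X[1500:])[:, 0]

# Each application then chooses its own tolerance:
keep_for_assimilation = proba_good > 0.99  # strict: model compatibility matters
keep_for_climatology = proba_good > 0.80   # more permissive
print(keep_for_assimilation.sum(), keep_for_climatology.sum())
```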