euroargodev / publicQCforum

A public forum to talk about Quality Control of Argo measurements
GNU General Public License v3.0

Status of Machine Learning for Argo QC #6

Open · gmaze opened this issue 4 years ago

gmaze commented 4 years ago

I'd like to open a discussion thread to get the status of developments with regard to the use of Machine Learning techniques in Argo QC procedures.

Different groups may have started to explore this possibility, and it would be constructive to gather the status of these efforts here, to avoid duplicated work and to exchange feedback.

This could include a description of:

- Target variables
- Features
- ML method
- Dataset used
- Overall performance or difficulties encountered

gmaze commented 4 years ago

At Ifremer/LOPS, we've tried the following:

Target variables:

Alarm status (True, False) of the ISAS13 test against climatology for one PSAL measurement

Features:

A "patch" of variables from the same profile as the target as well as from profiles before and after (+/- 2). Variables used: TEMP, PSAL, SIG0 and PRES.

ML method:

Random forest

Dataset used:

Argo snapshot from 2016/02 and ISAS team QC logs.
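For concreteness, here is a minimal sketch of this setup with scikit-learn. Everything is a stand-in: the data are synthetic, the patch is built per profile rather than per measurement, and the real feature extraction has to deal with missing levels, varying profile lengths, and the Argo file formats. Sizes and names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_profiles, n_levels = 1000, 50

# Synthetic stand-in for real profiles: TEMP, PSAL, SIG0, PRES on fixed levels.
profiles = rng.normal(size=(n_profiles, 4, n_levels))
alarms = rng.random(n_profiles) < 0.05  # ISAS13-like labels, heavily imbalanced

def build_patch(i, half_width=2):
    # Concatenate all four variables from profiles i-2 .. i+2 into one vector.
    return profiles[i - half_width : i + half_width + 1].ravel()

idx = np.arange(2, n_profiles - 2)
X = np.stack([build_patch(i) for i in idx])
y = alarms[idx]

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=300, n_jobs=-1, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.3f}")
```

Note that with labels this imbalanced, plain accuracy is misleading; see the difficulty reported below.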

Overall performance or difficulties encountered:

- The True/False alarms training set is highly imbalanced, simply because the ISAS13 test against climatology is not an effective test and raises too many false alarms.

gaelforget commented 4 years ago

Not sure if that helps, or if I totally understand, but would it make sense to use several climatology products, count the # of alarms (e.g. 0/6 vs 6/6), and set a threshold? I used to do something like that in the MITprof QC for ECCO (I was using the min of the cost functions, if I recall).
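Something like this, say. Everything below is illustrative: the reference values, the tolerances, and the 4-out-of-6 threshold are all made up.

```python
import numpy as np

def climatology_votes(value, clim_means, clim_stds, n_sigma=3.0):
    # Count how many climatologies place `value` further than n_sigma
    # standard deviations from their climatological mean.
    z = np.abs(value - np.asarray(clim_means)) / np.asarray(clim_stds)
    return int(np.sum(z > n_sigma))

# Example: one PSAL value tested against six reference products.
psal = 35.9
means = [35.1, 35.2, 35.15, 35.0, 35.3, 35.2]  # hypothetical climatological means
stds = [0.2, 0.25, 0.2, 0.3, 0.2, 0.25]        # hypothetical spreads

votes = climatology_votes(psal, means, stds)
flagged = votes >= 4  # threshold on the alarm count, e.g. 4 of 6
print(f"{votes}/6 alarms -> flagged={flagged}")
```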

gmaze commented 4 years ago

@gaelforget this is a good suggestion, and one we have started to experiment with as well: taking the final decision on the basis of several QC test outcomes. But the choice of the acceptable distance to the climatology is as important as the climatology value itself. One would need an "optimization" approach where, based on the historical dataset, we determine the best combination of distance/reference to detect bad data.

This, however, points to another problem: the distance beyond which a datum is declared "bad" is in practice dependent on the user application. This is particularly true for data assimilation, where the data need to be somehow compatible with the ocean simulated by the numerical model.

This finally led us to conclude that the best we could do is to compute a goodness probability for each measurement; it would then be up to the user to define a threshold.
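In scikit-learn terms, this amounts to exposing the classifier's predicted probability instead of its hard label. A minimal sketch on synthetic data (the feature dimensions, labels, and both thresholds are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
y = rng.random(2000) < 0.05  # True = alarm (bad data), heavily imbalanced

# class_weight="balanced" is one simple way to counter the imbalance.
clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
clf.fit(X[:1500], y[:1500])

# P(good) = probability of the "no alarm" class (classes_ are sorted: False, True).
proba_good = clf.predict_proba(X[1500:])[:, 0]

# Each application then chooses its own tolerance:
keep_for_assimilation = proba_good > 0.99  # strict: model compatibility matters
keep_for_climatology = proba_good > 0.80   # more permissive
print(keep_for_assimilation.sum(), keep_for_climatology.sum())
```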