IQuOD / AutoQC

A testing suite for automatic quality control checks of subsurface ocean temperature observations
MIT License
29 stars 16 forks source link

How are we doing? #146

Open bkatiemills opened 8 years ago

bkatiemills commented 8 years ago

Since Hamburg, the number of qctests implemented has more than doubled (thanks @s-good and @castelao!) - it might be interesting to run the complete stack on quota again, to see how much of an impact we've made on our 55% true positive rate since then. I'm happy to do this, but it also might be good for someone other than me to go through the procedure of deploying and running on AWS (or wherever), so more than one person is familiar with how to do it, and we can double check the instructions are complete and clear.

s-good commented 8 years ago

Good idea! I'm pretty busy here so don't know if I'll get chance in the near future, so it might be best for you to do it if you are happy with that.

bkatiemills commented 8 years ago

No problem! I want to finish the track check first (getting there finally), then this.

s-good commented 8 years ago

Hi @BillMills, I was thinking we should put together a list of things we want to achieve before May. I thought of:

  1. Add checks using real examples for the EN std level check.
  2. Extract/convert the datasets we are going to use for the testing.
  3. Implement the CSIRO tests.
  4. At a stretch, add the objective analysis test from CORA dataset.

I'll need to do the first one of course and the second should be in hand. I think the fourth one would be difficult to implement but could be possible with help from the CORA team. We should probably do everything else and then see if we have time to do it. How are the CSIRO tests going? Is there anything I can do to help?

bkatiemills commented 8 years ago

Hi @s-good - sounds like a plan to me! Also on the topic of running the full stack, I've got a test run going on AWS as we speak, but there are a couple issues:

s-good commented 8 years ago

One of the ICDC tests requires a new data file which is in the docker image but the rest should run as they are. Do you have the error message?

If it is going to cost that much on AWS I agree we should look into other options. I can try to run it at my work - it may take some time to get it all run but I think it should be possible.

s-good commented 8 years ago

Also, I think if it is running very slow we should not attempt the CORA test as I think it would be very resource heavy to run. We could consider, as a one off, asking them to run the profiles through their system and return the results to us. If it turns out to be very good we could implement it at a later date.

bkatiemills commented 8 years ago

A few updates:

bkatiemills commented 8 years ago

In 2-3 hours on AWS c3large:

Number of profiles tested was 12787

                       NAME OF TEST   FAILS     TPR     FPR     TNR     FNR
            Argo_global_range_check     102   16.0%    0.0%  100.0%   84.0%
                 Argo_gradient_test     186   27.4%    0.1%   99.9%   72.6%
          Argo_impossible_date_test       0    0.0%    0.0%  100.0%  100.0%
      Argo_impossible_location_test       0    0.0%    0.0%  100.0%  100.0%
      Argo_pressure_increasing_test    1696   11.6%   13.4%   86.6%   88.4%
           Argo_regional_range_test       5    0.6%    0.0%  100.0%   99.4%
                    Argo_spike_test      19    3.0%    0.0%  100.0%   97.0%
                        CSIRO_depth    9473   83.7%   73.6%   26.4%   16.3%
                     CSIRO_gradient    2036   43.8%   14.5%   85.5%   56.2%
                   CSIRO_wire_break     161   25.2%    0.0%  100.0%   74.8%
      CoTeDe_Argo_density_inversion    6028   34.9%   47.8%   52.2%   65.1%
          CoTeDe_GTSPP_WOA_normbias    2953   85.0%   19.8%   80.2%   15.0%
          CoTeDe_GTSPP_global_range     103   16.1%    0.0%  100.0%   83.9%
              CoTeDe_GTSPP_gradient     123   17.5%    0.1%   99.9%   82.5%
           CoTeDe_GTSPP_spike_check      24    3.3%    0.0%  100.0%   96.7%
                         CoTeDe_RoC     174   24.9%    0.1%   99.9%   75.1%
                CoTeDe_WOA_normbias    1328   61.2%    7.7%   92.3%   38.8%
           CoTeDe_anomaly_detection    2180   51.3%   15.2%   84.8%   48.7%
             CoTeDe_digit_roll_over     294   28.6%    0.9%   99.1%   71.4%
                 CoTeDe_fuzzy_logic    1826   74.5%   11.1%   88.9%   25.5%
                    CoTeDe_gradient     131   18.8%    0.1%   99.9%   81.2%
        CoTeDe_location_at_sea_test       6    0.5%    0.0%  100.0%   99.5%
             CoTeDe_profile_envelop     185   29.0%    0.0%  100.0%   71.0%
                       CoTeDe_spike      14    2.2%    0.0%  100.0%   97.8%
               CoTeDe_tukey53H_norm      68    6.6%    0.2%   99.8%   93.4%
      EN_background_available_check      15    1.9%    0.0%  100.0%   98.1%
                EN_background_check     931   43.8%    5.4%   94.6%   56.2%
            EN_constant_value_check       1    0.2%    0.0%  100.0%   99.8%
                     EN_range_check     102   16.0%    0.0%  100.0%   84.0%
            EN_spike_and_step_check      17    2.5%    0.0%  100.0%   97.5%
          EN_spike_and_step_suspect     210   29.1%    0.2%   99.8%   70.9%
                 EN_stability_check      28    1.3%    0.2%   99.8%   98.7%
                 WOD_gradient_check     266   30.0%    0.6%   99.4%   70.0%
                    WOD_range_check     192   29.6%    0.0%  100.0%   70.4%
              loose_location_at_sea       2    0.3%    0.0%  100.0%   99.7%
               RESULT OF OR OF ALL:   11395   98.4%   88.6%   11.4%    1.6%

Taking all those that had TPR > FNR, minus CSIRO_depth for its large FPR, and in about 30 minutes:

Number of profiles tested was 12787

                       NAME OF TEST   FAILS     TPR     FPR     TNR     FNR
          CoTeDe_GTSPP_WOA_normbias    2953   85.0%   19.8%   80.2%   15.0%
                CoTeDe_WOA_normbias    1328   61.2%    7.7%   92.3%   38.8%
           CoTeDe_anomaly_detection    2180   51.3%   15.2%   84.8%   48.7%
                 CoTeDe_fuzzy_logic    1826   74.5%   11.1%   88.9%   25.5%
               RESULT OF OR OF ALL:    4363   90.5%   31.2%   68.8%    9.5%

Or instead, taking all those with FPR < 1% and in <1h:

Number of profiles tested was 12787

                       NAME OF TEST   FAILS     TPR     FPR     TNR     FNR
            Argo_global_range_check     102   16.0%    0.0%  100.0%   84.0%
                 Argo_gradient_test     186   27.4%    0.1%   99.9%   72.6%
           Argo_regional_range_test       5    0.6%    0.0%  100.0%   99.4%
                    Argo_spike_test      19    3.0%    0.0%  100.0%   97.0%
                   CSIRO_wire_break     161   25.2%    0.0%  100.0%   74.8%
          CoTeDe_GTSPP_global_range     103   16.1%    0.0%  100.0%   83.9%
              CoTeDe_GTSPP_gradient     123   17.5%    0.1%   99.9%   82.5%
           CoTeDe_GTSPP_spike_check      24    3.3%    0.0%  100.0%   96.7%
                         CoTeDe_RoC     174   24.9%    0.1%   99.9%   75.1%
             CoTeDe_digit_roll_over     294   28.6%    0.9%   99.1%   71.4%
                    CoTeDe_gradient     131   18.8%    0.1%   99.9%   81.2%
        CoTeDe_location_at_sea_test       6    0.5%    0.0%  100.0%   99.5%
             CoTeDe_profile_envelop     185   29.0%    0.0%  100.0%   71.0%
                       CoTeDe_spike      14    2.2%    0.0%  100.0%   97.8%
               CoTeDe_tukey53H_norm      68    6.6%    0.2%   99.8%   93.4%
            EN_constant_value_check       1    0.2%    0.0%  100.0%   99.8%
                     EN_range_check     102   16.0%    0.0%  100.0%   84.0%
            EN_spike_and_step_check      17    2.5%    0.0%  100.0%   97.5%
          EN_spike_and_step_suspect     210   29.1%    0.2%   99.8%   70.9%
                 EN_stability_check      28    1.3%    0.2%   99.8%   98.7%
                 WOD_gradient_check     266   30.0%    0.6%   99.4%   70.0%
                    WOD_range_check     192   29.6%    0.0%  100.0%   70.4%
              loose_location_at_sea       2    0.3%    0.0%  100.0%   99.7%
               RESULT OF OR OF ALL:     474   36.9%    2.0%   98.0%   63.1%

As I think you can tell, this is me wandering through combinations fairly ad hoc - but this is looking quite strong compared to Hamburg, thanks to the CoTeDe work. More soon.

s-good commented 8 years ago

This is fantastic! Thanks for doing this.

I will also have a go at running it at work - I think if I break it up into small enough parallel chunks and save the output as I go rather than at the end it may work and it saves the money, even if it is only a small amount each time. I'll let you know how I get on.

bkatiemills commented 8 years ago

Now that we're mostly just crossing t's on #130, one thing that strikes me as pressing for May is making a final decision about what set of tests to use. @s-good, let me know if you have something in mind; otherwise, I will turn my attention to this next.

We should set some performance goals, which will likely look something like "minimum false positives while demanding < n% false negatives;" of course we want n = 0, but that's more of an asymptotic goal. The experts will need to set n for IQuOD 1.0, but if forced to do it myself, I would say something like n<5% this year, and shoot for n<2% next (if we plan on updating things in future, which is what I understood from Hamburg). But, not sure what numbers will be compelling for the community - let me know and we'll make it work.

bkatiemills commented 8 years ago

154 contains a quick application of a poll of a couple of machine learning strategies on a recent run over ~12000 quota profiles; current best is roughly:

bkatiemills commented 8 years ago

While hammering on that last 10% of bad profiles not getting flagged (absolutely baroque details here), I noticed that a whole lot of profiles in quota which report 0 for the depth of every level are not considered truly 'flagged' by AutoQC's referenceResults, which only considers the temperature qc t_level_qc. @s-good, is this what we want? Missing depth information seems (from my naive perspective) like a pretty serious problem that perhaps should affect how we judge whether a profile should have been flagged or not.

s-good commented 8 years ago

It sounds like we might need to be checking the depth QC flags as well as the temperature flags. Does it help to change line 81 in main.py from: refAssessment = profile.t_level_qc(originator=True) >= 3 to: refAssessment = profile.t_level_qc(originator=True) >= 3 | profile.z_level_qc(originator=True) >= 3

To answer your previous question, it would be good to try to find the best combination for a few cases e.g. lowest FPR if TPR > 90%, 95% and 98%. It might be interesting to also find highest TPR if FPR < 5% but I think that is a much lower priority.

It seems like we are pretty close to where we want to be with getting the QC checks done before running AutoQC on the different test datasets we have identified? If so I will ask about getting the test datasets extracted/converted into the WOD ASCII format.

bkatiemills commented 8 years ago
s-good commented 8 years ago

I'm a bit concerned by these profiles with depth of all 0 since they sound unusual and they are going to distort our statistics. @BecCowley, @BoyerWOD can you give any advice on profiles that @BillMills has found in QuOTA with depths of all 0? Are these real profiles in QuOTA or could this be an error in the translation to WOD format?

@BoyerWOD please could you also advise on which QC flags we need to read from the WOD ASCII files in order to get the QuOTA QC decisions? We are currently reading the originators QC flags attached to the temperature values. Do we also need to use the originator QC flags attached to the depths + anything else?

BecCowley commented 8 years ago

I have just looked at one of the profiles in question (88213533). It is a 1995 PALACE float (precursor to the Argo floats). The file in Quota has the depths (107 of them), so it appears to be a translation error to the WOD format. @BoyerWOD will need to investigate. Perhaps send him a list of the unique IDs will be most helpful.

Also, 754413 is a CTD (a low resolution version) and a duplicate of another profile. The depths also exist in the original file.

Have we considered creating a reading function for the original file format? Could be useful in the future for XBT data out of CSIRO.

s-good commented 8 years ago

Thanks @BecCowley. I think I've worked out what is going on now. I've listed the 88213533 profile using the Fortran code supplied on the WOD website and get this:

CC cruise Latitde Longitde YYYY MM DD Time Cast #levels 99 0 -34.620 92.310 1995 6 14 14.40 88213533 107

Number of variables in this cast: 2

 z  fo     1        fo   25        fo

0.0 01   15.482 (5) 01    3.000 (1) 01
0.0 11   15.482 (5) 01    9.000 (1) 01
0.0 11   15.482 (5) 01   15.000 (2) 01
0.0 11   15.482 (5) 01   21.000 (2) 01
0.0 11   15.482 (5) 01   27.000 (2) 01
0.0 11   15.482 (5) 01   33.000 (2) 01
0.0 11   15.433 (5) 01   39.000 (2) 01
0.0 11   15.383 (5) 01   45.000 (2) 01
0.0 11   15.383 (5) 01   51.000 (2) 01
0.0 11   15.383 (5) 01   57.000 (2) 01
0.0 11   15.383 (5) 01   63.000 (2) 01
0.0 11   15.383 (5) 01   69.000 (2) 01
0.0 11   15.383 (5) 01   75.000 (2) 01
0.0 11   15.383 (5) 01   81.000 (2) 01
0.0 11   15.309 (5) 01   87.000 (2) 01
0.0 11   15.211 (5) 01   93.000 (2) 01
0.0 11   14.331 (5) 01   99.000 (2) 01
0.0 11   13.512 (5) 01  105.000 (3) 01
0.0 11   13.179 (5) 01  111.000 (3) 01
0.0 11   13.203 (5) 01  117.000 (3) 01
0.0 11   13.155 (5) 01  123.000 (3) 01
0.0 11   12.989 (5) 01  129.000 (3) 01
0.0 11   12.847 (5) 01  135.000 (3) 01
0.0 11   12.706 (5) 01  141.000 (3) 01
0.0 11   12.612 (5) 01  147.000 (3) 01
0.0 11   12.517 (5) 01  153.000 (3) 01
0.0 11   12.447 (5) 01  159.000 (3) 01
0.0 11   12.353 (5) 01  165.000 (3) 01
0.0 11   12.213 (5) 01  172.000 (3) 01
0.0 11   12.119 (5) 01  178.000 (3) 01
0.0 11   12.026 (5) 01  184.000 (3) 01
0.0 11   11.956 (5) 01  190.000 (3) 01
0.0 11   11.933 (5) 01  196.000 (3) 01
0.0 11   11.910 (4) 01  202.000 (3) 01
0.0 11   11.840 (4) 01  208.000 (3) 01
0.0 11   11.770 (4) 01  214.000 (3) 01
0.0 11   11.701 (5) 01  220.000 (3) 01
0.0 11   11.654 (5) 01  226.000 (3) 01
0.0 11   11.608 (5) 01  232.000 (3) 01
0.0 11   11.562 (5) 01  238.000 (3) 01
0.0 11   11.562 (5) 01  244.000 (3) 01
0.0 11   11.492 (5) 01  250.000 (3) 01
0.0 11   11.469 (5) 01  256.000 (3) 01
0.0 11   11.446 (5) 01  262.000 (3) 01
0.0 11   11.377 (5) 01  268.000 (3) 01
0.0 11   11.377 (5) 01  274.000 (3) 01
0.0 11   11.331 (5) 01  280.000 (3) 01
0.0 11   11.285 (5) 01  286.000 (3) 01
0.0 11   11.285 (5) 01  292.000 (3) 01
0.0 11   11.262 (5) 01  298.000 (3) 01
0.0 11   11.192 (5) 01  304.000 (3) 01
0.0 11   11.192 (5) 01  310.000 (3) 01
0.0 11   11.192 (5) 01  316.000 (3) 01
0.0 11   11.192 (5) 01  322.000 (3) 01
0.0 11   11.123 (5) 01  328.000 (3) 01
0.0 11   11.100 (3) 01  334.000 (3) 01
0.0 11   11.100 (3) 01  340.000 (3) 01
0.0 11   11.100 (3) 01  346.000 (3) 01
0.0 11   11.032 (5) 01  352.000 (3) 01
0.0 11   11.009 (5) 01  358.000 (3) 01
0.0 11   11.009 (5) 01  364.000 (3) 01
0.0 11   10.940 (4) 01  370.000 (3) 01
0.0 11   10.917 (5) 01  376.000 (3) 01
0.0 11   10.917 (5) 01  382.000 (3) 01
0.0 11   10.894 (5) 01  388.000 (3) 01
0.0 11   10.871 (5) 01  394.000 (3) 01
0.0 11   10.825 (5) 01  400.000 (3) 01
0.0 11   10.757 (5) 01  406.000 (3) 01
0.0 11   10.734 (5) 01  418.000 (3) 01
0.0 11   10.643 (5) 01  430.000 (3) 01
0.0 11   10.574 (5) 01  442.000 (3) 01
0.0 11   10.551 (5) 01  454.000 (3) 01
0.0 11   10.460 (4) 01  466.000 (3) 01
0.0 11   10.369 (5) 01  478.000 (3) 01
0.0 11   10.301 (5) 01  490.000 (3) 01
0.0 11   10.278 (5) 01  502.000 (3) 01
0.0 11   10.188 (5) 01  514.000 (3) 01
0.0 11   10.097 (5) 01  526.000 (3) 01
0.0 11   10.006 (5) 01  538.000 (3) 01
0.0 11    9.961 (4) 01  550.000 (3) 01
0.0 11    9.916 (4) 01  562.000 (3) 01
0.0 11    9.826 (4) 01  574.000 (3) 01
0.0 11    9.781 (4) 01  586.000 (3) 01
0.0 11    9.668 (4) 01  598.000 (3) 01
0.0 11    9.555 (4) 01  610.000 (3) 01
0.0 11    9.533 (4) 01  622.000 (3) 01
0.0 11    9.466 (4) 01  634.000 (3) 01
0.0 11    9.376 (4) 01  647.000 (3) 01
0.0 11    9.286 (4) 01  659.000 (3) 01
0.0 11    9.219 (4) 01  671.000 (3) 01
0.0 11    9.107 (4) 01  683.000 (3) 01
0.0 11    8.996 (4) 01  695.000 (3) 01
0.0 11    8.862 (4) 01  707.000 (3) 01
0.0 11    8.751 (4) 01  719.000 (3) 01
0.0 11    8.573 (4) 01  731.000 (3) 01
0.0 11    8.484 (4) 01  743.000 (3) 01
0.0 11    8.373 (4) 01  755.000 (3) 01
0.0 11    8.218 (4) 01  767.000 (3) 01
0.0 11    8.086 (4) 01  779.000 (3) 01
0.0 11    7.932 (4) 01  791.000 (3) 01
0.0 11    7.712 (4) 01  809.000 (3) 01
0.0 11    7.427 (4) 01  827.000 (3) 01
0.0 11    7.077 (4) 01  845.000 (3) 01
0.0 11    6.728 (4) 01  863.000 (3) 01
0.0 11    6.490 (3) 01  881.000 (3) 01
0.0 11    6.295 (4) 01  899.000 (3) 01
0.0 11    6.035 (4) 01  917.000 (3) 01

VarFlag: 0 0

Secondary header # 1 0. (1) Secondary header # 29 4. (1) Secondary header # 96 3. (1) Secondary header # 99 2014248. (7)

Measured Variable # 1 Information Code # 5 4. (1)

The first column (depth) is all zeroes. Temperature (column with variable code 1) looks OK. There is also another variable present (code 25), which is pressure. So in this case there are pressure data but no depth data in the profile. The WOD QC flag for depth is 1 (fail) for all but the first level while the originator QC flag is 1 (pass) for all depths despite being all zeroes (I assume as it is using 1 for good data, 2 for probably good, 3 for probably bad, 4 for bad rather than the WOD QC codes). @BoyerWOD, is this as intended (and if so is it possible to either add them or use the originator QC flags to indicate to look for pressure rather than depth) or should the depth data be there?

@BecCowley, I was hoping to keep file format considerations separate from AutoQC to avoid complicating things + @BoyerWOD had already written file conversion scripts to convert all the test data to WOD ASCII before we started AutoQC so it seemed to make sense to use that. However, there was talk about supporting a netCDF specification at the last workshop.

bkatiemills commented 8 years ago

Thanks for looking into this, folks! Here's a list of profiles from my ascii-copy of quota that look like they have zero depth for all levels, per @BecCowley's request: zero_depth_profiles.txt

Let me know how we'd like to proceed re: defining our 'fail' flag. It might be possible to expand wodpy to support netcdf, but it's going to be very tight given our May deadline, and given that much of the time till then will be taken up by constructing and demonstrating a final decision algorithm.

BecCowley commented 8 years ago

@s-good @BillMills. I initially thought that the problem was the pressure record: the data in the original file is pressure, not depth. In our files we have either depth or pressure, but not both. It seems that the WOD format has capacity for both, but the conversion to depth is not made in the change from Mquest to WOD format.

However, going through the list you have provided shows that many of them are depth records and they are a mix of data types (BO, MB, CT, PF, XB, etc). So my conclusion is that I will send the list to Tim and ask him to review the problem.

For profiles with pressure only data: When we look at the profiles with pressure at CSIRO, we do a conversion on the fly to depth, rather than store it, so all pass flags are applied to depth, even though it is pressure that is in the file. Maybe you can do a conversion too? If depth is missing, look for pressure and do the conversion. It is a simple calculation using the teos-10 toolbox (http://www.teos-10.org/software.htm). The function is gsw_z_from_p. The toolbox is written in various languages, although python isn't one. This might be quicker and more useful than getting Tim to implement and re-create the dataset, although, he should be made aware of it.

Have you already implemented the teos-10 toolbox? If not, it will be very useful later.

Re: conversions from netcdf to WOD. That is fine, let's go with what we have available.

bkatiemills commented 8 years ago

Thanks, @BecCowley! There appears to be a pythonic gsw under development that includes the function you mention here - assuming this works as expected out of the box, this shouldn't (famous last words) be a showstopper. Could even do a flag in wodpy that replaces literal z with z(p) automatically - comments, @s-good?

s-good commented 8 years ago

Hi @BecCowley, I looked through a few of the profiles in the list (I couldn't find some of the profiles though so something may have gone wrong with what I was doing!) I found two types:

  1. Single level 'profiles' with their only level at 0 m.
  2. Multi level profiles with all 0 depths and with pressure included in the variables.

Did you find any multi level profiles with no pressure and all 0 depth?

I'm not sure how interested we are in type 1 profiles? These are just sea surface temperature measurements and most of our QC tests will not work on these. Do you think we should filter these out?

We are already importing gsw in AutoQC (I think it might be used by CoTeDe) so it would be no problem to do the conversion on the fly as suggested. However, I think it would be good to get advice from @BoyerWOD first. Actually we should probably consider doing the opposite conversion for the Argo QC tests as these are supposed to work on pressure. I said to ignore this before to reduce complexity but since it looks simple to do with gsw maybe we should add it?

BecCowley commented 8 years ago

@s-good. I agree, the two profile types are as you list. I have found 5 types in the original MQuest format using the list of profile numbers (total of 10748 profiles):

  1. Single level 'profiles' with their only level at 0 m (3344 profiles)
  2. Multi level profiles with all with pressure records (no depths) (1599 profiles)
  3. Multi level profiles with a surface only depth for TEMP and multiple depths for PSAL (26 profiles).
  4. Single level profiles with pressure records, no depth (1187 profiles)
  5. Single level profiles with depth records (4592 profiles).

I can understand the reasons types 1 to 4 have appeared. I do not know why type 5 has appeared. These should have depths recorded in the WOD version. @BoyerWOD will need to look at these specifically.

For type 1 and 3 (for temperature) the only tests likely to work are global range tests with these surface only data points.

I think we should utilize the gsw toolbox where necessary to switch between pressure and depth depending on the test requirements.

bkatiemills commented 8 years ago

Close as stale, or are there any outstanding concerns here before we wrap up v1?

s-good commented 8 years ago

Maybe keep this open for now to remind ourselves that these exist?