dtcenter / MET

Model Evaluation Tools
https://dtcenter.org/community-code/model-evaluation-tools-met
Apache License 2.0

Bugfix 2867 point2grid qc flag #2890

Closed hsoh-u closed 1 month ago

hsoh-u commented 1 month ago

Expected Differences

The meaning of the ADP QC values was changed (previously 3 for high, 2 for medium, and 1 for low). The baseline algorithm and the enterprise algorithm produce different QC values for high, medium, and low. MET reads the QC values and their meanings from the variable attribute and applies them to the -qc option (where 0 is high, 1 is medium, and 2 is low).

The -qc options in the unit tests were changed to -qc 0,1,2.
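
As a rough illustration of the attribute-driven mapping described above, the following Python sketch normalizes raw QC values to the 0/1/2 scale used by -qc. It is not MET's implementation: the function names are hypothetical, and the attribute contents (CF-style `flag_values` and `flag_meanings`) are assumed example values, not copied from the GOES files.

```python
def build_qc_map(flag_values, flag_meanings):
    """Map raw QC values to normalized levels: 0=high, 1=medium, 2=low.

    flag_values / flag_meanings mimic CF-style variable attributes
    (assumed layout; the real ADP attributes may differ).
    """
    levels = {"high": 0, "medium": 1, "low": 2}
    qc_map = {}
    for raw, meaning in zip(flag_values, flag_meanings.split()):
        for keyword, level in levels.items():
            if keyword in meaning.lower():
                qc_map[raw] = level
                break
    return qc_map


def passes_qc(raw_value, qc_map, allowed_levels):
    """True if the raw QC value maps to one of the levels given to -qc."""
    return qc_map.get(raw_value) in allowed_levels


# Hypothetical attribute contents for the two conventions.
old_style = build_qc_map([1, 2, 3], "low_confidence medium_confidence high_confidence")
new_style = build_qc_map([0, 1, 2], "high_confidence medium_confidence low_confidence")

print(old_style)
print(new_style)
print(passes_qc(3, old_style, {0, 1}))
```

With a mapping like this, the same `-qc 0,1` request selects high and medium quality regardless of which raw encoding the input file uses.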

Pull Request Testing

A unit test was added.

New GOES16 data with Enterprise algorithm:

/d1/personal/hsoh/git/pull_request/MET_bugfix_2867_point2grid_qc_flag/bin/point2grid  /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20241100001171_e20241100003544_c20241100006242.nc G212 goes_aod_smoke_adp_high2.nc -adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20241100001171_e20241100003544_c20241100006361.nc -field 'name="AOD_Smoke"; level="*";' -v 4 -qc 0,1

==>

DEBUG 4: regrid_goes_variable() -> Count: actual: 6, missing: 2758918, non_missing: 652344
DEBUG 4:    Filtered: by QC: 0, by adp QC: 62127, by absent: 590193, total: 652320
DEBUG 4:    Range:  data: [-0.05000000075 - 4.999973297]  QC: [0 - 2]
DEBUG 4:    AOD QC: high=8092 medium=6910, low=47149, no_retrieval=0
DEBUG 4:    ADP QC: high=0 (4), medium=24 (87), low=62117 (62050), no_retrieval=10
DEBUG 4:    adjusted: high to medium=0, high to low=4, medium to low=63, total=67

Note: if only high quality is requested with -qc 0, all data will be filtered out.

Old GOES16 data with Baseline algorithm:

/d1/personal/hsoh/git/pull_request/MET_bugfix_2867_point2grid_qc_flag/bin/point2grid  /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20192662141196_e20192662143569_c20192662145547.nc G212 goes_aod_smoke_adp_high2.nc -adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20192662141196_e20192662143569_c20192662144526.nc -field 'name="AOD_Smoke"; level="*";' -v 4 -qc 0,1

==>

DEBUG 4: regrid_goes_variable() -> Count: actual: 121, missing: 1937596, non_missing: 1473666
DEBUG 4:    Filtered: by QC: 0, by adp QC: 86130, by absent: 1387116, total: 1473246
DEBUG 4:    Range:  data: [-0.05 - 4.99997]  QC: [0 - 2]
DEBUG 4:    AOD QC: high=222 medium=1938, low=84390, no_retrieval=0
DEBUG 4:    ADP QC: high=34 (634), medium=386 (9), low=86130 (85907), no_retrieval=0
DEBUG 4:    adjusted: high to medium=380, high to low=220, medium to low=3, total=603

Same with main V11.1 bugfix. More AOD files with Enterprise algorithm are at seneca:/d1/personal/hsoh/data/MET-2853/20240419
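
The totals in the two DEBUG 4 listings above are plain sums of their components. A small Python check, using only the numbers printed in those logs (the dictionary keys and function name are made up for this sketch):

```python
# Numbers copied from the "Filtered:" and "adjusted:" DEBUG 4 lines above.
runs = {
    "enterprise": {
        "filtered": (0, 62127, 590193),   # by QC, by adp QC, by absent
        "filtered_total": 652320,
        "adjusted": (0, 4, 63),           # high->medium, high->low, medium->low
        "adjusted_total": 67,
    },
    "baseline": {
        "filtered": (0, 86130, 1387116),
        "filtered_total": 1473246,
        "adjusted": (380, 220, 3),
        "adjusted_total": 603,
    },
}


def totals_consistent(run):
    """Check that each reported total equals the sum of its parts."""
    return (sum(run["filtered"]) == run["filtered_total"]
            and sum(run["adjusted"]) == run["adjusted_total"])


for name, run in runs.items():
    print(name, totals_consistent(run))
```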

One new file and three different output files, because 1) the logic to compute QC flags was changed and 2) the -qc option was changed.

Maybe. Many findings were resolved, but existing findings may be identified as new by the SonarQube server.

Pull Request Checklist

See the METplus Workflow for details.

JohnHalleyGotway commented 1 month ago

@hsoh-u I've been looking at the new test added by this Pull Request and note that it takes much longer than a similar, existing ADP test:

TEST: point2grid_GOES_16_ADP                     - pass -   2.312 sec
TEST: point2grid_GOES_16_ADP_Enterprise_high     - pass -  29.124 sec

The runtime increases from around 2 seconds to around 30.

I realize that the input data differs, but not dramatically so. Both have the same X/Y dimensions:

    y = 1500 ;
    x = 2500 ;

The existing AOD data is short:

    short AOD(y, x) ;
        AOD:_FillValue = -1s ;

Whereas the new AOD data is unsigned-short:

    ushort AOD(y, x) ;
        AOD:_FillValue = 65535US ;

Do you have any idea why there's such a dramatic difference in runtime?

hsoh-u commented 1 month ago

I will take a look. There was no difference in execution time when I ran the same commands manually:

time /d1/personal/hsoh/git/bugfixes/bugfix_2867_point2grid_qc_flag/MET/bin/point2grid \
    /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20241100001171_e20241100003544_c20241100006242.nc  \
    G212 \
    /d1/personal/hsoh/MET/test_output/bugfix_2867_point2grid_qc_flag/point2grid/point2grid_GOES_16_ADP_Enterprise_high.nc  \
    -field 'name="AOD_Smoke";  level="(*,*)";' \
    -adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20241100001171_e20241100003544_c20241100006361.nc  \
    -qc 0,1 -method MAX       -v 1 > log_enterprise

real    0m1.834s
user    0m1.905s
sys     0m3.893s

time /d1/personal/hsoh/git/bugfixes/bugfix_2867_point2grid_qc_flag/MET/bin/point2grid  \
    /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20192662141196_e20192662143569_c20192662145547.nc \
    G212 \
    /d1/personal/hsoh/MET/test_output/bugfix_2867_point2grid_qc_flag/point2grid/point2grid_GOES_16_ADP.nc \
    -field 'name="AOD_Smoke";  level="(*,*)";' \
    -adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20192662141196_e20192662143569_c20192662144526.nc \
    -qc 1,2 -method MAX       -v 1 > zzz_baseline

real    0m1.943s
user    0m1.965s
sys     0m3.972s

hsoh-u commented 1 month ago

I ran the unit test point2grid manually and got the same result.

TEST: point2grid_GOES_16_ADP                     - pass -   1.901 sec
TEST: point2grid_GOES_16_ADP_Enterprise_high     - pass -  29.006 sec

Here is the execution time from the log file: actual runtime = 29 seconds (from 17:54:48Z to 17:55:17Z).

export MET_TMP_DIR='${MET_TEST_OUTPUT}/point2grid'
/d1/personal/hsoh/git/bugfixes/bugfix_2867_point2grid_qc_flag/MET/share/met/../../bin/point2grid \
      /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20241100001171_e20241100003544_c20241100006242.nc \
      G212 \
      /d1/personal/hsoh/MET/test_output/bugfix_2867_point2grid_qc_flag/point2grid/point2grid_GOES_16_ADP_Enterprise_high.nc \
      -field 'name="AOD_Smoke";  level="(*,*)";' \
      -adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20241100001171_e20241100003544_c20241100006361.nc \
      -qc 0,1 -method MAX \
      -v 1
DEBUG 1: Start point2grid by hsoh(9895) at 2024-05-21 17:54:48Z  cmd: /d1/personal/hsoh/git/bugfixes/bugfix_2867_point2grid_qc_flag/MET/share/met/../../bin/point2grid /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20241100001171_e20241100003544_c20241100006242.nc G212 /d1/personal/hsoh/MET/test_output/bugfix_2867_point2grid_qc_flag/point2grid/point2grid_GOES_16_ADP_Enterprise_high.nc -field name="AOD_Smoke";  level="(*,*)"; -adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20241100001171_e20241100003544_c20241100006361.nc -qc 0,1 -method MAX -v 1
DEBUG 1: Reading data file: /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20241100001171_e20241100003544_c20241100006242.nc
DEBUG 1: Writing output file: /d1/personal/hsoh/MET/test_output/bugfix_2867_point2grid_qc_flag/point2grid/point2grid_GOES_16_ADP_Enterprise_high.nc
DEBUG 1: Finish point2grid by hsoh(9895) at 2024-05-21 17:55:17Z
Opening /d1/personal/hsoh/MET/test_output/bugfix_2867_point2grid_qc_flag/point2grid/point2grid_GOES_16_ADP_Enterprise_high.nc
Checking AOD_Smoke ... OK
Checking t ... OK
Checking time_bounds ... OK
unset MET_TMP_DIR

I copied the command and ran it manually. It took 2 seconds (from 18:04:01Z to 18:04:03Z):

DEBUG 1: Start point2grid by hsoh(9895) at 2024-05-21 18:04:01Z  cmd: /d1/personal/hsoh/git/bugfixes/bugfix_2867_point2grid_qc_flag/MET/share/met/../../bin/point2grid /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20241100001171_e20241100003544_c20241100006242.nc G212 /d1/personal/hsoh/MET/test_output/bugfix_2867_point2grid_qc_flag/point2grid/point2grid_GOES_16_ADP_Enterprise_high.nc -field name="AOD_Smoke";  level="(*,*)"; -adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20241100001171_e20241100003544_c20241100006361.nc -qc 0,1 -method MAX -v 1
DEBUG 1: Reading data file: /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20241100001171_e20241100003544_c20241100006242.nc
DEBUG 1: Writing output file: /d1/personal/hsoh/MET/test_output/bugfix_2867_point2grid_qc_flag/point2grid/point2grid_GOES_16_ADP_Enterprise_high.nc
DEBUG 1: Finish point2grid by hsoh(9895) at 2024-05-21 18:04:03Z
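
For reference, the elapsed times quoted above can be derived from the Start/Finish lines with a short Python helper. This is only a sketch: the function name is made up, and the sample log lines below are shortened by dropping the `cmd:` portion.

```python
import re
from datetime import datetime

# Matches the "DEBUG 1: Start/Finish point2grid ... at <timestamp>Z" lines.
STAMP = re.compile(
    r"DEBUG 1: (Start|Finish) point2grid .*? at "
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})Z"
)


def elapsed_seconds(log_text):
    """Return seconds between the Start and Finish lines of one run."""
    stamps = {}
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            stamps[match.group(1)] = datetime.strptime(
                match.group(2), "%Y-%m-%d %H:%M:%S")
    return (stamps["Finish"] - stamps["Start"]).total_seconds()


unit_test_log = """\
DEBUG 1: Start point2grid by hsoh(9895) at 2024-05-21 17:54:48Z
DEBUG 1: Finish point2grid by hsoh(9895) at 2024-05-21 17:55:17Z
"""
manual_log = """\
DEBUG 1: Start point2grid by hsoh(9895) at 2024-05-21 18:04:01Z
DEBUG 1: Finish point2grid by hsoh(9895) at 2024-05-21 18:04:03Z
"""

print(elapsed_seconds(unit_test_log))  # 29.0
print(elapsed_seconds(manual_log))     # 2.0
```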

I need to find a way to duplicate the problem.

JohnHalleyGotway commented 1 month ago

@hsoh-u the big latency is caused by having MET_TMP_DIR set. Without it set, it takes ~ 2 seconds. With it set, it takes ~ 30 seconds. Can you please take a look to figure out why there's such a large difference?

I'll also note that if you set it to something that doesn't exist (export MET_TMP_DIR=/bad/path) or to which you don't have write permission (export MET_TMP_DIR=/home/jopatz), it segfaults:

DEBUG 1: Reading data file: /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20241100001171_e20241100003544_c20241100006242.nc
FATAL ERROR (SEGFAULT): Process 1867683 got signal 11 @ local time = 2024-05-22 20:13:29Z
FATAL ERROR (SEGFAULT): Look for a core file in /d1/projects/MET/MET_pull_requests/met-12.0.0/beta5/MET-bugfix_2867_point2grid_qc_flag/internal/test_unit
FATAL ERROR (SEGFAULT): Process command line: /d1/projects/MET/MET_pull_requests/met-12.0.0/beta5/MET-bugfix_2867_point2grid_qc_flag/internal/test_unit/../../share/met/../../bin/point2grid /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-AODC-M6_G16_s20241100001171_e20241100003544_c20241100006242.nc G212 /d1/projects/MET/MET_pull_requests/met-12.0.0/beta5/MET-bugfix_2867_point2grid_qc_flag/internal/test_unit/../../test_output/point2grid/point2grid_GOES_16_ADP_Enterprise_high.nc -field name="AOD_Smoke";  level="(*,*)"; -adp /d1/projects/MET/MET_test_data/unit_test/model_data/goes_16/OR_ABI-L2-ADPC-M6_G16_s20241100001171_e20241100003544_c20241100006361.nc -qc 0,1 -method MAX -v 1 
Segmentation fault

Ideally, it would handle this error condition more gracefully.
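
One way to fail gracefully would be to validate the directory before anything is written to it. The Python sketch below only illustrates the kind of check meant here; it is not MET code (MET is C++), the function name is hypothetical, and falling back to the system temporary directory when MET_TMP_DIR is unset is an assumption.

```python
import os
import tempfile


def resolve_tmp_dir(env=os.environ):
    """Return a usable temporary directory or raise a descriptive error.

    Falls back to the system temp directory when MET_TMP_DIR is unset
    (assumed behavior for this sketch).
    """
    path = env.get("MET_TMP_DIR") or tempfile.gettempdir()
    if not os.path.isdir(path):
        raise NotADirectoryError(
            f"MET_TMP_DIR '{path}' does not exist or is not a directory")
    if not os.access(path, os.W_OK | os.X_OK):
        raise PermissionError(f"MET_TMP_DIR '{path}' is not writable")
    return path


# A non-existent path yields a clear error message instead of a crash.
try:
    resolve_tmp_dir({"MET_TMP_DIR": "/bad/path"})
except NotADirectoryError as err:
    print(err)
```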

hsoh-u commented 1 month ago

This is a one-time slowdown that occurs when a new MET_TMP_DIR is set or a new target grid is added. For the GOES data, point2grid generates the mapping to each target grid cell (the list of point lat/lon locations for each target grid cell) and saves the mappings to a NetCDF file in $MET_TMP_DIR. Subsequent runs are fast because they reuse the pre-generated mapping. So this is not a bug.
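
The compute-once-then-reuse pattern described above can be sketched in Python as follows. This is an illustration only, not point2grid's code: the helper name, the cache file name, and the use of JSON (instead of NetCDF) are assumptions made to keep the example self-contained.

```python
import json
import os
import tempfile


def load_or_build_mapping(cache_dir, grid_name, build_fn):
    """Return (mapping, cache_hit). Builds and saves the mapping on a miss."""
    # Hypothetical cache file name, one file per target grid.
    cache_path = os.path.join(cache_dir, f"mapping_{grid_name}.json")
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f), True
    mapping = build_fn()            # the expensive step, done once
    with open(cache_path, "w") as f:
        json.dump(mapping, f)
    return mapping, False


build_calls = []


def slow_build():
    """Stand-in for computing the point list of each target grid cell."""
    build_calls.append(1)
    return {"cell_0": [0, 1], "cell_1": [2]}


with tempfile.TemporaryDirectory() as tmp_dir:
    first, first_hit = load_or_build_mapping(tmp_dir, "G212", slow_build)
    second, second_hit = load_or_build_mapping(tmp_dir, "G212", slow_build)

print(first_hit, second_hit, len(build_calls))
```

The first call pays the build cost and writes the file; the second call finds the file and skips the build, which mirrors the slow first run and fast later runs reported in this thread.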