dtcenter / MET

Model Evaluation Tools
https://dtcenter.org/community-code/model-evaluation-tools-met
Apache License 2.0

Feature #2781 Convert MET NetCDF point obs to Pandas DataFrame #2877

Closed georgemccabe closed 5 months ago

georgemccabe commented 5 months ago

Expected Differences

Pull Request Testing

On seneca, I created a test script that runs plot_point_obs two ways: passing an input file directly into MET, and passing that same file through a Python embedding script that converts the data to a Pandas DataFrame and then passes it to MET.

Test directory: /d1/projects/METplus/METplus_Data/development/met_2781

To run the script:

cd /d1/projects/METplus/METplus_Data/development/met_2781
./run_test.sh

Alternatively, compare the output files output/raw_subset.png and output/pyembed_subset.png, which should contain the same plot.
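
One quick way to check the two plots is a byte-for-byte comparison. The sketch below is an illustration only and is not part of run_test.sh; the helper names file_digest and same_contents are hypothetical. Identical bytes imply identical plots, but differing bytes (for example from embedded metadata) do not necessarily mean the plots differ visually.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """Return True only when both files exist and have identical bytes."""
    path_a, path_b = Path(path_a), Path(path_b)
    if not (path_a.is_file() and path_b.is_file()):
        return False
    return file_digest(path_a) == file_digest(path_b)


if __name__ == "__main__":
    # File names taken from the test directory described above.
    print(same_contents("output/raw_subset.png", "output/pyembed_subset.png"))
```

If the digests differ, a visual comparison of the two PNG files is still the deciding check.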

I also added a unit test to demonstrate the new Python Embedding example and confirmed that it runs successfully both on seneca and in GHA.

@DanielAdriaansen: confirm that the new logic works with your test data.

@hsoh-u: confirm that the new Python logic is in the correct location and matches the format/standards of the rest of the Python logic.

We could consider adding an example of using the new script somewhere, but I am not sure where it would live.

There will be 1 additional output file generated:

dir1: /data/output/met_test_truth contains 1144 files
dir2: /data/output/met_test_output contains 1145 files

ERROR: folder /data/output/met_test_truth missing 1 files
python/ndas.20120409.t12z.prepbufr.tm00.nr_met_nc_to_pandas.ps
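
For reference, the kind of check that produces a report like the one above can be sketched as a set difference over relative file paths. This is a minimal illustration, not the actual comparison logic used in the automated tests; the function names relative_files and compare_dirs are hypothetical.

```python
from pathlib import Path


def relative_files(root):
    """Collect every file under root as a POSIX-style path relative to root."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}


def compare_dirs(truth_dir, output_dir):
    """Return (files missing from truth, files missing from output), sorted."""
    truth = relative_files(truth_dir)
    output = relative_files(output_dir)
    return sorted(output - truth), sorted(truth - output)


if __name__ == "__main__":
    # Directory names taken from the report above.
    missing_in_truth, missing_in_output = compare_dirs(
        "/data/output/met_test_truth", "/data/output/met_test_output"
    )
    print(f"truth is missing {len(missing_in_truth)} files")
    for name in missing_in_truth:
        print(name)
```

A file that appears only in the output directory, such as the new .ps file here, is an expected difference when a PR adds a test.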

Pull Request Checklist

See the METplus Workflow for details.

DanielAdriaansen commented 5 months ago

I was able to test this on seneca. It takes 20s to read in this file:

DEBUG 1: Reading point observation file: PYTHON_NUMPY=pyembed_pandas_testing.py
           typ       sid              vld        lat         lon   elv   var     lvl         hgt qc         obs
0       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0   HGT  1000.0  -151.19162  2 -151.000000
1       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0  SPFH   977.0    18.02284  2    0.000329
2       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0   TMP   977.0    18.02284  2  249.449997
3       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0   HGT   977.0    18.02284  2   18.000000
4       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0  SPFH   976.0 -9999.00000  2    0.000342
...        ...       ...              ...        ...         ...   ...   ...     ...         ... ..         ...
934843  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  VGRD   500.0 -9999.00000  0    6.900000
934844  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  UGRD   400.0 -9999.00000  0    7.900000
934845  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  VGRD   400.0 -9999.00000  0    6.100000
934846  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  UGRD   300.0 -9999.00000  0    4.700000
934847  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  VGRD   300.0 -9999.00000  0    4.500000

[934848 rows x 11 columns]

That is close to 1M observations. Running PB2NC with better-tuned config options could help speed this up.

I worked with the DataFrame a bit in Python and didn't observe any trouble.
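
As an illustration of that kind of interactive check, here is a small sketch that rebuilds the first five rows printed above by hand and runs a few common operations. It does not call the MET conversion script. It assumes that -9999 marks a missing value in the hgt column and that sid and qc are stored as strings; both are assumptions, not something confirmed in this thread.

```python
import numpy as np
import pandas as pd

# Column names as printed in the DataFrame output above.
COLUMNS = ["typ", "sid", "vld", "lat", "lon", "elv", "var", "lvl", "hgt", "qc", "obs"]

# First five rows copied by hand from the output above (sid/qc dtypes are assumed).
rows = [
    ["ADPUPA", "89571", "20200824_113000", -68.580002, 77.970001, 18.0, "HGT", 1000.0, -151.19162, "2", -151.0],
    ["ADPUPA", "89571", "20200824_113000", -68.580002, 77.970001, 18.0, "SPFH", 977.0, 18.02284, "2", 0.000329],
    ["ADPUPA", "89571", "20200824_113000", -68.580002, 77.970001, 18.0, "TMP", 977.0, 18.02284, "2", 249.449997],
    ["ADPUPA", "89571", "20200824_113000", -68.580002, 77.970001, 18.0, "HGT", 977.0, 18.02284, "2", 18.0],
    ["ADPUPA", "89571", "20200824_113000", -68.580002, 77.970001, 18.0, "SPFH", 976.0, -9999.0, "2", 0.000342],
]
df = pd.DataFrame(rows, columns=COLUMNS)

# Assumption: -9999 is a missing-value sentinel, so mask it as NaN.
df["hgt"] = df["hgt"].replace(-9999.0, np.nan)

# Parse the valid time strings (YYYYMMDD_HHMMSS) into timestamps.
df["vld"] = pd.to_datetime(df["vld"], format="%Y%m%d_%H%M%S")

# Select one variable and count observations per variable.
tmp = df[df["var"] == "TMP"]
counts = df.groupby("var").size()

print(tmp[["sid", "vld", "lvl", "obs"]])
print(counts)
```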

I wonder if we need any documentation of this? Maybe in Appendix F? @JohnHalleyGotway thoughts?

DanielAdriaansen commented 5 months ago

I guess we have this section, which is empty: https://met.readthedocs.io/en/develop/Users_Guide/appendixF.html#met-python-package

I thought maybe convert_point_data() was documented there, but it is not. So maybe for now it's OK to leave this undocumented.

Maybe I will add a "to-do" item on #2414 to document the "MET Python Module".

georgemccabe commented 5 months ago

@hsoh-u, I talked with @DanielAdriaansen about these changes. Based on his feedback, I added an init function to nc_point_obs that takes an input file path, so it can be initialized without calling read_data(). I also changed the read_data() function to raise an exception instead of returning a boolean for success.
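
The pattern described here can be sketched roughly as follows. This is not the actual nc_point_obs code: the class name PointObsReader is hypothetical, the file is read as raw bytes as a placeholder for NetCDF parsing, and having the constructor call read_data() internally is one interpretation of the change.

```python
from pathlib import Path


class PointObsReader:
    """Hypothetical stand-in for a point observation reader (not MET's class)."""

    def __init__(self, input_path=None):
        self.input_path = None
        self.data = None
        # With a path given, the object is ready to use right after construction,
        # so the caller does not need a separate read_data() call.
        if input_path is not None:
            self.read_data(input_path)

    def read_data(self, input_path):
        """Load the file, raising on failure instead of returning False."""
        path = Path(input_path)
        if not path.is_file():
            raise FileNotFoundError(f"input file does not exist: {path}")
        self.input_path = str(path)
        # Placeholder: a real reader would parse the NetCDF contents here.
        self.data = path.read_bytes()


if __name__ == "__main__":
    try:
        PointObsReader("no_such_file.nc")
    except FileNotFoundError as err:
        print(f"read failed: {err}")
```

Raising an exception means a caller cannot silently ignore a failed read, which was possible when success was reported through a boolean return value.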