Feature #2781 Convert MET NetCDF point obs to Pandas DataFrame

georgemccabe commented 5 months ago

Expected Differences

[X] Do these changes introduce new tools, command line arguments, or configuration file options? [No]
[X] Do these changes modify the structure of existing or add new output data types (e.g. statistic line types or NetCDF variables)? [No]

Pull Request Testing

[X] Describe testing already performed for these changes:

On seneca, I created a test script that runs plot_point_obs both passing an input file directly into MET and passing that same file through a python embedding script that converts the data to a Pandas DataFrame, then passes it to MET.

Test directory: /d1/projects/METplus/METplus_Data/development/met_2781

To run the script:

cd /d1/projects/METplus/METplus_Data/development/met_2781
./run_test.sh

or just compare the output files output/raw_subset.png and output/pyembed_subset.png which should contain the same plot.

I also added a unit test to demonstrate the new Python Embedding example and confirmed that it runs successfully both on seneca and in GHA.

[X] Recommend testing for the reviewer(s) to perform, including the location of input datasets, and any additional instructions:

@DanielAdriaansen : confirm that the new logic works with your test data @hsoh-u : confirm that the new python logic is in the correct location and matches the format/standards of the rest of the python logic

[X] Do these changes include sufficient documentation updates, ensuring that no errors or warnings exist in the build of the documentation? [Yes]

Could consider adding an example of using the new script somewhere, but I am not sure where that would live.

[X] Do these changes include sufficient testing updates? [Yes]
[X] Will this PR result in changes to the MET test suite? [Yes]

There will be 1 additional output file generated:

dir1: /data/output/met_test_truth contains 1144 files dir2: /data/output/met_test_output contains 1145 files

ERROR: folder /data/output/met_test_truth missing 1 files python/ndas.20120409.t12z.prepbufr.tm00.nr_met_nc_to_pandas.ps

[X] Will this PR result in changes to existing METplus Use Cases? [No]
If yes, create a new Update Truth METplus issue to describe them.
[X] Do these changes introduce new SonarQube findings? [No]
[X] Please complete this pull request review by 5/7/2024.

Pull Request Checklist

See the METplus Workflow for details.

[X] Review the source issue metadata (required labels, projects, and milestone).
[X] Complete the PR definition above.
[X] Ensure the PR title matches the feature or bugfix branch name.
[X] Define the PR metadata, as permissions allow. Select: Reviewer(s) and Development issue Select: Milestone as the version that will include these changes Select: Coordinated METplus-X.Y Support project for bugfix releases or MET-X.Y.Z Development project for official releases
[x] After submitting the PR, select the :gear: icon in the Development section of the right hand sidebar. Search for the issue that this PR will close and select it, if it is not already selected.
[ ] After the PR is approved, merge your changes. If permissions do not allow this, request that the reviewer do the merge.
[ ] Close the linked issue and delete your feature or bugfix branch from GitHub.

DanielAdriaansen commented 5 months ago

I was able to test this on seneca. It takes 20s to read in this file:

DEBUG 1: Reading point observation file: PYTHON_NUMPY=pyembed_pandas_testing.py
           typ       sid              vld        lat         lon   elv   var     lvl         hgt qc         obs
0       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0   HGT  1000.0  -151.19162  2 -151.000000
1       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0  SPFH   977.0    18.02284  2    0.000329
2       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0   TMP   977.0    18.02284  2  249.449997
3       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0   HGT   977.0    18.02284  2   18.000000
4       ADPUPA     89571  20200824_113000 -68.580002   77.970001  18.0  SPFH   976.0 -9999.00000  2    0.000342
...        ...       ...              ...        ...         ...   ...   ...     ...         ... ..         ...
934843  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  VGRD   500.0 -9999.00000  0    6.900000
934844  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  UGRD   400.0 -9999.00000  0    7.900000
934845  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  VGRD   400.0 -9999.00000  0    6.100000
934846  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  UGRD   300.0 -9999.00000  0    4.700000
934847  SYNDAT  MA030044  20200824_120000  26.250000  127.500000    --  VGRD   300.0 -9999.00000  0    4.500000

[934848 rows x 11 columns]

which is close to 1M observations. Better running of PB2NC via config options could help speed this up.

I worked with the DataFrame a bit in Python and didn't observe any trouble.

I wonder if we need any documentation of this? Maybe in Appendix F? @JohnHalleyGotway thoughts?

DanielAdriaansen commented 5 months ago

I guess we have this section, which is empty: https://met.readthedocs.io/en/develop/Users_Guide/appendixF.html#met-python-package

I thought maybe convert_point_data() was documented there, but it is not. So maybe for now it's OK to leave this undocumented.

Maybe I will add a "to-do" item here: on #2414 to document the "MET Python Module".

georgemccabe commented 5 months ago

@hsoh-u , I talked with @DanielAdriaansen about these changes. Based on his feedback, I added an init function to nc_point_obs to take an input file path so it can be initialized without calling read_data(). I also changed the read_data() function to raise an exception instead of return a boolean for success.

dtcenter / MET