Implement prepbufr raobs obs builder 185

randytpierce commented 3 months ago

This is the implementation of the prepbufr builder.

randytpierce commented 3 months ago

I recommend using 3.12 because I have done all my prepbufr testing on 3.12. I know that version works with xarray, NCEPLIBS-bufr, and netcdf all of which are needed for the ingest container.

On Fri, Aug 2, 2024 at 9:40 AM Ian McGinnis @.***> wrote:

@.**** commented on this pull request.

In pyproject.toml https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700519014:

-python = "^3.11" +python = "^3.12"

If we're updating to Python 3.12, we should update our dockerfile & CI too. Unless 3.12 is required for the RAOBS, I could see an argument for making that a separate change/PR to keep the number of changes down. That being said, I'd be in favor of updating to 3.12 if all of our dependencies work with it.

In pyproject.toml https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700521301:

@@ -25,6 +26,7 @@ pytest = "^8.1.1" types-pyyaml = "^6.0.12.20240311" ruff = "^0.3.5" coverage = "^7.4.4" +ipykernel = "^6.29.4"

Do we need the jupyter python kernel for this project? I know sometimes VSCode insists on including it but if it's not required, I'd rather remove it so we don't have to worry about updating it for the security scanners.

In pyproject.toml https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700522895:

pyyaml = "^6.0.1" xarray = "^2024.3.0" +pyarrow = "^16.1.0"

I wasn't seeing pyarrow imported anywhere, did I miss a use case? I know it can be useful for working with Parquet files & doing columnar operations on data in memory. So it could be interesting to evaluate at some point.

In src/vxingest/grib2_to_cb/grib_builder.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700530189:

spfh = []

specific_humidity = []

This is a great naming improvement. Thanks!

In src/vxingest/grib2_to_cb/grib_builder.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700542438:
     for station in self.domain_stations:
geo_index = get_geo_index( self.ds_translate_item_variables_map["fcst_valid_epoch"], station["geo"] ) x_gridpoint = station["geo"][geo_index]["x_gridpoint"] y_gridpoint = station["geo"][geo_index]["y_gridpoint"]

spfh.append((float)(self.interp_grid_box(values, y_gridpoint, x_gridpoint)))

return spfh

specific_humidity.append(

(float)(self.interp_grid_box(values, y_gridpoint, x_gridpoint))
We could lose an extra set of parentheses here to improve readability - the ones around float don't do anything. ⬇️ Suggested change

(float)(self.interp_grid_box(values, y_gridpoint, x_gridpoint))

float(self.interp_grid_box(values, y_gridpoint, x_gridpoint))

In src/vxingest/grib2_to_cb/grib_builder.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700554740:

+class RaobModelNativeBuilderV01(GribModelBuilderV01):

"""This is the builder for model data that is ingested from grib2 NATIVE levels files.

It is a concrete builder specifically for the model raob data that are organized based

on the models preset vertical levels. This varies quite a bit from model to model

and is dependent on the configuration set up before the model runs.

This builder is a subclass of the GribModelBuilderV01 class.

The primary differences in these two classes are the handlers that derive the pressure level.

The pressure level needs to be interpolated according to a specific algorithm.

Args:

load_spec (Object): The load spec used to init the parent

ingest_document (Object): the ingest document

number_stations (int, optional): the maximum number of stations to process (for debugging). Defaults to sys.maxsize.

More so I can follow along - do we ingest RAOB's from GRIB? I thought it was all sourced from PREPBUFR right now.

In src/vxingest/prepbufr_to_cb/README.md https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700564753:

+```text

To begin with, a PREPBUFR file does not always contain, within each single data subset, the data for an entire report! Instead, for reports which contain mass (i.e. temperature, moisture, etc.) as well as wind (i.e. direction and speed, U and V component, etc.) data values, such data values are stored within two separate but adjacent (within the overall file) data subsets, where each related subset, quite obviously, contains the same report time, location, station identification, etc. information as the other, but where the "mass" subset contains the pressures and/or height levels at which "mass" data values occur, while the corresponding "wind" subset contains the levels at which "wind" data values occur. While it is true that this may, in some cases, cause the same pressure and/or height level to appear in both subsets, this separation is nonetheless maintained for historical reasons peculiar to NCEP. At any rate, the below program will actually merge all of the data from both subsets into a single, unified report in such cases, so that the final decoded output is clearer and more intuitive. +```

Using a code block for this confused my for a minute if you were quoting a source document or if this was an aside, if it's a quote you can use Markdown's "quote" syntax (>) and do: ⬇️ Suggested change

-```text

To begin with, a PREPBUFR file does not always contain, within each single data subset, the data for an entire report! Instead, for reports which contain mass (i.e. temperature, moisture, etc.) as well as wind (i.e. direction and speed, U and V component, etc.) data values, such data values are stored within two separate but adjacent (within the overall file) data subsets, where each related subset, quite obviously, contains the same report time, location, station identification, etc. information as the other, but where the "mass" subset contains the pressures and/or height levels at which "mass" data values occur, while the corresponding "wind" subset contains the levels at which "wind" data values occur. While it is true that this may, in some cases, cause the same pressure and/or height level to appear in both subsets, this separation is nonetheless maintained for historical reasons peculiar to NCEP. At any rate, the below program will actually merge all of the data from both subsets into a single, unified report in such cases, so that the final decoded output is clearer and more intuitive. -``` +> To begin with, a PREPBUFR file does not always contain, within each single data subset, the data for an entire report! Instead, for reports which contain mass (i.e. temperature, moisture, etc.) as well as wind (i.e. direction and speed, U and V component, etc.) data values, such data values are stored within two separate but adjacent (within the overall file) data subsets, where each related subset, quite obviously, contains the same report time, location, station identification, etc. information as the other, but where the "mass" subset contains the pressures and/or height levels at which "mass" data values occur, while the corresponding "wind" subset contains the levels at which "wind" data values occur. While it is true that this may, in some cases, cause the same pressure and/or height level to appear in both subsets, this separation is nonetheless maintained for historical reasons peculiar to NCEP.

I also removed the last line since I originally took it to refer to our ingest when it's referring to Fortran code displayed on the website.

In src/vxingest/prepbufr_to_cb/README.md https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700569633:

+I'm only putting this here temporarily so that I don't lose it before it gets implemented. +RUC domain +RRFS North American domain +Great Lakes +Global (all lat/lon) +Tropics (-20 <= lat <= +20) +Southern Hemisphere (-80 <= lat < -20) +Northern Hemisphere (+20 < lat <= +80) +Arctic (lat >= +70) -- Might want to change this to lat >= 60N to match EMC? +Antarctic (lat <= -70) -- Might want to change this to lat <= 60S to match EMC? +Alaska +Hawaii +HRRR domain +Eastern HRRR domain +Western HRRR domain +CONUS +Eastern CONUS (lon <= 100W) +Western CONUS (lon <= 100W) +Northeastern CONUS +Southeastern CONUS +Central CONUS +Southern CONUS +Northwest CONUS +Southern Plain

Markdown treats single newlines as part of a paragraph so this renders weirdly. If this is still needed, the following will render better: ⬇️ Suggested change

-I'm only putting this here temporarily so that I don't lose it before it gets implemented. -RUC domain -RRFS North American domain -Great Lakes -Global (all lat/lon) -Tropics (-20 <= lat <= +20) -Southern Hemisphere (-80 <= lat < -20) -Northern Hemisphere (+20 < lat <= +80) -Arctic (lat >= +70) -- Might want to change this to lat >= 60N to match EMC? -Antarctic (lat <= -70) -- Might want to change this to lat <= 60S to match EMC? -Alaska -Hawaii -HRRR domain -Eastern HRRR domain -Western HRRR domain -CONUS -Eastern CONUS (lon <= 100W) -Western CONUS (lon <= 100W) -Northeastern CONUS -Southeastern CONUS -Central CONUS -Southern CONUS -Northwest CONUS -Southern Plain +I'm only putting this here temporarily so that I don't lose it before it gets implemented. + + RUC domain + RRFS North American domain + Great Lakes + Global (all lat/lon) + Tropics (-20 <= lat <= +20) + Southern Hemisphere (-80 <= lat < -20) + Northern Hemisphere (+20 < lat <= +80) + Arctic (lat >= +70) -- Might want to change this to lat >= 60N to match EMC? + Antarctic (lat <= -70) -- Might want to change this to lat <= 60S to match EMC? + Alaska + Hawaii + HRRR domain + Eastern HRRR domain + Western HRRR domain + CONUS + Eastern CONUS (lon <= 100W) + Western CONUS (lon <= 100W) + Northeastern CONUS + Southeastern CONUS + Central CONUS + Southern CONUS + Northwest CONUS +* Southern Plain

In src/vxingest/prepbufr_to_cb/README.md https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700571514:

+HRRR domain +Eastern HRRR domain +Western HRRR domain +CONUS +Eastern CONUS (lon <= 100W) +Western CONUS (lon <= 100W) +Northeastern CONUS +Southeastern CONUS +Central CONUS +Southern CONUS +Northwest CONUS +Southern Plain + +## Ingest template +The ingest template for prepbufr RAOBS is "MD:V01:RAOB:obs:ingest:prepbufr". +It follows the same small Domain Specific Language (DSL) that all ingest templates follow. This is the template portion...

For my knowledge - do we have the DSL documented somewhere? You've explained the syntax with *, &, and | but I keep forgetting it, so it'd be handy to have a reference. I was thinking it'd be nice to have a link to that reference here.

If not, I'll make an issue to document it in something like a docs/ingest-dsl.md file and then link to it where appropriate.

In src/vxingest/prepbufr_to_cb/README.md https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700578113:

+There are four sections of mappings. +1 header basic header data like lat, lon, and station name +2 q_marker quality data +3 obs_err observation error data +4 obs_data_120 observation MASS data +5 obs_data_220 observation WIND data

This also renders weirdly due to how Markdown handles newlines. If it's an ordered list, I'd recommend: ⬇️ Suggested change

-There are four sections of mappings. -1 header basic header data like lat, lon, and station name -2 q_marker quality data -3 obs_err observation error data -4 obs_data_120 observation MASS data -5 obs_data_220 observation WIND data +There are four sections of mappings. + +1. header basic header data like lat, lon, and station name +2. q_marker quality data +3. obs_err observation error data +4. obs_data_120 observation MASS data +5. obs_data_220 observation WIND data

Otherwise, you could do a code block or a table to preserve the formatting: ⬇️ Suggested change

-There are four sections of mappings. -1 header basic header data like lat, lon, and station name -2 q_marker quality data -3 obs_err observation error data -4 obs_data_120 observation MASS data -5 obs_data_220 observation WIND data +text +There are four sections of mappings. +1 header basic header data like lat, lon, and station name +2 q_marker quality data +3 obs_err observation error data +4 obs_data_120 observation MASS data +5 obs_data_220 observation WIND data +

On src/vxingest/prepbufr_to_cb/run_ingest_threads.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700591792:

A general note, we'll need to incorporate this into main.py.

We'll import the prepbufr_to_cb module here with an identifiable name like PrepbufrIngest if this is general to PREPBUFR files, or PrepbufrRaobIngest if we'll need to add a new module for other PREPBUFR data types: https://github.com/NOAA-GSL/VxIngest/blob/b43bb43838716d365eb19d76189d9ae40a4a395b/src/vxingest/main.py#L26-L35

And then, it's not my favorite approach, but we'll need to add an entry to our large switch statement, here: https://github.com/NOAA-GSL/VxIngest/blob/b43bb43838716d365eb19d76189d9ae40a4a395b/src/vxingest/main.py#L404-L467

We also may need to update the CLI flags if we have custom flags in the code here.

On src/vxingest/prepbufr_to_cb/prepbufr_builder.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700627964:

Generally, I'm concerned about how deeply nested parts (e.g. - interpolate_data and handle_document) of the prepbufr_to_cb module are. Deeply nested code has been shown to be difficult to reason about and debug and is mentioned in the Zen of Python - "Flat is better than nested." https://peps.python.org/pep-0020/ Typically, folks recommend keeping nesting to 2-3 levels deep. They deal with deeper nesting via guard clauses and extracting functions. If you'd like to refactor this, I'd find it really useful to pair with you on it. Selfishly, I'd love to better understand the ingest code & PREPBUFR data.

Some resources:

Google's Code Health Blog: Nesting guidance https://testing.googleblog.com/2017/06/code-health-reduce-nesting-reduce.html?m=1

Google's Code Health Blog: Simplifying Control flow https://testing.googleblog.com/2023/10/simplify-your-control-flows.html?m=1

Jeff Attwood on "Flattening arrow code" https://blog.codinghorror.com/flattening-arrow-code/

Two methods for refactoring deeply nested code https://shuhanmirza.medium.com/two-simple-methods-to-refactor-deeply-nested-code-78eb302bb0b4

In src/vxingest/prepbufr_to_cb/prepbufr_builder.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700943298:

I cannot process this station - there is no array of pressure data

del interpolated_data[station]

Do we want to remove the station from the data if it can't be processed or do we want to set it to None? del has its uses but it can be rare to see in Python so this caught my eye.

In src/vxingest/prepbufr_to_cb/prepbufr_builder.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700953920:

:return: the document_map

"""

try:

if len(self.same_time_rows) != 0:

self.handle_document()

return self.document_map

except Exception as _e:

logger.exception(

"%s get_document_map: Exception in get_document_map: %s",

self.class.name,

str(_e),

)

return None

named functions

def meterspersecond_to_milesperhour(self, params_dict):

Two thoughts:

Can we utilize Pint for these conversions?

If not, should we move these "data transformation" functions into a separate module so we can use them across the various ingest types?

Bigger picture, and more as a discussion point, I wonder if we should make more use of Pint within the new Ingest to help wrangle units. If we did want to do that, it'd be handled in a separate issue.

In tests/vxingest/README.md https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700958816:

@@ -40,9 +40,9 @@ Note that this currently (as of 1/2024) disables most of the tests.

Test data

-For now, you'll need test resources from: https://drive.google.com/drive/folders/18YY74S8w2S0knKQRN-QxZdnfRjKxDN69?usp=drive_link unpacked to /opt/data in order to run the test suite. +For now, you'll need test resources from: opt_data.tar.gz to '/opt/data' in order to run the test suite.

Looks like "unpacked" was added to the URL here and that made it invalid: ⬇️ Suggested change

-For now, you'll need test resources from: opt_data.tar.gz to '/opt/data' in order to run the test suite. +For now, you'll need test resources from: opt_data.tar.gz to '/opt/data' in order to run the test suite.

In tests/vxingest/prepbufr_to_cb/test_unit_prepbufr_builder.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700960636:

+def test_read_header(mock_header_bufr):

Create an instance of PrepbufrBuilder

builder = PrepbufrRaobsObsBuilderV01(

None,

{

"template": {"subset": "RAOB"},

"ingest_document_ids": {},

"file_type": "PREPBUFR",

"origin_type": "GDAS",

"mnemonic_mapping": hdr_template,

},

)

Call the read_header method with the mock bufr object

header_data = builder.read_data_from_bufr(mock_header_bufr, hdr_template)

Assert the expected values

assert header_data["station_id"] == "SID123"

assert header_data["lon"] == 45.679

assert header_data["lat"] == -123.457

assert header_data["obs-cycle_time"] == 0.5

assert header_data["station_type"] == 1

assert header_data["elevation"] == 100.0

assert header_data["report_type"] == 2

I like the way the data is mocked out here!

In tests/vxingest/prepbufr_to_cb/test_int_read_data_from_file.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700961520:

+def test_read_header():

queue_element = (

"/opt/data/prepbufr_to_cb/input_files/241011200.gdas.t12z.prepbufr.nr"

)

vx_ingest = setup_connection()

ingest_doc = vx_ingest.collection.get("MD:V01:RAOB:obs:ingest:prepbufr").content_as[

dict

]

template = ingest_doc["mnemonic_mapping"]

builder = PrepbufrRaobsObsBuilderV01(

None,

ingest_doc,

)

bufr = ncepbufr.open(queue_element)

bufr.advance()

assert bufr.msg_type == template["bufr_msg_type"], "Expected ADPUPA message type"

bufr.load_subset()

header = builder.read_data_from_bufr(bufr, template["header"])

bufr.close()

assert header is not None

assert header["station_id"] == "89571"

assert header["lon"] == 77.97

assert header["lat"] == -68.58

assert header["obs-cycle_time"] == -0.5

assert header["elevation"] == 18.0

assert header["data_dump_report_type"] == 11.0

assert header["report_type"] == 120

We may need to mark these as integration tests for CI with @pytest.mark.integration().

@pytest.mark.integration()def test_read_header(): ...

In src/vxingest/grib2_to_cb/grib_builder.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1700763293:

+class RaobModelNativeBuilderV01(GribModelBuilderV01):

"""This is the builder for model data that is ingested from grib2 NATIVE levels files.

It is a concrete builder specifically for the model raob data that are organized based

on the models preset vertical levels. This varies quite a bit from model to model

and is dependent on the configuration set up before the model runs.

This builder is a subclass of the GribModelBuilderV01 class.

The primary differences in these two classes are the handlers that derive the pressure level.

The pressure level needs to be interpolated according to a specific algorithm.

Args:

load_spec (Object): The load spec used to init the parent

ingest_document (Object): the ingest document

number_stations (int, optional): the maximum number of stations to process (for debugging). Defaults to sys.maxsize.

We discussed this during an in-person meeting - these are placeholder classes for extracting model data to compare with the RAOB obs data.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-GSL/VxIngest/pull/395#pullrequestreview-2213375662, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDVQPUCKDYDWFHRKSD7VJTZPOR7PAVCNFSM6AAAAABLZFRCMSVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDEMJTGM3TKNRWGI . You are receiving this because you authored the thread.Message ID: @.***>

-- Randy Pierce

ian-noaa commented 3 months ago

An aside - it looks like there's an unresolved merge conflict with the Poetry lock file on this branch that needs to be resolved before CI will run. I'd probably:

Merge in the latest from main
Take all of main's changes for poetry.lock
Regenerate the poetry.lock file once main is integrated & push it up as a new change.

randytpierce commented 3 months ago

Thanks

On Fri, Aug 2, 2024 at 3:45 PM Ian McGinnis @.***> wrote:

@.**** commented on this pull request.

In src/vxingest/prepbufr_to_cb/README.md https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1702349120:

+HRRR domain +Eastern HRRR domain +Western HRRR domain +CONUS +Eastern CONUS (lon <= 100W) +Western CONUS (lon <= 100W) +Northeastern CONUS +Southeastern CONUS +Central CONUS +Southern CONUS +Northwest CONUS +Southern Plain + +## Ingest template +The ingest template for prepbufr RAOBS is "MD:V01:RAOB:obs:ingest:prepbufr". +It follows the same small Domain Specific Language (DSL) that all ingest templates follow. This is the template portion...

Awesome! Thanks!

FWIW - you should be able to do a relative reference in markdown. So, you can do something like:

my link

to reference that section in particular.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1702349120, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDVQPWJEG6IORED4RVWU2TZPP4XFAVCNFSM6AAAAABLZFRCMSVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDEMJWGQ2DSOBTGY . You are receiving this because you authored the thread.Message ID: @.***>

randytpierce commented 3 months ago

Ok, after I get the main.py finished.

On Fri, Aug 2, 2024 at 3:27 PM Ian McGinnis @.***> wrote:

An aside - it looks like there's an unresolved merge conflict with the Poetry lock file on this branch that needs to be resolved before CI will run. I'd probably:

Merge in the latest from main

Take all of main's changes for poetry.lock

Regenerate the poetry.lock file once main is integrated & push it up as a new change.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-GSL/VxIngest/pull/395#issuecomment-2266164392, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDVQPRPIBGHGYIG7KFLXELZPP2ULAVCNFSM6AAAAABLZFRCMSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENRWGE3DIMZZGI . You are receiving this because you authored the thread.Message ID: @.***>

-- Randy Pierce

randytpierce commented 3 months ago

Ok, This is finished. Now to see if everything still works.

On Fri, Aug 2, 2024 at 3:27 PM Ian McGinnis @.***> wrote:

An aside - it looks like there's an unresolved merge conflict with the Poetry lock file on this branch that needs to be resolved before CI will run. I'd probably:

Merge in the latest from main

Take all of main's changes for poetry.lock

Regenerate the poetry.lock file once main is integrated & push it up as a new change.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-GSL/VxIngest/pull/395#issuecomment-2266164392, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDVQPRPIBGHGYIG7KFLXELZPP2ULAVCNFSM6AAAAABLZFRCMSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENRWGE3DIMZZGI . You are receiving this because you authored the thread.Message ID: @.***>

-- Randy Pierce

randytpierce commented 2 months ago

I have resolved the lint issues but the tests are failing because CI cannot find the ncepbufr module. I realize that I need to log onto aws and build the x86 linux version, but I don't have access to an x86 MAC (mine is ARM) to build that version. Still when I build the new module I don't remember how the CI finds the ncepbufr module. That still has me slightly confused. I think I will also add an integration test that has main.py as its entry point so no one will make the mistake of not integrating a new builder into main.py again. randy randy

On Mon, Aug 5, 2024 at 9:27 AM Ian McGinnis @.***> wrote:

@.**** commented on this pull request.

I added a couple notes based on your emails from Friday. It looks like you've resolved most of the lint issues but let me know if you have questions about the linter!

In docker/ingest/Dockerfile https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1704279792:

-FROM python:3.11-slim-bookworm AS builder +FROM python:3.12-slim-bookworm AS builder

We have a number of layers in this Dockerfile so this will also need to be done on Line 6 & Line 72.

In src/vxingest/prepbufr_to_cb/vx_ingest_manager.py https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1704288590:

stmnt_mysql = f'select wmoid,press,z,t,dp,rh,wd,ws from ruc_ua_pb.RAOB where date = "{date}" and press = {level} and wmoid = "{station}";'

_mysql_db = mysql.connector.connect(

host=self.load_spec["_mysql_host"],

user=self.load_spec["_mysql_user"],

password=self.load_spec["_mysql_pwd"],

)

my_cursor = _mysql_db.cursor()

my_cursor.execute(stmnt_mysql)

my_result_final = my_cursor.fetchall()

I missed this on my initial walkthrough - do we need to connect to a MySQL database? I thought VxIngest was Couchbase-only so that we could move away from our old MySQL DB.

And - if we do need that data, is there any other possible way to get it? I'd really like to avoid making the new ingest dependent on the old ingest.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-GSL/VxIngest/pull/395#pullrequestreview-2219274056, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDVQPRTMT5TWXKBEW67MD3ZP6KU3AVCNFSM6AAAAABLZFRCMSVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDEMJZGI3TIMBVGY . You are receiving this because you authored the thread.Message ID: @.***>

-- Randy Pierce

randytpierce commented 2 months ago

Thanks! I'll try this out and let you know. I thought poetry followed the python markers. Where, BTW, did you find the reference to PEP 508 markers? randy

On Tue, Aug 6, 2024 at 10:38 AM Ian McGinnis @.***> wrote:

@.**** commented on this pull request.

In pyproject.toml https://github.com/NOAA-GSL/VxIngest/pull/395#discussion_r1705832659:

+ncepbufr = [

{ platform = "linux_x86_64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-linux_x86_64.whl" },

{ platform = "macosx_14_0_arm64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-macosx_14_0_arm64.whl" } +]

It looks like Poetry's platform field takes 1 of darwin, linux, or win32. To get at the architecture Poetry supports PEP 508 markers https://peps.python.org/pep-0508/#environment-markers. You'd do: marker = "platform_machine == 'arm64'" for ARM, for example.

I also bumped the version number to match the new nceplibs version in the recommendation below. ⬇️ Suggested change

-ncepbufr = [

{ platform = "linux_x86_64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-linux_x86_64.whl" },

{ platform = "macosx_14_0_arm64", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.0.1-py312-none-macosx_14_0_arm64.whl" } -] +ncepbufr = [

{ platform = "linux", markers = "platform_machine == 'x86_64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-linux_x86_64.whl" },

{ platform = "darwin", markers = "platform_machine == 'arm64'", file = "./third_party/NCEPLIBS-bufr/wheel_dist/ncepbufr-12.1.0-py312-none-macosx_14_0_arm64.whl" } +]

— Reply to this email directly, view it on GitHub https://github.com/NOAA-GSL/VxIngest/pull/395#pullrequestreview-2221787421, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDVQPQGAD75KPY4EV22DT3ZQD3X7AVCNFSM6AAAAABLZFRCMSVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDEMRRG44DONBSGE . You are receiving this because you authored the thread.Message ID: @.***>

-- Randy Pierce

ian-noaa commented 2 months ago

Certainly! Poetry references them in their Dependency Specification guide.

ian-noaa commented 2 months ago

Now that I'm back from leave - what's the status on this PR? Last I recall, we had discovered we needed to install nceplibs-bufr.

It'd be good to get this finished and merged before we get pulled away onto other issues.

randytpierce commented 2 months ago

I have discovered two additional discrepancies in the data results compared to the legacy system - which is based on Ming's data assimilation code. One had to do with q-markers and that has now been implemented. The other is more subtle. The other one has to do with multiple entries of type 120 and type 220 reports within a single gdas prepbufr file. This must be understood before we consider this complete.

On Wed, Aug 21, 2024 at 10:58 AM Ian McGinnis @.***> wrote:

Now that I'm back from leave - what's the status on this PR? Last I recall, we had discovered we needed to install nceplibs-bufr.

It'd be good to get this finished and merged before we get pulled away onto other issues.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-GSL/VxIngest/pull/395#issuecomment-2302556288, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGDVQPQL6KSDETMURQFT2B3ZSTBJVAVCNFSM6AAAAABLZFRCMSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBSGU2TMMRYHA . You are receiving this because you authored the thread.Message ID: @.***>

-- Randy Pierce

NOAA-GSL / VxIngest

Implement prepbufr raobs obs builder 185 #395

@.**** commented on this pull request.

Do we need the jupyter python kernel for this project? I know sometimes VSCode insists on including it but if it's not required, I'd rather remove it so we don't have to worry about updating it for the security scanners.

I wasn't seeing pyarrow imported anywhere, did I miss a use case? I know it can be useful for working with Parquet files & doing columnar operations on data in memory. So it could be interesting to evaluate at some point.

This is a great naming improvement. Thanks!

More so I can follow along - do we ingest RAOB's from GRIB? I thought it was all sourced from PREPBUFR right now.

I also removed the last line since I originally took it to refer to our ingest when it's referring to Fortran code displayed on the website.

If not, I'll make an issue to document it in something like a docs/ingest-dsl.md file and then link to it where appropriate.

We also may need to update the CLI flags if we have custom flags in the code here.

I cannot process this station - there is no array of pressure data

Do we want to remove the station from the data if it can't be processed or do we want to set it to None? del has its uses but it can be rare to see in Python so this caught my eye.

named functions

Bigger picture, and more as a discussion point, I wonder if we should make more use of Pint within the new Ingest to help wrangle units. If we did want to do that, it'd be handled in a separate issue.

Test data

Create an instance of PrepbufrBuilder

Call the read_header method with the mock bufr object

Assert the expected values

I like the way the data is mocked out here!

@.**** commented on this pull request.

I added a couple notes based on your emails from Friday. It looks like you've resolved most of the lint issues but let me know if you have questions about the linter!

We have a number of layers in this Dockerfile so this will also need to be done on Line 6 & Line 72.

@.**** commented on this pull request.