JCSDA-internal / ioda-converters

Various converters for getting obs data in and out of IODA

Snow yaml creation and test for BUFR snocvr file #1273

Closed YoulongXia-NOAA closed 7 months ago

YoulongXia-NOAA commented 1 year ago

Description

The NCEP/EMC obsproc team produces an operational snow product named gdas.tXXz.snocvr.tmXX.bufr_d (XX is the 00, 06, 12, or 18 Z cycle). The station-based data cover some European countries plus the US and Canada, mainly from MADIS. These BUFR snow data support the community land snow DA work. A yaml file needs to be created so that bufr2ioda.x can produce an ioda v3 NetCDF4 file. The converted data need to be checked against the original BUFR data to ensure that the conversion is correct. The ctest needs to pass, and the yaml file needs to be merged into the ioda-converters develop branch.

Requirements

The yaml file needs to be created and tested.

Acceptance Criteria (Definition of Done)

Ctest will pass within ioda-bundle and the yaml file will be merged into the ioda-converters develop branch.

Dependencies

None.

YoulongXia-NOAA commented 1 year ago

On Hera, I am working on the new snow data to create a yaml file. I put an input file, gdas.t06z.snocvr.tm00.bufr_d, in /scratch1/NCEPDEV/da/Youlong.Xia/test/test_fromRon/testinput; I also used subset.x to reduce it to a smaller file, sizeReduced_gdas.t06z.snocvr.tm00.bufr_d. With my yaml file I produced nc files in /scratch1/NCEPDEV/da/Youlong.Xia/test/test_fromRon/testrun: gdas.t06z.snocvr.tm00.nc and sizeReduced_gdas.t06z.snocvr.tm00.nc. Using ncdump I found that stationIdentification contains garbage characters in some places. I used debufr to check and found the message below: all IDs are missing, including the WGOSLID that Iliana suggested I use. I think bufr2ioda.x cannot handle MISSING characters, or there is some other reason. ncdump shows it like this: stationIdentification = "570282", "▒▒▒▒▒▒▒▒", "▒▒▒▒▒▒▒▒", "▒▒▒▒▒▒▒▒", "▒▒▒▒▒▒▒▒", "▒▒▒▒▒▒▒▒", "613180", "273749", "06631", "▒▒▒▒▒▒▒▒",

The 10th record is below:

BUFR message # 10 of type NC000015 and date 2023050103 contains 1 subsets:

MESSAGE TYPE NC000015

    001125 WGOSIDS   MISSING   NUMERIC          WIGOS IDENTIFIER SERIES
    001126 WGOSISID  MISSING   NUMERIC          WIGOS ISSUER OF IDENTIFIER
    001127 WGOSISNM  MISSING   NUMERIC          WIGOS ISSUE NUMBER
    001128 WGOSLID   MISSING   (16)CCITT IA5    WIGOS LOCAL IDENTIFIER (CHARACTER)
    001101 STID      424.0     CODE TABLE       State identifier  424 = United States of America (RA IV)
    001102 NSID      MISSING   NUMERIC          National station number
    001019 LSTN      BBSA3     ( 32)CCITT IA5   Long station or site
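One way to cope with records like this, where the preferred identifier (WGOSLID) is missing but LSTN is populated, is to fall back to the other mnemonic wherever the preferred one is masked. A minimal numpy sketch (all array contents here are hypothetical, not from the actual file):

```python
import numpy as np

# Hypothetical station IDs: WGOSLID is missing (masked) for the first record,
# while LSTN is populated for both records.
wgoslid = np.ma.masked_array(['', '613180'], mask=[True, False])
lstn = np.array(['BBSA3', 'ANOTHER'])

# Use LSTN wherever WGOSLID is masked, otherwise keep WGOSLID
sid = np.where(np.ma.getmaskarray(wgoslid), lstn, np.asarray(wgoslid))
```

This is only a sketch of the fallback idea; the thread below settles on different handling.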

YoulongXia-NOAA commented 1 year ago

@barlage suggested using LSTN after he looked at the MADIS web viewer: https://madis-data.ncep.noaa.gov/MadisSurface/.

I tested his LSTN suggestion using the bufr2ioda.x developed by Ron; the results are shown below:

stationIdentification = "Leukerba", "RNHW4", "MCKN5", "BBSW4", "LIZC2", "TPEM8", "Bosco Gu", "Ofenpass", "BERN / Z", "BBSA3", "COSW4", "MANM8", "TWNW4", "Malbun /", "Nendaz /", "PLAFFEIE", "Julier /", "SURU1", "DIAO3", "MULM8", "YOUW4", "Dotra /", "Malbun /", "Chaussy", "Grammont", "PRVT2", "KNXT2", "GLSI1", "RESC2", "HDLI1", "Puzzetta", "MWDO3", "ISAP4", "JVCI1", "BLTU1", "LMDO3", "TRGW1", "Arolla /", "Vals / P", "Stockhor", "Hinterrh", "Unterwas", "Planacha", "Flumserb", "Saanenmo", "DAVOS", "Gantrisc", "Ftan", "Plaun La", "Goschene", "Binn", "Trubsee", "Arolla", "Lauchere", "Oberwald", "TKVA2", "VNCM6", "GANA2", "MANH1", "Stockhor", "MHWO3", "RAWC2", "BKSI1", "LSMU1", "SUCW4", "Bedretto", "Nendaz /", "Ybrig /", "SGMC2", "CCKU1", "LWDI1", "TCMO3", "Arbaz, V", "Mesocco", "GRIMSEL", "La Fouly", "MTBU1", "CULC2", "MOQI1", "WCTA3", "Campolun", "Jaun / C", "Wassen /", "Airolo /", "PTNK2", "CTSM7", "FISW1", "PRSN2", "JNPC2", "Ofenpass", "SKZC2", "PERM6", "WSMU1", "BRBI1", "LMCN2", "TIPO3", "Jaun / C", "Simplon", "Davos /", "Fluela /", "Corvatsc", "Planacha", "Murren", "Saanenmo", "Bourg-St", "Gantrisc", "Ftan", "Plaun La", "Goschene", "Innergla", "Trubsee", "Ulrichen", "Elm / Mi", "Nara / M", "KTSA2", "FTRO2", "BSBM7", "WPLH1" ;

Some stations contain "/". When I debufr the 9th BUFR message, it shows: 001019 LSTN BERN / ZOLLIKOFEN ( 32)CCITT IA5. This means there is a station named BERN / ZOLLIKOFEN, so the name contains "/". bufr2ioda.x truncates it to "BERN / Z". If @ilianagenkova agrees that I use LSTN, @rmclaren needs to modify bufr2ioda.x to keep only the first part when "/" appears. All, if you have any further comments/suggestions, please let me know.

rmclaren commented 1 year ago

@YoulongXia-NOAA I would use the python API instead of the YAML file. This way you'd have total freedom to handle these strings in any way you want. Otherwise we'd be modifying core BUFR reading code every time a specific file has anything funny in it...

YoulongXia-NOAA commented 1 year ago

@BenjaminRuston @ilianagenkova, do you think I can use LSTN as stationIdentification? If yes: @rmclaren is suggesting I use the Python API rather than a yaml file. What do you think, @BenjaminRuston, given that the yaml file and bufr2ioda.x cannot handle "/"?

BenjaminRuston commented 1 year ago

I'm fine with using the python API to handle these unique station names for this dataset

YoulongXia-NOAA commented 1 year ago

@rmclaren @BenjaminRuston, okay, I will work on the python API for this dataset, using LSTN as the stationIdentification. Thank you.

rmclaren commented 1 year ago

Here's an example:

import bufr
import ioda_obs_space as ioda_ospace

DATA_PATH = './testinput/gdas.t00z.1bhrs4.tm00.bufr_d'
OUTPUT_PATH = './testrun/bufr_query_python_to_ioda_test.nc'

def test_bufr_to_ioda():

    # Make the QuerySet for all the data we want
    q = bufr.QuerySet()
    q.add('year', '*/YEAR')
    q.add('month', '*/MNTH')
    q.add('day', '*/DAYS')
    q.add('hour', '*/HOUR')
    q.add('minute', '*/MINU')
    q.add('second', '*/SECO')
    q.add('latitude', '*/CLAT')
    q.add('longitude', '*/CLON')
    q.add('radiance', '*/BRIT/TMBR')

    # Open the BUFR file and execute the QuerySet
    with bufr.File(DATA_PATH) as f:
        r = f.execute(q)

    # Use the ResultSet returned to get numpy arrays of the data
    lat = r.get('latitude')
    lon = r.get('longitude')
    rad = r.get('radiance')

    datetime = r.get_datetime('year', 'month', 'day', 'hour', 'minute', 'second')

    # Create the dimensions
    dims = {'Location': rad.shape[0],
            'Channel': rad.shape[1]}

    # Create the IODA ObsSpace
    obsspace = ioda_ospace.ObsSpace(OUTPUT_PATH, mode='w', dim_dict=dims)

    # Create the global attributes
    obsspace.write_attr('MyGlobal_str', 'My Global String Data')
    obsspace.write_attr('MyGlobal_int', [1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

    # Create the variables
    obsspace.create_var('MetaData/Timestamp', dtype=datetime.dtype, fillval=datetime.fill_value) \
        .write_attr('units', 'seconds_since_epoch') \
        .write_attr('long_name', 'Timestamp') \
        .write_data(datetime)

    obsspace.create_var('MetaData/Latitude', dtype=lat.dtype, fillval=lat.fill_value) \
        .write_attr('units', 'degrees_north') \
        .write_attr('long_name', 'Latitude') \
        .write_attr('valid_range', [-90.0, 90.0]) \
        .write_data(lat)

    obsspace.create_var('MetaData/Longitude', dtype=lon.dtype, fillval=lon.fill_value) \
        .write_attr('units', 'degrees_east') \
        .write_attr('long_name', 'Longitude') \
        .write_attr('valid_range', [-180.0, 180.0]) \
        .write_data(lon)

    obsspace.create_var('ObsValue/brightnessTemperature',  dtype=rad.dtype,
                        dim_list=['Location', 'Channel'], fillval=rad.fill_value) \
        .write_attr('units', 'K') \
        .write_attr('long_name', 'Observed (Uncorrected) Brightness Temperature') \
        .write_attr('valid_range', [100.0, 500.0]) \
        .write_data(rad)

YoulongXia-NOAA commented 1 year ago

Thank you, @rmclaren. I was referring to Nick's example, but it is for a prepbufr file. Your example may be more appropriate for me as a reference.

rmclaren commented 1 year ago

Please use import ioda_obs_space as ioda_ospace instead of ioda directly, as this will vastly simplify your code. I'm updating the example in ioda-converters (bufr_query_python_to_ioda_test.py) to use this as well.

YoulongXia-NOAA commented 1 year ago

@rmclaren, my bufr file has no second variable. It just includes year, month, day, hour, and minute. Can I omit second, or do I need to set second = 00 or something else? Thank you.

rmclaren commented 1 year ago

Just leave off the seconds; they should be optional.

YoulongXia-NOAA commented 1 year ago

@rmclaren, on Hera, what python environment do you use? I used the python environment below:

    setenv IODA_BUILD_DIR /scratch2/NCEPDEV/stmp3/Youlong.Xia/snowIodaBundle/ioda-bundle/build
    setenv IODA_CONV_DIR /scratch2/NCEPDEV/stmp3/Youlong.Xia/snowIodaBundle/ioda-bundle/iodaconv/src
    setenv python /scratch1/NCEPDEV/da/python/opt/core/miniconda3/4.6.14/envs/gdasapp/bin/python3.7
    setenv IODAPY $IODA_BUILD_DIR/lib/python3.7/pyioda
    setenv BUFR $IODA_BUILD_DIR/lib/pyiodaconv
    setenv PYTHONPATH ${PYTHONPATH}:${IODA_CONV_DIR}:${IODAPY}:${BUFR}

But it looks like it does not work. Thank you.

rmclaren commented 1 year ago

You might want to change the datetime section to be like the following:

    obsspace.create_var('MetaData/Timestamp', dtype=np.int64, fillval=datetime.fill_value.astype('int64')) \
        .write_attr('units', 'seconds_since_epoch') \
        .write_attr('long_name', 'Timestamp') \
        .write_data(datetime.astype('int64'))

Do you see any error messages after you run the updated code?

YoulongXia-NOAA commented 1 year ago

Yes, after I made the suggested change and also added import numpy as np, it works. I then used ncdump to write the output as nc_test.out in /scratch1/NCEPDEV/da/Youlong.Xia/test/pythonAPI/testrun, and diffed it against ymal_nc.out (in /scratch1/NCEPDEV/da/Youlong.Xia/test/test_fromRon/testrun). The only differences are in some strings:

    < Location = UNLIMITED ; // (118 currently)
    ---
    > Location = 118 ;
    5c5
    < int64 Location(Location) ;
    ---
    > int Location(Location) ;
    11,12d10
    < string :MyGlobal_str = "My Global String Data" ;
    < :MyGlobal_int = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;
    24,25c22
    < dateTime:_FillValue = -9223372036854775806LL ;
    < string dateTime:units = "seconds since 1970-01-01T00:00:00Z" ;
    ---
    > dateTime:_FillValue = 9223372036854775807LL ;
    26a24
    > string dateTime:units = "seconds since 1970-01-01T00:00:00Z" ;
    29d26
    < string latitude:units = "degrees_north" ;
    30a28
    > string latitude:units = "degrees_north" ;
    34d31
    < string longitude:units = "degrees_east" ;
    35a33
    > string longitude:units = "degrees_east" ;
    39d36
    < string stationElevation:units = "m" ;
    40a38,39
    > string stationElevation:units = "m" ;
    > string stationElevation:coordinates = "longitude latitude" ;
    43c42
    < string stationIdentification:long_name = "Long Station or Site Name" ;
    ---
    > string stationIdentification:long_name = "WIGOS Local Identifier" ;
    142d140
    < string totalSnowDepth:units = "m" ;
    143a142,143
    > string totalSnowDepth:units = "m" ;
    > string totalSnowDepth:coordinates = "longitude latitude" ;

But the data are the same. This means "BERN / Z" still exists in the python API test.

rmclaren commented 1 year ago

So now that you have those strings in a python numpy array, you can manipulate them any way you want.

rmclaren commented 1 year ago

For example you could do this:

sid = np.array([id.split('/')[0] for id in sid])  # get the part before the /

YoulongXia-NOAA commented 1 year ago

Before this line, I print:

    print(sid.dtype)
    print(sid.fill_value)

and it shows:

    object

After this line, the same prints give:

    <U8
    Traceback (most recent call last):
      File "bufr_ncep_snocvr_api.py", line 92, in <module>
        snow_bufr_to_ioda()
      File "bufr_ncep_snocvr_api.py", line 43, in snow_bufr_to_ioda
        print(sid.fill_value)

I also printed sid after this line: "/" is indeed removed, but the variable characteristics seem to change.

YoulongXia-NOAA commented 1 year ago

@rmclaren, it looks like adding your line leads to the errors, although it does indeed remove "/".

rmclaren commented 1 year ago

oops, ya that is not a masked array. So actually you might want to do this.

At the top:

import numpy as np

Then do this instead:

sid = np.ma.array([id.split('/')[0] for id in sid], mask=sid.mask, fill_value=sid.fill_value) 

YoulongXia-NOAA commented 1 year ago

@rmclaren, after I do that, it leads to the error below:

    Traceback (most recent call last):
      File "bufr_ncep_snocvr_api.py", line 93, in <module>
        snow_bufr_to_ioda()
      File "bufr_ncep_snocvr_api.py", line 81, in snow_bufr_to_ioda
        obsspace.create_var('MetaData/stationIdentification', dtype=sid.dtype, fillval=sid.fill_value) \
      File "/scratch2/NCEPDEV/stmp3/Youlong.Xia/snowIodaBundle/ioda-bundle/build/lib/python3.7/pyioda/ioda_obs_space/__init__.py", line 276, in create_var
        params=fparams)
    TypeError: create(): incompatible function arguments. The following argument types are supported:
        1. (self: ioda._ioda_python.Has_Variables, name: str, dtype: ioda._ioda_python.Types, dimsCur: List[int] = [1], dimsMax: List[int] = [], scales: List[ioda._ioda_python.Variable] = [], params: ioda._ioda_python.VariableCreationParameters = <ioda._ioda_python.VariableCreationParameters object at 0x7fd52dcd7670>) -> ioda._ioda_python.Variable
        2. (self: ioda._ioda_python.Has_Variables, name: str, dtype: ioda._ioda_python.Type, dimsCur: List[int] = [1], dimsMax: List[int] = [], params: ioda._ioda_python.VariableCreationParameters = <ioda._ioda_python.VariableCreationParameters object at 0x7fd52dcd76f0>) -> ioda._ioda_python.Variable

    Invoked with: <ioda.Has_Variables: [ Location ]>, 'MetaData/stationIdentification', None; kwargs: scales=[<ioda._ioda_python.Variable object at 0x7fd52dce1db0>], params=<ioda._ioda_python.VariableCreationParameters object at 0x7fd52dce1e70>

YoulongXia-NOAA commented 1 year ago

@rmclaren, it seems that when I use sid directly, it works. I checked print(sid.dtype) and it is object. After I add that line, sid.dtype becomes <U8, an 8-character string. Does the code not allow this kind of manipulation in python?

rmclaren commented 1 year ago

I'm working on some deficiencies in how the ioda code recognizes strings (the current implementation is not very generic). Try this:

sid = sid.astype('S20') 

I'm hoping this will result in a type of string that will be recognized properly.
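Putting the thread's pieces together, the split-plus-cast sequence looks like the sketch below (station names are hypothetical; in the real converter sid comes from the ResultSet, and the strip() call is an addition here to drop the space left after splitting):

```python
import numpy as np

# Hypothetical station IDs, as the masked array the reader returns
sid = np.ma.masked_array(['BERN / Z', '570282'], mask=[False, False])

# Keep only the part before '/', preserving the mask
sid = np.ma.array([s.split('/')[0].strip() for s in sid], mask=sid.mask)

# Cast to fixed-width byte strings so the ioda python bindings accept the dtype
sid = sid.astype('S20')
```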

YoulongXia-NOAA commented 1 year ago

@rmclaren, YES, IT WORKS. Thank you; this issue has been solved. Following the suggestions from the obsproc team scientists, Jeff and Iliana, I will read all four WIGOS components and make a WSI such as 0-20000-0-82403 => WGOSIDS-WGOSISID-WGOSISNM-WGOSLID, although I cannot read the last one. Mitch told us that it is a 16-byte character string; I am wondering if the python API can only read 8-byte character strings. I can start to look at how to combine these four strings into a single string and then output it.
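Once the four components are in hand, assembling the WSI string is a simple join; a minimal sketch with hypothetical component values (only the 0-20000-0-82403 example above is from the thread):

```python
import numpy as np

# Hypothetical WIGOS components for two stations
wgosids = np.array([0, 0])               # WGOSIDS: identifier series
wgosisid = np.array([20000, 20000])      # WGOSISID: issuer of identifier
wgosisnm = np.array([0, 0])              # WGOSISNM: issue number
wgoslid = np.array(['82403', '613180'])  # WGOSLID: local identifier (character)

# WSI format: WGOSIDS-WGOSISID-WGOSISNM-WGOSLID, e.g. 0-20000-0-82403
wsi = np.array(['-'.join((str(s), str(i), str(n), l))
                for s, i, n, l in zip(wgosids, wgosisid, wgosisnm, wgoslid)])
```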

rmclaren commented 1 year ago

I'll have to look into it... I don't believe I support reading strings that long right now; it is limited to 8-byte strings, I believe. Not sure how they encode 16-byte strings in BUFR.

rmclaren commented 1 year ago

Please write an issue.

YoulongXia-NOAA commented 1 year ago

Will do soon and let you know. Thank you @BenjaminRuston

YoulongXia-NOAA commented 1 year ago

Oops, Thank you @rmclaren :)

BenjaminRuston commented 1 year ago

@YoulongXia-NOAA I think this work is underway, or complete? These are files on the GTS but not NOMADS, correct? It would be good to get a few full-volume samples as well. And these started around Nov 2022, or something recent like that, correct?

YoulongXia-NOAA commented 1 year ago

@BenjaminRuston, this is underway. The python API cannot handle the long string WGOSLID well (Ron is working on that), and while I wanted to use LSTN instead to represent stationIdentification, the users want to use the four WIGOS components as the station identification. We still need to decide the name definitions for the four WIGOS components. I will keep you updated.

YoulongXia-NOAA commented 1 year ago

@BenjaminRuston, I used 1 May 2023 data produced by the EMC obsproc team, as the product has only been produced since April 2023.

YoulongXia-NOAA commented 1 year ago

@BenjaminRuston, full-volume samples are in /scratch2/NCEPDEV/marineda/Youlong.Xia/save/bufr_snowCover_GTS/gdas.20230501 and /scratch2/NCEPDEV/marineda/Youlong.Xia/save/bufr_snowCover_GTS/gdas.20230502 on Hera. Each day has 4 cycles and each cycle has one file. They were uploaded from the wcoss2 obsproc directory to Hera.

BenjaminRuston commented 11 months ago

@gibbsp did YoulongXia get removed?

BenjaminRuston commented 11 months ago

@YoulongXia-NOAA there you are... these files aren't available from NOMADS, correct, and they began around Nov 2022, is that correct?

Yes, we should try a few files and get this moved forward. The data should be in the ADPSFC file soon, is that the idea?

YoulongXia-NOAA commented 11 months ago

@BenjaminRuston, from Nov 2022 it includes only several European countries. From the end of April 2023 it includes both MADIS and those European countries; MADIS includes US and Canada stations. Some bufr data variable names seem to have changed. I have put many example data files in /scratch2/NCEPDEV/marineda/Youlong.Xia/save/bufr_snowCover_GTS/ on Hera for the various states of the data when checking was requested, including several data file examples for May 2023. I can get the ops data from WCOSS2. See https://nomads.ncep.noaa.gov/pub/data/nccf/com/obsproc/prod/gdas.20231001/, e.g., gdas.t00z.snocvr.tm00.bufr_d. It is different from gdas.t00z.1bmhs.tm00.bufr_d, which contains GTS snow depth. I am not sure whether the obsproc team will combine them into one file or our ioda-converter needs to combine the two files into a single one. For now they will be processed separately.

BenjaminRuston commented 11 months ago

Great, thanks for the information. Following, except I think you mean gdas.t00z.adpsfc.tm00.bufr_d.nr

and then the other is the *.snocvr.* file. I thought I heard from someone, maybe @ClaraDraper-NOAA, that these would be merged into a single file, most likely into the adpsfc, at some point.

ClaraDraper-NOAA commented 11 months ago

@BenjaminRuston These files are all EMC-related, so it wasn't me that told you that.

YoulongXia-NOAA commented 11 months ago

@BenjaminRuston @ClaraDraper-NOAA, as @ilianagenkova is on this list and she is an obsproc team lead, she can give us an update on whether these two bufr files related to snow depth will be combined or kept separate. As snow depth is one of many variables in the adpsfc bufr file, I am not sure they can easily be merged. For now, I process them as two separate files. If you want to combine them into one ioda file in the future, the python API can do that.
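If the two files did have to be merged on the converter side, the python API approach would reduce to concatenating the per-variable arrays from the two files before writing the ObsSpace. A sketch with hypothetical snow depth values:

```python
import numpy as np

# Hypothetical snow depths (m) read from the two BUFR sources
depth_adpsfc = np.ma.masked_array([0.10, 0.25], mask=[False, False])
depth_snocvr = np.ma.masked_array([0.30], mask=[False])

# Concatenate along the Location dimension before creating the ObsSpace;
# the combined array then defines the merged Location size.
depth = np.ma.concatenate([depth_adpsfc, depth_snocvr])
```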

BenjaminRuston commented 11 months ago

Yeah, I would agree. We do have some ability to detect duplicates across observation spaces; this allows us to leave them separate and maybe prioritize one over the other.

BenjaminRuston commented 11 months ago

@YoulongXia-NOAA and @ClaraDraper-NOAA, should we add a ctest for this data type?

Has someone already started this? If not, it looks straightforward; there's already a yaml to test. Was there a PR for this?

ClaraDraper-NOAA commented 11 months ago

@YoulongXia-NOAA I'm not sure which data set this is.

  1. Is this the snow cover / snow depth that we currently assimilate operationally? or
  2. Is it station snow depth observations (the obs that Jiarui is assimilating)?

BenjaminRuston commented 11 months ago

I thought @jiaruidong2017 has this in. It's a new data feed of station snow depth that comes in a separate BUFR file from the ones contained in the ADPSFC. An example is this:

https://ftpprd.ncep.noaa.gov/data/nccf/com/obsproc/prod/gdas.20231010/gdas.t00z.snocvr.tm00.bufr_d

It's supposedly on the GTS, but it's not being archived in the NCAR RDA, which is inconvenient for me, and it only started (I always forget) around Nov 2022 or maybe spring 2023, something like that.

YoulongXia-NOAA commented 11 months ago

@BenjaminRuston @ClaraDraper-NOAA, it seems not ready for a test. The reason is that this bufr data is missing RPID for US data and uses LSTN instead. When the yaml file uses LSTN, the station ID will contain some "/" or ",". Certainly I can use the python API to cut the string and keep the first part before "/" or ",", but I am not sure if this is what you want. As I did not get any guidance for such a case, I did not create a PR and test. Do you have any suggestions?

YoulongXia-NOAA commented 11 months ago

@BenjaminRuston, additionally, @jiaruidong2017 and I are discussing this dataset with the EMC obsproc team to see if they can add station IDs, as the original netcdf data has station IDs. But I am still not sure how to move forward to test this dataset.

YoulongXia-NOAA commented 11 months ago

@ClaraDraper-NOAA, as you may know, the adpsfc bufr data (GTS) has very sparse station snow depth observations in the US and Canada. The snocvr file includes snow depth observations from several European countries, which are not included in the adpsfc bufr data, plus several thousand stations in the US and Canada (from MADIS, including SNOTEL and other US station observations). This dataset is being produced by the EMC obsproc team as an operational dataset. It is an addition for operational DA work, in particular for North America and several northern European countries. This is what I know.

jiaruidong2017 commented 11 months ago

Thanks @YoulongXia-NOAA for providing the detailed information on the new GTS-based snow depth data. Yes, it is not ready yet to add a ctest for this data because of the missing station IDs in the US data.

BenjaminRuston commented 11 months ago

If you want to put in the python API solution for LSTN, I will not stop you. But you say the original station IDs would not have this problem, so just update then?

YoulongXia-NOAA commented 11 months ago

@BenjaminRuston, this is the message from the EMC obsproc team: "Some additional information regarding USA LSTNs. As you stated for US snow data, the "LSTN values look more like code for a station, not like a station name". The advantage of these LSTNs is that they are short and can be "shoe-horned" if necessary into WGOSLID". I will wait for their further update on this dataset.

BenjaminRuston commented 11 months ago

Can you provide what you have, particularly for the python API? I would think it's worth the effort to start looking at the data.

YoulongXia-NOAA commented 11 months ago

@BenjaminRuston, bufr_ncep_snocvr_api.py in /scratch1/NCEPDEV/da/Youlong.Xia/test/pythonAPI on Hera. The output is sizeReduced_gdas.t06z.snocvr.tm00.bufr_d.nc in /scratch1/NCEPDEV/da/Youlong.Xia/test/pythonAPI/testrun.

YoulongXia-NOAA commented 8 months ago

A python API converter is working within GDASApp - https://github.com/NOAA-EMC/GDASApp/pull/848