ecmwf / pdbufr

High-level BUFR interface for ecCodes
Apache License 2.0

UnicodeDecodeError when parsing BUFR file from DWD #28

Open guidocioni opened 3 years ago

guidocioni commented 3 years ago

I haven't seen an open issue on this, forgive me if that's not the case.

I'm running the master version with eccodes v2.21.0.

I can successfully read the BUFR files from German weather stations at https://opendata.dwd.de/weather/weather_reports/synoptic/germany/ (like @meteoDaniel), but not the international ones at https://opendata.dwd.de/weather/weather_reports/synoptic/international/ . In the latter case, after running this

from pdbufr import read_bufr

df_stations = read_bufr(
    '/tmp/latest.bin',
    columns=(
        'stationOrSiteName',
        'latitude',
        'longitude',
        'heightOfStationGroundAboveMeanSeaLevel',
        'year', 'month', 'day', 'hour', 'minute',
    ),
)

I get

~/miniconda3/lib/python3.8/site-packages/gribapi/gribapi.py in grib_get_string(msgid, key)
    489     err = lib.grib_get_string(h, key.encode(ENC), values, length_p)
    490     GRIB_CHECK(err)
--> 491     return ffi.string(values, length_p[0]).decode(ENC)
    492 
    493 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

I can successfully see the file content using grib_dump, but I would like to avoid having to dump everything into JSON first :)

shahramn commented 3 years ago

I tried the ecCodes bufr_filter tool with the following rule:

set unpack = 1;
print "Msg #[count]:
   stationOrSiteName=[stationOrSiteName]
   latitude=[latitude], longitude=[longitude],
   heightOfStationGroundAboveMeanSeaLevel=[heightOfStationGroundAboveMeanSeaLevel],
   [year], [month], [day], [hour], [minute]";

It worked for me (no errors issued) using the two input files:

ZC_EDZW_20210518110802_bda01,synop_bufr_999999_999999MW_480.bin
ZC_EDZW_latest_bda01,synop_bufr_999999_999999MW_XXX.bin

shahramn commented 3 years ago

Can you please try with fewer keys to pin down which one is causing the error?
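One way to narrow this down is to request each key on its own and record which ones raise. The sketch below is only an illustration: `find_failing_keys` and `fake_read` are hypothetical names, and the stand-in reader merely imitates the failure reported above so the snippet runs without a BUFR file. With pdbufr installed, the reader could instead be a small wrapper around `read_bufr`.

```python
from typing import Callable, Dict, Iterable, Tuple


def find_failing_keys(
    read: Callable[[Tuple[str, ...]], object], keys: Iterable[str]
) -> Dict[str, str]:
    """Call `read` with one key at a time and collect the keys that raise."""
    failures: Dict[str, str] = {}
    for key in keys:
        try:
            read((key,))
        except Exception as exc:  # keep going so every key gets tested
            failures[key] = f"{type(exc).__name__}: {exc}"
    return failures


# Hypothetical stand-in for a real reader: it fails for one key only,
# mimicking the ASCII decode error from the traceback.
def fake_read(columns: Tuple[str, ...]) -> Tuple[str, ...]:
    if "stationOrSiteName" in columns:
        b"\xff".decode("ascii")
    return columns


result = find_failing_keys(fake_read, ["latitude", "longitude", "stationOrSiteName"])
for key, error in result.items():
    print(f"{key}: {error}")

# With pdbufr, the reader might look like this (untested assumption):
# find_failing_keys(lambda cols: read_bufr('/tmp/latest.bin', columns=cols), keys)
```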

guidocioni commented 3 years ago

> Can you please try with fewer keys to pin down which one is causing the error?

You're right, without stationOrSiteName I can successfully read the BUFR file.

I bet it has to do with the fact that some station names contain weird characters :)

Any workaround to avoid the encoding issue?

shahramn commented 3 years ago

Many thanks. Actually it seems this is because some stationOrSiteName values are MISSING. Normally this is a string e.g. VERLEGENHUKEN. This looks like a bug in the ecCodes Python bindings. I am investigating further
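For illustration, the error message can be reproduced without ecCodes. The sketch below assumes that a MISSING character field reaches Python as bytes with every bit set (0xff), which matches the "byte 0xff in position 0" in the traceback. The 20-byte width and the helper name `decode_ascii` are assumptions made for the example.

```python
# Assumption: a MISSING string value is delivered as bytes with all bits set.
missing_name = b"\xff" * 20      # illustrative 20-byte field
valid_name = b"VERLEGENHUKEN"    # a normal station name from the thread


def decode_ascii(raw: bytes) -> str:
    """Decode as ASCII, returning a short description if decoding fails."""
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError as exc:
        bad_byte = raw[exc.start]
        return f"failed: byte {bad_byte:#x} at position {exc.start}"


valid_result = decode_ascii(valid_name)
missing_result = decode_ascii(missing_name)
print(valid_result)
print(missing_result)
```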

guidocioni commented 3 years ago

> Many thanks. Actually it seems this is because some stationOrSiteName values are MISSING. Normally this is a string e.g. VERLEGENHUKEN. This looks like a bug in the ecCodes Python bindings. I am investigating further

Yes, from the dump of the BUFR file, the only odd thing I can see is that some stations have a missing name/type :) (in the JSON dump this is printed as null)

shahramn commented 3 years ago

I have confirmed that this is indeed a bug in the underlying ecCodes Python3 interface. I am working on a fix

shahramn commented 3 years ago

Can you try the following as a workaround:

from gribapi import *
gribapi.ENC = "unicode-escape"

Then try the rest of your code
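A rough sketch of why this workaround helps, using only the standard library: Python's "unicode-escape" codec maps non-ASCII bytes to the matching Latin-1 code points, so 0xff bytes decode to "ÿ" instead of raising. The all-0xff input and the helper `clean_name` are assumptions for illustration and are not part of ecCodes or pdbufr.

```python
ENC = "unicode-escape"  # the encoding suggested in the workaround

# Assumption: a MISSING name arrives as bytes with every bit set.
raw_missing = b"\xff" * 20
decoded = raw_missing.decode(ENC)  # no exception; yields 20 'ÿ' characters


def clean_name(value):
    """Hypothetical helper: map an all-'ÿ' placeholder to None."""
    if value and set(value) == {"\xff"}:
        return None
    return value


print(repr(decoded))
print(clean_name(decoded))
print(clean_name("VERLEGENHUKEN"))
```

Note that this codec also interprets backslash escape sequences, so a name containing a literal backslash could come out altered.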

guidocioni commented 3 years ago

> Can you try the following as a workaround:
>
> from gribapi import *
> gribapi.ENC = "unicode-escape"
>
> Then try the rest of your code

yep, that seems to work ;)

shahramn commented 3 years ago

The latest Python bindings for ecCodes (v1.3.3) fix this.