can't get some values from ECMWF tf.bufr

DaiDai-Dad commented 1 year ago

What happened?

I have download the .bufr from https://data.ecmwf.int/forecasts/20230905/00z/0p4-beta/oper/, I want get [time, lat, lon, windspeed, pressure] about "13W", then I try this: pdbufr.read_bufr(bufrFp, columns=("stormIdentifier", "timePeriod", "latitude", "longitude", "windSpeedAt10M", "pressureReducedToMeanSeaLevel", ), filters={"stormIdentifier": "13W", }, ) but I get an empty dataframe then I delete "timePeriod" and "windSpeedAt10M" from columns above, it can get 20 rows: ` stormIdentifier latitude longitude pressureReducedToMeanSeaLevel

0 13W 23.4 116.8 99900.0 1 13W 23.5 117.0 100000.0 2 13W 23.5 116.1 100200.0 3 13W 22.8 115.2 100300.0 4 13W 23.4 114.6 100500.0 5 13W 23.4 114.5 100300.0 6 13W 23.5 112.9 100500.0 7 13W NaN NaN NaN 8 13W NaN NaN NaN 9 13W 24.0 112.0 100400.0 10 13W 23.6 111.7 100400.0 11 13W 24.0 109.8 100500.0 12 13W 24.0 109.8 100500.0 13 13W 23.9 109.4 100400.0 14 13W 23.6 107.0 100400.0 15 13W NaN NaN NaN 16 13W 24.4 108.6 100500.0 17 13W 23.7 106.9 100300.0 18 13W 23.7 107.0 100300.0 19 13W 23.7 106.9 100400.0`

then I delete "pressureReducedToMeanSeaLevel" from columns like this: pdbufr.read_bufr(bufrFp, columns=("stormIdentifier", "latitude", "longitude"), filters={"stormIdentifier": "13W", }, ) I got 40 rows: ` stormIdentifier latitude longitude

0 13W 23.5 117.3 1 13W 23.4 116.8 2 13W 23.4 119.9 3 13W 23.5 117.0 4 13W 25.3 121.1 5 13W 23.5 116.1 6 13W 25.5 120.5 7 13W 22.8 115.2 8 13W 22.8 116.7 9 13W 23.4 114.6 10 13W 19.4 116.8 11 13W 23.4 114.5 12 13W 20.1 113.3 13 13W 23.5 112.9 14 13W 20.8 115.4 15 13W NaN NaN 16 13W NaN NaN 17 13W NaN NaN 18 13W NaN NaN 19 13W 24.0 112.0 20 13W 26.2 111.4 21 13W 23.6 111.7 22 13W 22.4 113.7 23 13W 24.0 109.8 24 13W 19.9 111.4 25 13W 24.0 109.8 26 13W 20.0 109.3 27 13W 23.9 109.4 28 13W 22.4 113.7 29 13W 23.6 107.0 30 13W 20.8 110.6 31 13W NaN NaN 32 13W NaN NaN 33 13W 24.4 108.6 34 13W 21.3 109.5 35 13W 23.7 106.9 36 13W 19.5 108.6 37 13W 23.7 107.0 38 13W 20.0 108.2 39 13W 23.7 106.9 40 13W 19.7 107.9`

oh, no, why this? I use eccodes.codes_dump() get a json file, but I can't understand the structure because it's nested multilevel. I can found 31 "timePeriod" in the json by ctrl-f, why can't I get it in the 1st case above? If some row don't have "timePeriod", why not the value is NaN? All in all, why I got different rows within different columns and same filter?

some files are here: https://github.com/DaiDai-Dad/temp/blob/main/20230905000000-240h-oper-tf.bufr https://github.com/DaiDai-Dad/temp/blob/main/output.json

What are the steps to reproduce the bug?

see above

Version

0.11.0

Platform (OS and architecture)

Windows10

Relevant log output

No response

Accompanying data

No response

Organisation

No response

sandorkertesz commented 1 year ago

Thank you for reporting this issue. Please consider the following.

BUFR, as a message is flat, but has a hierarchical (tree) structure. The standards say: This means that each "coordinate descriptor" element introduces a new level in the hierarchy. bufr_dump can reveal this structure for you, but it is not easy to understand what you get. I personally use tools like CodesUI or Metview, which provide graphical user interfaces to inspect the hierarchy.

Now look at the first values of the elements you want to collect from the tree in file 20230905000000-240h-oper-tf.bufr I inspected the first message contaning storm "11E". I took this snapshot with CodesUI (in this tree some nodes with no siblings are not indented, so the structure is simpler than the bufr_dump) :

CodesUI reveals these values:

timePeriod: 6 windSpeedAt10M: 9.8 pressureReducedToMeanSeaLevel: 1000400

Let us look at the tree branches they are located in. In bracket you can see the coordinate descriptor, its first 2 numbers define the "class", remember if "class" < 10 we get a new node in the tree. The elements you want to extract are located under the following tree branches.

"timePeriod" and "windSpeedAt10M":

  |- stromIdentifier: 11E [01025]
    ...
       |- hour: 0 [04004]
          |- minute: 0 [04005]
             ...
             |- meteorologicalAttributeSignificance: 3 [08005]
                |- latitude: 14.3 [05002]
                   |- longitude: -105.6 [06002]
                       |- windSpeedAt10M: 9.8  [11012]
                       ...
                       |- bearingOrAzimuth: 0 [05021]
                          |- timeSignificance: 1 [08021]
                              |- timePeriod: 6 [04024]

"pressureReducedToMeanSeaLevel":

|- stromIdentifier: 11E [01025]
    ...
      |- hour: 0 [04004]
          |- minute: 0 [04005]
             ... 
             |- meteorologicalAttributeSignificance: 5 [08005]
                 |- latitude: 12.4 [05002]
                    |- longitude: -105.9  [06002]
                        |- pressureReducedToMeanSeaLevel: 1000400 [10051]

Now you can see that "windSpeedAt10M" and "pressureReducedToMeanSeaLevel" are in completely different branches under different latitude and logitude values. pdbufr cannot collect them together. However, as you noticed you can use:

columns=("stormIdentifier", "timePeriod", "latitude", "longitude", "windSpeedAt10M",),

and it works. However, because "timePeriod" and "pressureReducedToMeanSeaLevel" are in completely different branches under different latitude and logitude values this query would result in an empty dataframe:

columns=("stormIdentifier", "timePeriod", "latitude", "longitude", "pressureReducedToMeanSeaLevel",),

Your other questions can all be answered by inspecting the tree.

Unfortunately, this message has a very complicated structure and some combinations of the values simply cannot be extracted by the collector algorithm in pdbufr.

As for timePeriod, the message structure is wrong because if we follow the standards simply no timePeriod can be associated with pressureReducedToMeanSeaLevel, which probably was not the intention of the data producer.

sandorkertesz commented 1 year ago

Further complications. Having inspected the whole message structure I can conclude that although we have this hierarchy:

|- stromIdnetifier: 11E [01025]
    ...
       |- hour: 0 [04004]
          |- minute: 0 [04005]
             ...
             |- meteorologicalAttributeSignificance: 3 [08005]
                |- latitude: 14.3 [05002]
                   |- longitude: -105.6 [06002]
                       |- windSpeedAt10M: 9.8  [11012]
                       ...
                       |- bearingOrAzimuth: 0 [05021]
                          |- timeSignificance: 1 [08021]
                              |- timePeriod: 6 [04024]

timePeriod=6 was meant to refer to the next windSpeedAt10M value, and for the first windspeed the timePeriod is simply "missing" assuming its value is 0. No generic bufr filter will ever guess it! We can only conclude it after studying the tree by ourselves. Again, in my view this message was not encoded in the proper way and the structure does not reflect the intention of the data producer.

It causes all kinds of problems in pdbufr. With these columns:

columns=("stormIdentifier", "timePeriod", "latitude", "longitude", "windSpeedAt10M",),

the result is as follows:

   stormIdentifier  latitude  longitude  windSpeedAt10M  timePeriod
0              11E      14.3     -105.6             9.8           6
1              11E       9.4     -103.3            12.9          12
2              11E      10.0     -102.9            13.4          18
...
30             11E      25.7     -135.4            13.9         186

Now, we can notice that:

timeperiod actually refers to the latitude, longitude and windSpeedAt10M in the the next row!
as a consequence, the last latitude, longitude and windSpeedAt10M values from the message , "24.9, -135.9, 14.4", are omitted!

This is not a bug in pdbufr but the result of the way this information was encoded.

DaiDai-Dad commented 1 year ago

Thank you for your reply. With your guidance, I have resolved this issue.

It has been proven that your conclusion is correct. This bufr file is not suitable for reading with pdbufr, as it is nested layer by layer and not a typical phenotype structure.

The method I used is this url（ https://confluence.ecmwf.int/display/ECC/bufr_read_tropical_cyclone ）, as can be seen from this scheme, the data is indeed nested layer by layer (reading each data requires looping and adding ranks). I don't know why the data builder operates this way.

ecmwf / pdbufr