ecmwf / pdbufr

High-level BUFR interface for ecCodes
Apache License 2.0

Column with string data not expanded correctly from compressed subsets #35

Closed blaylockbk closed 1 year ago

blaylockbk commented 2 years ago

Hi, first I want to thank you for publishing pdbufr. It is awesome and is saving me a lot of headaches.

I am reading aircraft data from EMADDC, and my problem is that pdbufr does not seem to parse the field aircraftRegistrationNumberOrOtherIdentification properly. To follow along with my example, you can get a sample BUFR file from this page (scroll all the way to the bottom and click EHS or MRAR; the MRAR file is much smaller).

import pdbufr

df = pdbufr.read_bufr(
    'EMADDC_KNMI_MRAR_20210909_1500_20210909_1514.bufr',
    columns=[
        'latitude',
        'aircraftRegistrationNumberOrOtherIdentification',
        'airTemperature',
        'numberOfSubsets',
    ]
)
df

image

As you can see, each row's aircraft identifier is returned as a list of all the values in the subset, instead of one item per row like the other variables (e.g., latitude, temperature). As far as I can tell, the first 100 rows contain identical lists, then the next 100 rows contain identical lists, and so on.

My crude workaround is this: since I know each subset holds at most 100 items, I take the list stored in every 100th row of the DataFrame's aircraftRegistrationNumberOrOtherIdentification column and concatenate those lists. The result is what I expected pdbufr to return for the column.

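# Take the list stored in the first row of each 100-row subset
# and concatenate them into one flat list of identifiers.
aircraft_id = []
for i in range(0, len(df), 100):
    subset_list = df['aircraftRegistrationNumberOrOtherIdentification'].iloc[i]
    aircraft_id += subset_list

df['aircraft_id'] = aircraft_id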

image
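
The workaround above hard-codes a step of 100 rows, so it would misalign if any subset other than the last held fewer rows. A minimal sketch of a variant that reads the step from the list length instead is shown here. It assumes, as described in this report, that every row of a subset holds the same list and that the list has exactly one entry per row of that subset. The function name `flatten_repeated_lists` is hypothetical and is not part of pdbufr.

```python
from typing import Any, List, Sequence


def flatten_repeated_lists(cells: Sequence[List[Any]]) -> List[Any]:
    """Turn a column whose cells repeat a whole subset's list into one value per row.

    Assumption: every row of a subset holds the same list, and that list
    has exactly one entry per row of the subset.
    """
    flat: List[Any] = []
    i = 0
    while i < len(cells):
        subset_values = list(cells[i])
        if not subset_values:
            raise ValueError(f"empty list at row {i}")
        flat.extend(subset_values)
        i += len(subset_values)  # jump to the first row of the next subset
    if len(flat) != len(cells):
        # The assumption above does not hold for this column.
        raise ValueError("list lengths do not add up to the number of rows")
    return flat


# Toy column: a subset of 3 rows followed by a subset of 2 rows.
cells = [["A", "B", "C"]] * 3 + [["D", "E"]] * 2
print(flatten_repeated_lists(cells))
```

With the DataFrame from this report, the input would be `df['aircraftRegistrationNumberOrOtherIdentification'].tolist()`. The final length check makes the function fail loudly rather than assign misaligned identifiers.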


Is this a bug in how pdbufr parses aircraftRegistrationNumberOrOtherIdentification, or am I missing a setting or function that unpacks the column this way?

Thanks for your help!

sandorkertesz commented 1 year ago

Hi Brian,

Thank you for reporting this issue, and I am sorry for the long delay. I can confirm it is a bug in pdbufr: keys holding string data in compressed subsets do not seem to be extracted correctly. We have been busy with other developments, but we now plan to work on pdbufr again and fix this and other issues.

Best regards, Sandor
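
For readers unfamiliar with the term: in a compressed BUFR message all subsets share one structure, and a key's value is held either once (when it is the same in every subset) or as an array with one entry per subset. The sketch below is a toy model of expanding such a message into one row per subset. It is not pdbufr's or ecCodes' code; the dict layout and the function name `expand_compressed` are assumptions made for illustration only.

```python
from typing import Any, Dict, List


def expand_compressed(keys: Dict[str, Any], n_subsets: int) -> List[Dict[str, Any]]:
    """Expand a toy 'compressed' message into one dict per subset.

    Hypothetical layout: each key maps either to a single value that is
    constant across subsets, or to a list with one entry per subset.
    """
    rows = []
    for i in range(n_subsets):
        row = {}
        for name, value in keys.items():
            if isinstance(value, (list, tuple)) and len(value) == n_subsets:
                row[name] = value[i]  # per-subset value, numeric or string
            else:
                row[name] = value  # constant, repeated in every row
        rows.append(row)
    return rows


# Made-up values; only the key names come from the report above.
message = {
    "latitude": [52.1, 52.3, 51.9],
    "aircraftRegistrationNumberOrOtherIdentification": ["a1", "b2", "c3"],
    "numberOfSubsets": 3,
}
for row in expand_compressed(message, 3):
    print(row)
```

In this model the string key is indexed per subset exactly like the numeric one, which matches the one-identifier-per-row output the report expected.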

blaylockbk commented 1 year ago

Thanks for the response! I'm excited to see any new developments on this package 😁

sandorkertesz commented 1 year ago

Hi Brian, This issue has been fixed, but not yet released. Thanks again for the very detailed description you added to the issue.

Best regards, Sandor