Grouping by data descriptor section is insufficient for different table versions

spanezz commented 5 months ago

Currently, input messages are grouped into output files by the contents of their data descriptor section, with the intention of guaranteeing that input data are homogeneous and all fit in the same output arrays.

In the case of the example given in #11, however, this breaks: both messages have identical data descriptor sections, containing:

301032 321021 025020 025021 008021 004025 101000 031001 321022

However, the first message has Table version: 26:1 and the second has Table version: 9:1.

Table 26:1 expands 321022 as:

    007007 Height[M]
    204001 1 bits of associated field
    031021 Associated field significance[CODE TABLE]
    011001 Wind direction[DEGREE TRUE]
    204000 0 bits of associated field
    011002 Wind speed[M/S]
    204001 1 bits of associated field
    031021 Associated field significance[CODE TABLE]
    011006 w-component[M/S]
    204000 0 bits of associated field
    021030 Signal to noise ratio[DB]

While table 9:1 expands it as:

    010007 HEIGHT[M]
    204001 1 bits of associated field
    031021 ASSOCIATED FIELD SIGNIFICANCE[CODE TABLE]
    011001 WIND DIRECTION[DEGREE TRUE]
    204000 0 bits of associated field
    011002 WIND SPEED[M/S]
    204001 1 bits of associated field
    031021 ASSOCIATED FIELD SIGNIFICANCE[CODE TABLE]
    011006 W-COMPONENT[M/S]
    204000 0 bits of associated field
    021030 SIGNAL TO NOISE RATIO[dB]

This causes the 007007/010007 discrepancy that was observed in #11.
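
To make the difference easy to spot, here is a minimal Python sketch that compares the two expansions of 321022 position by position. The code lists are copied from the dumps above; the function name `expansion_diff` is made up for illustration and is not part of bufr2netcdf.

```python
# Codes shared by both expansions of 321022, after the first (height) entry.
COMMON_TAIL = [
    "204001", "031021", "011001", "204000", "011002",
    "204001", "031021", "011006", "204000", "021030",
]

# Expansions of 321022 as listed above, codes only.
EXPANSION_26_1 = ["007007"] + COMMON_TAIL
EXPANSION_9_1 = ["010007"] + COMMON_TAIL


def expansion_diff(a, b):
    """Return (position, code_a, code_b) for every position where a and b differ."""
    if len(a) != len(b):
        raise ValueError("expansions have different lengths")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(expansion_diff(EXPANSION_26_1, EXPANSION_9_1))
# prints [(0, '007007', '010007')]
```

Only the first entry differs, so the data descriptor sections match while the expanded layouts do not.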

spanezz commented 5 months ago

One simple solution would be to also group by table version. The downside is that if two BUFR files use different tables that are nevertheless identical for the codes used, they will end up in different NetCDF files. Would that be an acceptable compromise?
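
Roughly, this first option amounts to widening the grouping key. The following Python sketch illustrates the idea only; it is not the actual bufr2netcdf code, and `Message`, `group_key` and `group_messages` are hypothetical names. The table version is kept as an opaque string.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Message(NamedTuple):
    """Hypothetical stand-in for a decoded BUFR message."""
    name: str
    table_version: str            # e.g. "26:1", treated as an opaque string
    descriptors: Tuple[str, ...]  # contents of the data descriptor section


def group_key(msg):
    # Old key: msg.descriptors only. New key: table version plus descriptors.
    return (msg.table_version, msg.descriptors)


def group_messages(messages):
    """Map each grouping key to the names of the messages that share it."""
    groups = defaultdict(list)
    for msg in messages:
        groups[group_key(msg)].append(msg.name)
    return dict(groups)


# Data descriptor section from the example in this issue.
DDS = ("301032", "321021", "025020", "025021", "008021",
       "004025", "101000", "031001", "321022")

messages = [Message("first", "26:1", DDS), Message("second", "9:1", DDS)]
groups = group_messages(messages)
print(len(groups))  # prints 2
```

With the descriptors-only key both messages would land in one group; with the wider key they are kept apart.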

Alternatively, I could index BUFR files by the recursive expansion of all the B and D codes in their data descriptor section, which may be more computationally expensive.
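
A minimal Python sketch of that indexing idea, under loud assumptions: the D tables below are toy dictionaries containing only the 321022 entries quoted earlier in this issue, and any D code missing from the toy table is left unexpanded, which a real implementation would not do. The names `expand`, `D_TABLE_26_1` and `D_TABLE_9_1` are hypothetical.

```python
# Codes shared by both expansions of 321022, after the first (height) entry.
COMMON_TAIL = ("204001", "031021", "011001", "204000", "011002",
               "204001", "031021", "011006", "204000", "021030")

# Toy D tables: only the 321022 entries quoted in this issue.
D_TABLE_26_1 = {"321022": ("007007",) + COMMON_TAIL}
D_TABLE_9_1 = {"321022": ("010007",) + COMMON_TAIL}

# Data descriptor section from the example in this issue.
DDS = ("301032", "321021", "025020", "025021", "008021",
       "004025", "101000", "031001", "321022")


def expand(descriptors, d_table):
    """Recursively replace D codes (F=3) found in d_table with their contents.

    Assumption: D codes absent from the toy table are kept as they are.
    """
    out = []
    for code in descriptors:
        if code.startswith("3") and code in d_table:
            out.extend(expand(d_table[code], d_table))
        else:
            out.append(code)
    return tuple(out)


key_26 = expand(DDS, D_TABLE_26_1)
key_9 = expand(DDS, D_TABLE_9_1)
print(key_26 == key_9)  # prints False
```

Two tables that expand the used codes identically would produce the same key, so their messages would share an output file. The price is one expansion per message, or per distinct pair of table version and data descriptor section if the result is cached.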

Alternatively, I could look into how complex it would be to group by table version first, and then merge the resulting arrays when they have the same shape.
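
A sketch of the merge step in Python, assuming that "same shape" means the same ordered list of expanded variable codes; that reading is an assumption, as are the function name `merge_same_layout`, the placeholder key `"dds-a"`, the `"hypothetical-table"` version and the row values, which are made up purely for illustration.

```python
def merge_same_layout(groups):
    """Merge per-table-version groups whose column layout is identical.

    groups maps (table_version, dds_id) -> (columns, rows), where columns is
    the ordered tuple of expanded variable codes and rows is a list of rows.
    """
    merged = {}
    for _key, (columns, rows) in groups.items():
        merged.setdefault(tuple(columns), []).extend(rows)
    return merged


# Toy input: three groups produced by grouping on table version first.
groups = {
    ("26:1", "dds-a"): (("007007", "011001", "011002"), [[1500.0, 270.0, 12.5]]),
    ("9:1", "dds-a"): (("010007", "011001", "011002"), [[1500.0, 265.0, 11.0]]),
    ("hypothetical-table", "dds-a"): (("007007", "011001", "011002"), [[2000.0, 280.0, 14.0]]),
}

merged = merge_same_layout(groups)
print(len(merged))  # prints 2
```

Here the two groups with a 007007 height column collapse into one, while the 010007 group stays separate.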

dcesari commented 5 months ago

Thank you for the detailed analysis. I think the first solution is acceptable, considering how the COSMO code works.

Would this also implicitly solve #11, without the need for conversion?

spanezz commented 5 months ago

I think it would indeed also solve #11.

spanezz commented 5 months ago

I pushed 7098e62e01807e09f418bf00c35d090bdfe51896: can you give it a try?

dcesari commented 5 months ago

I confirm that the modified version works without error also with the complete original BUFR file.

Can we publish a new release or do you need to make further updates?

spanezz commented 5 months ago

I have no other updates planned and you can publish a new release. I have just updated NEWS.md.

brancomat commented 5 months ago

v1.7-1 released (and already in copr repo)

ARPA-SIMC / bufr2netcdf

Grouping by data descriptor section is insufficient for different table versions #13