fsspec / kerchunk

Cloud-friendly access to archival data
https://fsspec.github.io/kerchunk/
MIT License

Variables missing from 'scan_grib', but findable with xarray and cfgrib #358

Open keltonhalbert opened 1 year ago

keltonhalbert commented 1 year ago

I'm encountering an interesting issue where the results of scan_grib differ from interacting with a file via xarray/cfgrib. Particularly, it is not detecting certain variables. Installation information:

Python 3.11.0 | packaged by conda-forge | (main, Jan 14 2023, 12:27:40) [GCC 11.3.0] on linux
kerchunk.__version__ : '0.2.0'
cfgrib.__version__: '0.9.10.4'
xarray.__version__: '2023.5.0'

In this particular case, the missing variable is the 10 meter V wind component. Using scan_grib:

import fsspec
import xarray as xr
from kerchunk.combine import MultiZarrToZarr, drop
from kerchunk.grib2 import scan_grib

filter_stn10 = {"typeOfLevel": "heightAboveGround", "level": 10}
gribfile = "./2021/anl/rap_252_20210415_0000_000.grb2"
scan = scan_grib(gribfile, filter=filter_stn10)
mzz = MultiZarrToZarr(
    scan,
    preprocess=drop(("time", "step")),
    concat_dims=["heightAboveGround"],
    identical_dims=["latitude", "longitude"],
)
d = mzz.translate()
fs = fsspec.filesystem("reference", fo=d)
m = fs.get_mapper("")
ds = xr.open_dataset(m, engine="zarr")
print(ds)

...
<xarray.Dataset>
Dimensions:            (heightAboveGround: 1, y: 225, x: 301, valid_time: 1)
Coordinates:
  * heightAboveGround  (heightAboveGround) int64 10
  * valid_time         (valid_time) datetime64[ns] 2021-04-15
Dimensions without coordinates: y, x
Data variables:
    latitude           (y, x) float64 ...
    longitude          (y, x) float64 ...
    u10                (heightAboveGround, y, x) float64 ...
Attributes:
    centre:             kwbc
    centreDescription:  US National Weather Service - NCEP
    edition:            2
    subCentre:          0

As you can see, there is no v10 variable output. Here's the printed output directly from scan_grib, showing that the data is missing here too:

[{'version': 1, 'refs': {'.zgroup': '{"zarr_format":2}', '.zattrs': '{"centre":"kwbc","centreDescription":"US National Weather Service - NCEP","edition":2,"subCentre":0}', 'u10/.zarray': '{"chunks":[225,301],"compressor":null,"dtype":"<f8","fill_value":3.4028234663852886e+38,"filters":[{"dtype":"float64","id":"grib","var":"u10"}],"order":"C","shape":[225,301],"zarr_format":2}', 'u10/0.0': ['{{u}}', 6501899, 70814], 'u10/.zattrs': '{"NV":0,"_ARRAY_DIMENSIONS":["y","x"],"cfName":"eastward_wind","cfVarName":"u10","dataDate":20210415,"dataTime":0,"dataType":"fc","endStep":0,"gridDefinitionDescription":"Lambert Conformal can be secant or tangent, conical or bipolar","gridType":"lambert","missingValue":3.4028234663852886e+38,"name":"10 metre U wind component","numberOfPoints":67725,"paramId":165,"shortName":"10u","stepType":"instant","stepUnits":1,"typeOfLevel":"heightAboveGround","units":"m s**-1"}', 'heightAboveGround/.zarray': '{"chunks":[1],"compressor":null,"dtype":"<i8","fill_value":null,"filters":null,"order":"C","shape":[1],"zarr_format":2}', 'heightAboveGround/0': '\n\x00\x00\x00\x00\x00\x00\x00', 'heightAboveGround/.zattrs': '{"_ARRAY_DIMENSIONS":["heightAboveGround"],"long_name":"height above the surface","positive":"up","standard_name":"height","units":"m"}', 'latitude/.zarray': '{"chunks":[225,301],"compressor":null,"dtype":"<f8","fill_value":null,"filters":[{"dtype":"float64","id":"grib","var":"latitude"}],"order":"C","shape":[225,301],"zarr_format":2}', 'latitude/0.0': ['{{u}}', 6501899, 70814], 'latitude/.zattrs': '{"_ARRAY_DIMENSIONS":["y","x"],"long_name":"latitude","standard_name":"latitude","units":"degrees_north"}', 'longitude/.zarray': '{"chunks":[225,301],"compressor":null,"dtype":"<f8","fill_value":null,"filters":[{"dtype":"float64","id":"grib","var":"longitude"}],"order":"C","shape":[225,301],"zarr_format":2}', 'longitude/0.0': ['{{u}}', 6501899, 70814], 'longitude/.zattrs': 
'{"_ARRAY_DIMENSIONS":["y","x"],"long_name":"longitude","standard_name":"longitude","units":"degrees_east"}', 'step/.zarray': '{"chunks":[1],"compressor":null,"dtype":"<f8","fill_value":null,"filters":null,"order":"C","shape":[1],"zarr_format":2}', 'step/0': '\x00\x00\x00\x00\x00\x00\x00\x00', 'step/.zattrs': '{"_ARRAY_DIMENSIONS":["step"],"long_name":"time since forecast_reference_time","standard_name":"forecast_period","units":"hours"}', 'time/.zarray': '{"chunks":[1],"compressor":null,"dtype":"<i8","fill_value":null,"filters":null,"order":"C","shape":[1],"zarr_format":2}', 'time/0': 'base64:AIJ3YAAAAAA=', 'time/.zattrs': '{"_ARRAY_DIMENSIONS":["time"],"calendar":"proleptic_gregorian","long_name":"initial time of forecast","standard_name":"forecast_reference_time","units":"seconds since 1970-01-01T00:00:00"}', 'valid_time/.zarray': '{"chunks":[1],"compressor":null,"dtype":"<i8","fill_value":null,"filters":null,"order":"C","shape":[1],"zarr_format":2}', 'valid_time/0': 'base64:AIJ3YAAAAAA=', 'valid_time/.zattrs': '{"_ARRAY_DIMENSIONS":["valid_time"],"calendar":"proleptic_gregorian","long_name":"time","standard_name":"time","units":"seconds since 1970-01-01T00:00:00"}'}, 'templates': {'u': './2021/anl/rap_252_20210415_0000_000.grb2'}}]
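
A quick way to confirm which arrays a set of references describes is to collect the keys that end in /.zarray. The following is a minimal sketch: variables_in_refs is a hypothetical helper name, and the sample dict is a trimmed stand-in for the output above (values elided), not real scan_grib output.

```python
def variables_in_refs(scans):
    """Return the sorted array names found in a list of kerchunk-style reference dicts."""
    names = set()
    for scan in scans:
        for key in scan.get("refs", {}):
            # Every zarr array contributes exactly one "<name>/.zarray" key.
            if key.endswith("/.zarray"):
                names.add(key[: -len("/.zarray")])
    return sorted(names)


# Trimmed stand-in for the scan_grib output shown above (metadata values elided).
sample = [{"version": 1, "refs": {
    ".zgroup": "{}",
    "u10/.zarray": "{}",
    "u10/0.0": ["{{u}}", 6501899, 70814],
    "latitude/.zarray": "{}",
    "longitude/.zarray": "{}",
}}]

print(variables_in_refs(sample))
```

Run against the real list returned by scan_grib, a check like this makes the missing v10 entry visible without reading the whole dict.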

However, when I use xarray/cfgrib, the v10 variable is present and accounted for:

ds = xr.open_dataset(gribfile, engine="cfgrib", backend_kwargs={'filter_by_keys': filter_stn10})
print(ds)
...
<xarray.Dataset>
Dimensions:            (y: 225, x: 301)
Coordinates:
    time               datetime64[ns] ...
    step               timedelta64[ns] ...
    heightAboveGround  float64 ...
    latitude           (y, x) float64 ...
    longitude          (y, x) float64 ...
    valid_time         datetime64[ns] ...
Dimensions without coordinates: y, x
Data variables:
    u10                (y, x) float32 ...
    v10                (y, x) float32 ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             kwbc
    GRIB_centreDescription:  US National Weather Service - NCEP
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             US National Weather Service - NCEP
    history:                 2023-09-08T16:40 GRIB to CDM+CF via cfgrib-0.9.1...

Output from wgrib2:

1:0:d=2021041500:REFC:entire atmosphere:anl:
2:26161:d=2021041500:VIS:surface:anl:
3:76622:d=2021041500:REFD:1000 m above ground:anl:
4:91602:d=2021041500:REFD:4000 m above ground:anl:
5:103208:d=2021041500:HGT:planetary boundary layer:anl:
6:200740:d=2021041500:GUST:surface:anl:
7:238099:d=2021041500:RETOP:entire atmosphere (considered as a single layer):anl:
8:267373:d=2021041500:HGT:100 mb:anl:
9:299862:d=2021041500:TMP:100 mb:anl:
10:316121:d=2021041500:RH:100 mb:anl:
11:327509:d=2021041500:VVEL:100 mb:anl:
12.1:335559:d=2021041500:UGRD:100 mb:anl:
12.2:335559:d=2021041500:VGRD:100 mb:anl:
13:374100:d=2021041500:HGT:125 mb:anl:
14:406398:d=2021041500:TMP:125 mb:anl:
15:423480:d=2021041500:RH:125 mb:anl:
16:436337:d=2021041500:VVEL:125 mb:anl:
17.1:446670:d=2021041500:UGRD:125 mb:anl:
17.2:446670:d=2021041500:VGRD:125 mb:anl:
18:486922:d=2021041500:HGT:150 mb:anl:
19:519400:d=2021041500:TMP:150 mb:anl:
20:537730:d=2021041500:RH:150 mb:anl:
21:553822:d=2021041500:VVEL:150 mb:anl:
22.1:565648:d=2021041500:UGRD:150 mb:anl:
22.2:565648:d=2021041500:VGRD:150 mb:anl:
23:608532:d=2021041500:HGT:175 mb:anl:
24:641616:d=2021041500:TMP:175 mb:anl:
25:660500:d=2021041500:RH:175 mb:anl:
26:681375:d=2021041500:VVEL:175 mb:anl:
27.1:694368:d=2021041500:UGRD:175 mb:anl:
27.2:694368:d=2021041500:VGRD:175 mb:anl:
28:738113:d=2021041500:HGT:200 mb:anl:
29:772347:d=2021041500:TMP:200 mb:anl:
30:790862:d=2021041500:RH:200 mb:anl:
31:817088:d=2021041500:VVEL:200 mb:anl:
32.1:831054:d=2021041500:UGRD:200 mb:anl:
32.2:831054:d=2021041500:VGRD:200 mb:anl:
33:876274:d=2021041500:HGT:225 mb:anl:
34:910505:d=2021041500:TMP:225 mb:anl:
35:928298:d=2021041500:RH:225 mb:anl:
36:958969:d=2021041500:VVEL:225 mb:anl:
37.1:974641:d=2021041500:UGRD:225 mb:anl:
37.2:974641:d=2021041500:VGRD:225 mb:anl:
38:1021922:d=2021041500:HGT:250 mb:anl:
39:1055864:d=2021041500:TMP:250 mb:anl:
40:1073010:d=2021041500:RH:250 mb:anl:
41:1107102:d=2021041500:VVEL:250 mb:anl:
42.1:1124398:d=2021041500:UGRD:250 mb:anl:
42.2:1124398:d=2021041500:VGRD:250 mb:anl:
43:1173425:d=2021041500:HGT:275 mb:anl:
44:1207228:d=2021041500:TMP:275 mb:anl:
45:1223656:d=2021041500:RH:275 mb:anl:
46:1258875:d=2021041500:VVEL:275 mb:anl:
47.1:1277338:d=2021041500:UGRD:275 mb:anl:
47.2:1277338:d=2021041500:VGRD:275 mb:anl:
48:1325698:d=2021041500:HGT:300 mb:anl:
49:1358872:d=2021041500:TMP:300 mb:anl:
50:1375309:d=2021041500:RH:300 mb:anl:
51:1412409:d=2021041500:VVEL:300 mb:anl:
52.1:1431873:d=2021041500:UGRD:300 mb:anl:
52.2:1431873:d=2021041500:VGRD:300 mb:anl:
53:1481013:d=2021041500:HGT:325 mb:anl:
54:1513807:d=2021041500:TMP:325 mb:anl:
55:1530216:d=2021041500:RH:325 mb:anl:
56:1568118:d=2021041500:VVEL:325 mb:anl:
57.1:1588258:d=2021041500:UGRD:325 mb:anl:
57.2:1588258:d=2021041500:VGRD:325 mb:anl:
58:1637414:d=2021041500:HGT:350 mb:anl:
59:1669796:d=2021041500:TMP:350 mb:anl:
60:1686108:d=2021041500:RH:350 mb:anl:
61:1724086:d=2021041500:VVEL:350 mb:anl:
62.1:1744937:d=2021041500:UGRD:350 mb:anl:
62.2:1744937:d=2021041500:VGRD:350 mb:anl:
63:1793360:d=2021041500:HGT:375 mb:anl:
64:1825237:d=2021041500:TMP:375 mb:anl:
65:1841628:d=2021041500:RH:375 mb:anl:
66:1879492:d=2021041500:VVEL:375 mb:anl:
67.1:1900565:d=2021041500:UGRD:375 mb:anl:
67.2:1900565:d=2021041500:VGRD:375 mb:anl:
68:1948191:d=2021041500:HGT:400 mb:anl:
69:1979803:d=2021041500:TMP:400 mb:anl:
70:1996163:d=2021041500:RH:400 mb:anl:
71:2033781:d=2021041500:VVEL:400 mb:anl:
72.1:2055147:d=2021041500:UGRD:400 mb:anl:
72.2:2055147:d=2021041500:VGRD:400 mb:anl:
73:2101958:d=2021041500:HGT:425 mb:anl:
74:2133393:d=2021041500:TMP:425 mb:anl:
75:2150030:d=2021041500:RH:425 mb:anl:
76:2187451:d=2021041500:VVEL:425 mb:anl:
77.1:2208831:d=2021041500:UGRD:425 mb:anl:
77.2:2208831:d=2021041500:VGRD:425 mb:anl:
78:2255515:d=2021041500:HGT:450 mb:anl:
79:2286390:d=2021041500:TMP:450 mb:anl:
80:2303182:d=2021041500:RH:450 mb:anl:
81:2341090:d=2021041500:VVEL:450 mb:anl:
82.1:2362485:d=2021041500:UGRD:450 mb:anl:
82.2:2362485:d=2021041500:VGRD:450 mb:anl:
83:2409552:d=2021041500:HGT:475 mb:anl:
84:2439985:d=2021041500:TMP:475 mb:anl:
85:2456849:d=2021041500:RH:475 mb:anl:
86:2494921:d=2021041500:VVEL:475 mb:anl:
87.1:2516387:d=2021041500:UGRD:475 mb:anl:
87.2:2516387:d=2021041500:VGRD:475 mb:anl:
88:2562680:d=2021041500:HGT:500 mb:anl:
89:2593117:d=2021041500:TMP:500 mb:anl:
90:2609777:d=2021041500:RH:500 mb:anl:
91:2647424:d=2021041500:VVEL:500 mb:anl:
92.1:2668671:d=2021041500:UGRD:500 mb:anl:
92.2:2668671:d=2021041500:VGRD:500 mb:anl:
93:2713923:d=2021041500:ABSV:500 mb:anl:
94:2746037:d=2021041500:HGT:525 mb:anl:
95:2776012:d=2021041500:TMP:525 mb:anl:
96:2792992:d=2021041500:RH:525 mb:anl:
97:2831347:d=2021041500:VVEL:525 mb:anl:
98.1:2852812:d=2021041500:UGRD:525 mb:anl:
98.2:2852812:d=2021041500:VGRD:525 mb:anl:
99:2898821:d=2021041500:HGT:550 mb:anl:
100:2928968:d=2021041500:TMP:550 mb:anl:
101:2945598:d=2021041500:RH:550 mb:anl:
102:2983308:d=2021041500:VVEL:550 mb:anl:
103.1:3004714:d=2021041500:UGRD:550 mb:anl:
103.2:3004714:d=2021041500:VGRD:550 mb:anl:
104:3049897:d=2021041500:HGT:575 mb:anl:
105:3079709:d=2021041500:TMP:575 mb:anl:
106:3096752:d=2021041500:RH:575 mb:anl:
107:3135210:d=2021041500:VVEL:575 mb:anl:
108.1:3156734:d=2021041500:UGRD:575 mb:anl:
108.2:3156734:d=2021041500:VGRD:575 mb:anl:
109:3202819:d=2021041500:HGT:600 mb:anl:
110:3232685:d=2021041500:TMP:600 mb:anl:
111:3249719:d=2021041500:RH:600 mb:anl:
112:3287641:d=2021041500:VVEL:600 mb:anl:
113.1:3309175:d=2021041500:UGRD:600 mb:anl:
113.2:3309175:d=2021041500:VGRD:600 mb:anl:
114:3354562:d=2021041500:HGT:625 mb:anl:
115:3384078:d=2021041500:TMP:625 mb:anl:
116:3401681:d=2021041500:RH:625 mb:anl:
117:3440059:d=2021041500:VVEL:625 mb:anl:
118.1:3461817:d=2021041500:UGRD:625 mb:anl:
118.2:3461817:d=2021041500:VGRD:625 mb:anl:
119:3508232:d=2021041500:HGT:650 mb:anl:
120:3537878:d=2021041500:TMP:650 mb:anl:
121:3555657:d=2021041500:RH:650 mb:anl:
122:3593791:d=2021041500:VVEL:650 mb:anl:
123.1:3615582:d=2021041500:UGRD:650 mb:anl:
123.2:3615582:d=2021041500:VGRD:650 mb:anl:
124:3661703:d=2021041500:HGT:675 mb:anl:
125:3691143:d=2021041500:TMP:675 mb:anl:
126:3709488:d=2021041500:RH:675 mb:anl:
127:3748175:d=2021041500:VVEL:675 mb:anl:
128.1:3770420:d=2021041500:UGRD:675 mb:anl:
128.2:3770420:d=2021041500:VGRD:675 mb:anl:
129:3817288:d=2021041500:HGT:700 mb:anl:
130:3846777:d=2021041500:TMP:700 mb:anl:
131:3865409:d=2021041500:RH:700 mb:anl:
132:3903940:d=2021041500:VVEL:700 mb:anl:
133.1:3926267:d=2021041500:UGRD:700 mb:anl:
133.2:3926267:d=2021041500:VGRD:700 mb:anl:
134:3973264:d=2021041500:HGT:725 mb:anl:
135:4002869:d=2021041500:TMP:725 mb:anl:
136:4021904:d=2021041500:RH:725 mb:anl:
137:4060819:d=2021041500:VVEL:725 mb:anl:
138.1:4083328:d=2021041500:UGRD:725 mb:anl:
138.2:4083328:d=2021041500:VGRD:725 mb:anl:
139:4130998:d=2021041500:HGT:750 mb:anl:
140:4160663:d=2021041500:TMP:750 mb:anl:
141:4180313:d=2021041500:RH:750 mb:anl:
142:4219649:d=2021041500:VVEL:750 mb:anl:
143.1:4242537:d=2021041500:UGRD:750 mb:anl:
143.2:4242537:d=2021041500:VGRD:750 mb:anl:
144:4290730:d=2021041500:HGT:775 mb:anl:
145:4320791:d=2021041500:TMP:775 mb:anl:
146:4340824:d=2021041500:RH:775 mb:anl:
147:4380169:d=2021041500:VVEL:775 mb:anl:
148.1:4403481:d=2021041500:UGRD:775 mb:anl:
148.2:4403481:d=2021041500:VGRD:775 mb:anl:
149:4451919:d=2021041500:HGT:800 mb:anl:
150:4482100:d=2021041500:TMP:800 mb:anl:
151:4502707:d=2021041500:RH:800 mb:anl:
152:4542641:d=2021041500:VVEL:800 mb:anl:
153.1:4566246:d=2021041500:UGRD:800 mb:anl:
153.2:4566246:d=2021041500:VGRD:800 mb:anl:
154:4615301:d=2021041500:HGT:825 mb:anl:
155:4645730:d=2021041500:TMP:825 mb:anl:
156:4666719:d=2021041500:RH:825 mb:anl:
157:4706917:d=2021041500:VVEL:825 mb:anl:
158.1:4730846:d=2021041500:UGRD:825 mb:anl:
158.2:4730846:d=2021041500:VGRD:825 mb:anl:
159:4779668:d=2021041500:HGT:850 mb:anl:
160:4810539:d=2021041500:TMP:850 mb:anl:
161:4832050:d=2021041500:RH:850 mb:anl:
162:4873044:d=2021041500:VVEL:850 mb:anl:
163.1:4897355:d=2021041500:UGRD:850 mb:anl:
163.2:4897355:d=2021041500:VGRD:850 mb:anl:
164:4945786:d=2021041500:HGT:875 mb:anl:
165:4977212:d=2021041500:TMP:875 mb:anl:
166:4999438:d=2021041500:RH:875 mb:anl:
167:5041862:d=2021041500:VVEL:875 mb:anl:
168.1:5066347:d=2021041500:UGRD:875 mb:anl:
168.2:5066347:d=2021041500:VGRD:875 mb:anl:
169:5114561:d=2021041500:HGT:900 mb:anl:
170:5146580:d=2021041500:TMP:900 mb:anl:
171:5169539:d=2021041500:RH:900 mb:anl:
172:5212900:d=2021041500:VVEL:900 mb:anl:
173.1:5237345:d=2021041500:UGRD:900 mb:anl:
173.2:5237345:d=2021041500:VGRD:900 mb:anl:
174:5285710:d=2021041500:HGT:925 mb:anl:
175:5318772:d=2021041500:TMP:925 mb:anl:
176:5342361:d=2021041500:RH:925 mb:anl:
177:5385536:d=2021041500:VVEL:925 mb:anl:
178.1:5409454:d=2021041500:UGRD:925 mb:anl:
178.2:5409454:d=2021041500:VGRD:925 mb:anl:
179:5458125:d=2021041500:HGT:950 mb:anl:
180:5492552:d=2021041500:TMP:950 mb:anl:
181:5516493:d=2021041500:RH:950 mb:anl:
182:5559001:d=2021041500:VVEL:950 mb:anl:
183.1:5581810:d=2021041500:UGRD:950 mb:anl:
183.2:5581810:d=2021041500:VGRD:950 mb:anl:
184:5630669:d=2021041500:HINDEX:surface:anl:
185:5648457:d=2021041500:HGT:975 mb:anl:
186:5684345:d=2021041500:TMP:975 mb:anl:
187:5708317:d=2021041500:RH:975 mb:anl:
188:5750512:d=2021041500:VVEL:975 mb:anl:
189.1:5771207:d=2021041500:UGRD:975 mb:anl:
189.2:5771207:d=2021041500:VGRD:975 mb:anl:
190:5820199:d=2021041500:TMP:1000 mb:anl:
191:5844251:d=2021041500:RH:1000 mb:anl:
192:5885460:d=2021041500:VVEL:1000 mb:anl:
193.1:5903388:d=2021041500:UGRD:1000 mb:anl:
193.2:5903388:d=2021041500:VGRD:1000 mb:anl:
194:5951381:d=2021041500:MSLMA:mean sea level:anl:
195:5973901:d=2021041500:HGT:1000 mb:anl:
196:6005539:d=2021041500:PRES:surface:anl:
197:6032368:d=2021041500:HGT:surface:anl:
198:6093540:d=2021041500:TMP:surface:anl:
199:6136917:d=2021041500:ASNOW:surface:0-0 day acc fcst:
200:6141452:d=2021041500:MSTAV:0 m underground:anl:
201:6185929:d=2021041500:WEASD:surface:anl:
202:6207731:d=2021041500:SNOD:surface:anl:
203:6226548:d=2021041500:TMP:2 m above ground:anl:
204:6264127:d=2021041500:POT:2 m above ground:anl:
205:6298296:d=2021041500:SPFH:2 m above ground:anl:
206:6348538:d=2021041500:DPT:2 m above ground:anl:
207:6389047:d=2021041500:DEPR:2 m above ground:anl:
208:6432553:d=2021041500:EPOT:surface:anl:
209:6474355:d=2021041500:RH:2 m above ground:anl:
210.1:6501899:d=2021041500:UGRD:10 m above ground:anl:
210.2:6501899:d=2021041500:VGRD:10 m above ground:anl:
211:6572713:d=2021041500:PRATE:surface:anl:
212:6578933:d=2021041500:APCP:surface:0-0 day acc fcst:
213:6579147:d=2021041500:ACPCP:surface:0-0 day acc fcst:
214:6579361:d=2021041500:WEASD:surface:0-0 day acc fcst:
215:6579575:d=2021041500:FROZR:surface:0-0 day acc fcst:
216:6579789:d=2021041500:FRZR:surface:0-0 day acc fcst:
217:6585662:d=2021041500:SSRUN:surface:0-0 day acc fcst:
218:6585876:d=2021041500:BGRUN:surface:0-0 day acc fcst:
219:6586090:d=2021041500:HGT:lowest level of the wet bulb zero:anl:
220:6644782:d=2021041500:CSNOW:surface:anl:
221:6645323:d=2021041500:CICEP:surface:anl:
222:6645712:d=2021041500:CFRZR:surface:anl:
223:6646316:d=2021041500:CRAIN:surface:anl:
224:6648381:d=2021041500:LFTX:500-1000 mb:anl:
225:6679044:d=2021041500:CAPE:surface:anl:
226:6693054:d=2021041500:CIN:surface:anl:
227:6713470:d=2021041500:PWAT:entire atmosphere (considered as a single layer):anl:
228:6744526:d=2021041500:LCDC:low cloud layer:anl:
229:6777549:d=2021041500:MCDC:middle cloud layer:anl:
230:6801650:d=2021041500:HCDC:high cloud layer:anl:
231:6825235:d=2021041500:TCDC:entire atmosphere:anl:
232:6861883:d=2021041500:HGT:convective cloud top level:anl:
233:6870105:d=2021041500:HGT:cloud base:anl:
234:6953196:d=2021041500:HGT:cloud top:anl:
235:7003356:d=2021041500:HLCY:3000-0 m above ground:anl:
236:7021427:d=2021041500:HLCY:1000-0 m above ground:anl:
237.1:7056753:d=2021041500:USTM:0-6000 m above ground:anl:
237.2:7056753:d=2021041500:VSTM:0-6000 m above ground:anl:
238.1:7123670:d=2021041500:VUCSH:0-6000 m above ground:anl:
238.2:7123670:d=2021041500:VVCSH:0-6000 m above ground:anl:
239:7203828:d=2021041500:PRES:tropopause:anl:
240:7229619:d=2021041500:TMP:tropopause:anl:
241:7253016:d=2021041500:POT:tropopause:anl:
242.1:7270630:d=2021041500:UGRD:tropopause:anl:
242.2:7270630:d=2021041500:VGRD:tropopause:anl:
243:7343782:d=2021041500:PRES:max wind:anl:
244.1:7377534:d=2021041500:UGRD:max wind:anl:
244.2:7377534:d=2021041500:VGRD:max wind:anl:
245:7454060:d=2021041500:TMP:80 m above ground:anl:
246:7485089:d=2021041500:SPFH:80 m above ground:anl:
247:7533905:d=2021041500:PRES:80 m above ground:anl:
248.1:7560656:d=2021041500:UGRD:80 m above ground:anl:
248.2:7560656:d=2021041500:VGRD:80 m above ground:anl:
249:7623695:d=2021041500:HGT:0C isotherm:anl:
250:7649318:d=2021041500:RH:0C isotherm:anl:
251:7675611:d=2021041500:PRES:0C isotherm:anl:
252:7701235:d=2021041500:HGT:highest tropospheric freezing level:anl:
253:7726687:d=2021041500:RH:highest tropospheric freezing level:anl:
254:7752641:d=2021041500:PRES:highest tropospheric freezing level:anl:
255:7777885:d=2021041500:TMP:30-0 mb above ground:anl:
256:7808855:d=2021041500:RH:30-0 mb above ground:anl:
257.1:7835801:d=2021041500:UGRD:30-0 mb above ground:anl:
257.2:7835801:d=2021041500:VGRD:30-0 mb above ground:anl:
258:7898095:d=2021041500:VVEL:30-0 mb above ground:anl:
259:7929464:d=2021041500:TMP:60-30 mb above ground:anl:
260:7960051:d=2021041500:RH:60-30 mb above ground:anl:
261.1:7988231:d=2021041500:UGRD:60-30 mb above ground:anl:
261.2:7988231:d=2021041500:VGRD:60-30 mb above ground:anl:
262:8048913:d=2021041500:VVEL:60-30 mb above ground:anl:
263:8086174:d=2021041500:TMP:90-60 mb above ground:anl:
264:8117502:d=2021041500:RH:90-60 mb above ground:anl:
265.1:8147117:d=2021041500:UGRD:90-60 mb above ground:anl:
265.2:8147117:d=2021041500:VGRD:90-60 mb above ground:anl:
266:8207930:d=2021041500:VVEL:90-60 mb above ground:anl:
267:8247025:d=2021041500:TMP:120-90 mb above ground:anl:
268:8278796:d=2021041500:RH:120-90 mb above ground:anl:
269.1:8310264:d=2021041500:UGRD:120-90 mb above ground:anl:
269.2:8310264:d=2021041500:VGRD:120-90 mb above ground:anl:
270:8371462:d=2021041500:VVEL:120-90 mb above ground:anl:
271:8410645:d=2021041500:TMP:150-120 mb above ground:anl:
272:8441827:d=2021041500:RH:150-120 mb above ground:anl:
273.1:8473432:d=2021041500:UGRD:150-120 mb above ground:anl:
273.2:8473432:d=2021041500:VGRD:150-120 mb above ground:anl:
274:8535142:d=2021041500:VVEL:150-120 mb above ground:anl:
275:8573559:d=2021041500:TMP:180-150 mb above ground:anl:
276:8604398:d=2021041500:RH:180-150 mb above ground:anl:
277.1:8634622:d=2021041500:UGRD:180-150 mb above ground:anl:
277.2:8634622:d=2021041500:VGRD:180-150 mb above ground:anl:
278:8697064:d=2021041500:VVEL:180-150 mb above ground:anl:
279:8734600:d=2021041500:4LFTX:180-0 mb above ground:anl:
280:8767961:d=2021041500:CAPE:180-0 mb above ground:anl:
281:8783098:d=2021041500:CIN:180-0 mb above ground:anl:
282:8803052:d=2021041500:HPBL:surface:anl:
283:8900795:d=2021041500:CAPE:90-0 mb above ground:anl:
284:8911585:d=2021041500:CIN:90-0 mb above ground:anl:
285:8919358:d=2021041500:CAPE:255-0 mb above ground:anl:
286:8934973:d=2021041500:CIN:255-0 mb above ground:anl:
287:8955339:d=2021041500:HGT:equilibrium level:anl:
288:9036520:d=2021041500:PLPL:255-0 mb above ground:anl:
289:9080430:d=2021041500:LTNG:surface:anl:
290:9081210:d=2021041500:RHPW:entire atmosphere:anl:
291:9102709:d=2021041500:SBT123:top of atmosphere:anl:
292:9151461:d=2021041500:SBT124:top of atmosphere:anl:
293:9216577:d=2021041500:SBT113:top of atmosphere:anl:
294:9261248:d=2021041500:SBT114:top of atmosphere:anl:
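
In the inventory above, records with a fractional ID such as 210.1 and 210.2 share a single byte offset. A small sketch for picking those out of wgrib2 short-inventory text follows; submessage_groups is a hypothetical helper, and it assumes the colon-separated layout shown above (ID, offset, date, name, level).

```python
from collections import defaultdict


def submessage_groups(inventory_text):
    """Group wgrib2 inventory lines by byte offset; keep offsets shared by 2+ fields."""
    by_offset = defaultdict(list)
    for line in inventory_text.splitlines():
        parts = line.strip().split(":")
        if len(parts) < 5:
            continue  # skip blank or malformed lines
        record_id, offset, name, level = parts[0], int(parts[1]), parts[3], parts[4]
        by_offset[offset].append((record_id, name, level))
    return {off: fields for off, fields in by_offset.items() if len(fields) > 1}


# Four lines copied from the inventory above.
inventory = """\
209:6474355:d=2021041500:RH:2 m above ground:anl:
210.1:6501899:d=2021041500:UGRD:10 m above ground:anl:
210.2:6501899:d=2021041500:VGRD:10 m above ground:anl:
211:6572713:d=2021041500:PRATE:surface:anl:
"""

for offset, fields in submessage_groups(inventory).items():
    print(offset, fields)
```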

Any suggestions on what I may be doing wrong, or where the issue might lie? Happy to make a PR if there's a bug, just not sure if 1) there is one and 2) where to start addressing it.

keltonhalbert commented 1 year ago

If it's helpful, I uploaded the grib2 file to a google drive. It's just under 10 MB in size.

keltonhalbert commented 1 year ago

Sorry to ping again - I know I posted the issue on a Friday, so my apologies there. Just wanting to know if there are any ideas about where and how kerchunk is missing a variable in the dataset.

martindurant commented 1 year ago

I see

210.1:6501899:d=2021041500:UGRD:10 m above ground:anl:
210.2:6501899:d=2021041500:VGRD:10 m above ground:anl:

but kerchunk is only matching one grib message. I'm not certain what the codes in those two lines mean, but I wonder whether they are components of the SAME message - there are indeed 294 messages in the file. Investigating...

martindurant commented 1 year ago

I fear I may need a GRIB expert to figure this one out. The values returned in variable u10 are indeed the same values as found by xarray/cfgrib. The offset and message size are correct, but I don't know how the other component of the vector in the same grib message is found.
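
The offset and size can be cross-checked without ecCodes. In GRIB edition 2, each message starts with a 16-byte indicator section: the bytes "GRIB", two reserved bytes, the discipline, the edition number, and the total message length as a big-endian 64-bit integer. The sketch below walks a buffer using only that header. grib2_message_spans and fake_message are hypothetical names, and the demo buffer is synthetic rather than real GRIB data.

```python
import struct


def grib2_message_spans(buf):
    """Yield (offset, length) for each GRIB edition 2 message found in a bytes buffer."""
    pos = 0
    while True:
        pos = buf.find(b"GRIB", pos)
        if pos < 0 or pos + 16 > len(buf):
            return
        edition = buf[pos + 7]
        (length,) = struct.unpack(">Q", buf[pos + 8 : pos + 16])
        if edition != 2 or length < 16:
            pos += 4  # not a plausible edition 2 header; keep searching
            continue
        yield pos, length
        pos += length


def fake_message(payload):
    """Build a synthetic message: indicator section + payload + end marker."""
    total = 16 + len(payload) + 4
    return b"GRIB" + b"\x00\x00" + bytes([0, 2]) + struct.pack(">Q", total) + payload + b"7777"


buf = fake_message(b"a" * 10) + fake_message(b"b" * 20)
print(list(grib2_message_spans(buf)))
```

For the file in this issue, the reference for u10 gives offset 6501899 and size 70814, and 6501899 + 70814 = 6572713, which is the offset wgrib2 reports for the next record (211, PRATE). So the single message spans both the U and V fields.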

keltonhalbert commented 1 year ago

Thanks for taking a look at this, I really appreciate it! I'm by no means a grib expert, but I'm trying to track down what makes this encoding different. So far I've stumbled across some grib documentation that mentions velocity fields encoded in sub-messages...

I haven't made much progress beyond this, but I'm trying to poke around more documentation with grib2 and cfgrib to see how this is handled. I'll report back if I find anything else useful.
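
For orientation, GRIB edition 2 allows sections 2 through 7 to repeat inside one message, and each repeated data section (section 7) is one field. After the 16-byte indicator section, every section starts with a 4-byte big-endian length and a 1-byte section number, and the message ends with "7777". The sketch below counts data sections using only that layout. count_fields, section and build are hypothetical helpers, and the two demo messages are synthetic with empty section bodies.

```python
import struct


def count_fields(message):
    """Count data sections (section 7) in one GRIB edition 2 message."""
    if message[:4] != b"GRIB" or len(message) < 16 or message[7] != 2:
        raise ValueError("not a GRIB edition 2 message")
    pos = 16  # the indicator section (section 0) is always 16 bytes
    fields = 0
    while pos + 5 <= len(message) and message[pos : pos + 4] != b"7777":
        (length,) = struct.unpack(">I", message[pos : pos + 4])
        number = message[pos + 4]
        if length < 5:
            raise ValueError("corrupt section length")
        if number == 7:
            fields += 1
        pos += length
    return fields


def section(number, body=b""):
    """Synthetic section: 4-byte length, 1-byte section number, then the body."""
    return struct.pack(">IB", 5 + len(body), number) + body


def build(sections):
    body = b"".join(sections) + b"7777"
    return b"GRIB" + b"\x00\x00" + bytes([0, 2]) + struct.pack(">Q", 16 + len(body)) + body


single = build([section(n) for n in (1, 3, 4, 5, 6, 7)])
double = build([section(n) for n in (1, 3, 4, 5, 6, 7, 4, 5, 6, 7)])
print(count_fields(single), count_fields(double))
```

A count above 1 for the bytes at a given offset would indicate a multi-field message like record 210 in the inventory.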

martindurant commented 1 year ago

That does seem like it's talking about the same thing, but I don't see how this is handled in cfgrib/eccodes.

(cf. https://github.com/noritada/grib-rs/issues/13)

keltonhalbert commented 1 year ago

I may have at least narrowed down a rough idea of where in cfgrib this happens, but as far as how to activate/enable it, I'm still trying to untangle.

In eccodes/cfgrib, this is referred to as a multi-field grib file. Within cfgrib/messages.py, there are some functions and conditionals that interact with multi_enabled - presumably, this is how cfgrib/xarray manages supporting this feature. Still trying to reckon with exactly where and how this should be called/enabled/detected by kerchunk.

keltonhalbert commented 1 year ago

Alright, I made some progress here. I have no idea if this is the right way to go about this, but I basically used the logic of the FileStream class in cfgrib.messages, and I can successfully print out all variables.

from cfgrib import messages


def main():
    grbfile = "./rap_252_20210415_0000_000.grb2"
    fstream = messages.FileStream(grbfile)

    with open(grbfile, "rb") as f:
        # Multi-field support must be on for the sub-messages to be returned.
        with messages.multi_enabled(f):
            valid_message_found = False
            while True:
                try:
                    msg = fstream.message_from_file(f)
                    print(msg["cfVarName"])
                    valid_message_found = True
                except EOFError:
                    if not valid_message_found:
                        raise EOFError("No valid message found")
                    break


if __name__ == "__main__":
    main()

I'll spare dumping the whole output, but now I see v/v10 variables in the output!

hindex
gh
t
r
w
u
v
t
r
w
u
v
unknown
gh
sp
orog
t
unknown
unknown
sdwe
sde
t2m
pt
sh2
d2m
unknown
papt
r2
u10
v10
prate
unknown
acpcp
sdwe
unknown
unknown
ssrun
bgrun

So, it seems like the weird quirk of multi_enabled is required for cfgrib to handle the parsing of those messages. I'm not sure if you guys would prefer trying to get to an even lower level of handling this, and I don't know if the caveats in the documentation are deal breakers:

#
# MULTI-FIELD support is very tricky. Random access via the index needs multi support to be off.
#
...
#
    # Explicitly reset the multi_support global state that gets confused by random access
    #
    # @alexamici: I'm note sure this is thread-safe. See :#141
    #

However, if there are no objections to changing the iteration logic of scan_grib, I can modify it to follow this procedure and make a pull request.

martindurant commented 1 year ago

if there are no objections to changing the iteration logic of scan_grib

I am not opposed to that, except that I had hoped to only need eccodes and not also cfgrib during the access phase. Maybe that doesn't matter, so long as we can still interpret a block of bytes as a message.

The GRIB codec would need to be updated to do something similar. The codec reads whole messages at a time, because we don't know how to decode the interior buffers of a GRIB (sub)message unfortunately. That would mean that we need to tell the codec which submessage we want - assuming this is consistent across all input files/messages - and would end up temporarily loading the bytes for both variables when trying to access either one.
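
To make the trade-off concrete, here is a sketch of what references for two variables sharing one message might look like. The "field" key in the filter config is HYPOTHETICAL: the grib filter shown earlier in this thread has only "id", "var" and "dtype", and nothing here is an existing kerchunk API. submessage_refs is likewise an invented helper name.

```python
import json


def submessage_refs(url, offset, length, var_names, shape):
    """Build kerchunk-style refs where several arrays share one GRIB message."""
    refs = {}
    for field_index, var in enumerate(var_names):
        zarray = {
            "chunks": list(shape),
            "compressor": None,
            "dtype": "<f8",
            "fill_value": None,
            # "field" is a hypothetical key telling the codec which sub-message to decode.
            "filters": [{"id": "grib", "var": var, "dtype": "float64", "field": field_index}],
            "order": "C",
            "shape": list(shape),
            "zarr_format": 2,
        }
        refs[f"{var}/.zarray"] = json.dumps(zarray)
        # Every variable points at the same byte range, so reading either one
        # fetches the bytes of the whole message.
        refs[f"{var}/0.0"] = [url, offset, length]
    return refs


refs = submessage_refs("{{u}}", 6501899, 70814, ["u10", "v10"], (225, 301))
print(refs["u10/0.0"] == refs["v10/0.0"])
```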

martindurant commented 1 year ago

MULTI-FIELD support is very tricky. Random access via the index needs multi support to be off.

is interesting, because kerchunk essentially has its own index and always does random access. As above, if only the "where is the buffer" and "decode this bytes buffer" were accessible calls in the eccodes API, life would be much easier for everyone.

keltonhalbert commented 1 year ago

If the desire is to keep things purely in terms of eccodes, then I'll double check and see how doable that is before moving forward with the cfgrib parser. Most of the calls under the hood are to eccodes, so it may be possible to rig it to work directly. This was kind of my brute-force "make it work" approach. Let me see what I can do.
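
One possible shape for an eccodes-only scan, sketched under assumptions: switch multi-field support on, read sequentially with codes_grib_new_from_file until it returns None, and always switch support off again. The ecCodes calls are the same ones cfgrib's messages module makes. The function name list_field_names and its codes parameter are hypothetical; the parameter exists only so the loop can be exercised with a stand-in object when ecCodes is not installed. This has not been run against the RAP file from this issue.

```python
def list_field_names(path, codes=None):
    """Return the shortName of every field, including sub-messages, in a GRIB file."""
    if codes is None:
        import eccodes as codes  # imported lazily; requires the ecCodes bindings

    names = []
    with open(path, "rb") as f:
        codes.codes_grib_multi_support_on()
        try:
            # Start from a clean multi-field state for this file handle.
            codes.codes_grib_multi_support_reset_file(f)
            while True:
                gid = codes.codes_grib_new_from_file(f)
                if gid is None:  # end of file
                    break
                try:
                    names.append(codes.codes_get(gid, "shortName"))
                finally:
                    codes.codes_release(gid)
        finally:
            # Multi-field support is global state, so always restore it.
            codes.codes_grib_multi_support_off()
    return names


# Example call (needs ecCodes and the file from this issue):
# names = list_field_names("./rap_252_20210415_0000_000.grb2")
```

Sequential reading sidesteps the random-access caveat quoted above, but it only covers the scanning phase; the decoding side would still need its own answer.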

imcslatte commented 6 months ago

Has there been any progress on this issue? I'm running into the same problem with data in the NAM S3 bucket. I'm unable to see the v10 data using kerchunk, but the u10 message seems okay.

Eli

martindurant commented 6 months ago

@emfdavid, I don't suppose you've come across this kind of thing in your grib travels?

keltonhalbert commented 6 months ago

Hi @imcslatte - apologies for the slow response. Unfortunately due to time constraints and other responsibilities, I wasn't able to revisit this problem in order to fix it using the eccodes API directly. I have a temporary fix listed above that uses the cfgrib API, which is built on top of eccodes.

I might be able to find some time in the next week or two to investigate fixing this directly from eccodes, now that I've been reminded of this... as I also have some upcoming projects with RAP data where having this issue fixed will be helpful :). If you want to implement the stopgap fix locally in your environment, I'd be happy to help.

emfdavid commented 6 months ago

I have not seen sub messages in the HRRR or GFS/GEFS grib2 files. Yet another new wrinkle! My adventures in parsing grib files are documented here.

imcslatte commented 6 months ago

@keltonhalbert @martindurant I was wondering if part of the problem might be that the submessage IDs are decimals and not integers? If they were cast to integers, the IDs 210.1 and 210.2 would both become 210.

Just a thought.

rsignell commented 6 months ago

@imcslatte, ooh, so

274:8535142:d=2021041500:VVEL:150-120 mb above ground:anl:
275:8573559:d=2021041500:TMP:180-150 mb above ground:anl:
276:8604398:d=2021041500:RH:180-150 mb above ground:anl:
277.1:8634622:d=2021041500:UGRD:180-150 mb above ground:anl:     <== this gets handled
277.2:8634622:d=2021041500:VGRD:180-150 mb above ground:anl:     <== this gets dropped
278:8697064:d=2021041500:VVEL:180-150 mb above ground:anl:
279:8734600:d=2021041500:4LFTX:180-0 mb above ground:anl:
280:8767961:d=2021041500:CAPE:180-0 mb above ground:anl:

Hoping this is the simple problem, @martindurant!

martindurant commented 6 months ago

I am not sure that we make any use of the ID value. I think the problem is that the two arrays are bundled in the same grib message, and I don't know how to tell the cfgrib API "load the second sub-message". If I had a spare year, perhaps I could dig into the internals of grib to understand this, but for now I'll have to rely on people like @mpiannucci!

keltonhalbert commented 6 months ago

Yeah, the main problem is how GRIB sub-messages get handled. Per the eccodes documentation, sub-messages are not a recommended feature, but NCEP has been using them for RAP and NAM products (not HRRR or GFS, though) for a while.

The reason it works in cfgrib but not kerchunk is that cfgrib incorporates special eccodes logic to account for GRIB sub-messages, whereas kerchunk does not implement that logic. As far as I can tell, the bulk of the cfgrib calls to eccodes that kerchunk needs to duplicate are contained within the messages file.

#
# MULTI-FIELD support is very tricky. Random access via the index needs multi support to be off.
#
eccodes.codes_grib_multi_support_off()

@contextlib.contextmanager
def multi_enabled(file: T.IO[bytes]) -> T.Iterator[None]:
    """Context manager that enables MULTI-FIELD support in ecCodes from a clean state"""
    eccodes.codes_grib_multi_support_on()
    #
    # Explicitly reset the multi_support global state that gets confused by random access
    #
    # @alexamici: I'm note sure this is thread-safe. See :#141
    #
    eccodes.codes_grib_multi_support_reset_file(file)
    try:
        yield
    except Exception:
        eccodes.codes_grib_multi_support_off()
        raise
    eccodes.codes_grib_multi_support_off()

The multi_enabled function gets called from Message.from_file whenever a field other than the first one in a message is requested:

@attr.attrs(auto_attribs=True)
class Message(abc.MutableField):
    """Dictionary-line interface to access Message headers."""

    codes_id: int
    encoding: str = "ascii"
    errors: str = attr.attrib(
        default="warn", validator=attr.validators.in_(["ignore", "warn", "raise"])
    )

    @classmethod
    def from_file(cls, file, offset=None, **kwargs):
        # type: (T.IO[bytes], T.Optional[OffsetType], T.Any) -> Message
        field_in_message = 0
        if isinstance(offset, tuple):
            offset, field_in_message = offset
        if offset is not None:
            file.seek(offset)
        codes_id = None
        if field_in_message == 0:
            codes_id = eccodes.codes_grib_new_from_file(file)
        else:
            # MULTI-FIELD is enabled only when accessing additional fields
            with multi_enabled(file):
                for _ in range(field_in_message + 1):
                    codes_id = eccodes.codes_grib_new_from_file(file)

        if codes_id is None:
            raise EOFError("End of file: %r" % file)
        return cls(codes_id=codes_id, **kwargs)

Of particular note is checking if field_in_message == 0, which appears to come from wherever the offset tuple is passed from.
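
To make the offset handling above concrete, here is a minimal sketch of just the unpacking step, with no ecCodes calls. The names `OffsetLike` and `split_offset` are hypothetical and are not part of cfgrib, and the numbers are only illustrative.

```python
from typing import Optional, Tuple, Union

# Hypothetical alias for illustration; cfgrib defines its own OffsetType.
OffsetLike = Union[int, Tuple[int, int]]


def split_offset(offset: Optional[OffsetLike]) -> Tuple[Optional[int], int]:
    """Return (byte_offset, field_in_message), mirroring the unpacking in from_file."""
    field_in_message = 0
    if isinstance(offset, tuple):
        # A tuple carries the byte offset plus the index of the field inside the message.
        offset, field_in_message = offset
    return offset, field_in_message


plain = split_offset(6501899)        # ordinary message, field 0
bundled = split_offset((6501899, 1))  # same byte offset, second field in the message
print(plain, bundled)
```

In other words, a non-zero second element is what routes `from_file` into the `multi_enabled` branch.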

Now, as far as addressing this issue, my understanding is the following:

  1. The desire is to keep eccodes as the only dependency, and not to build on top of cfgrib. This should be possible, but means figuring out when and where to call the equivalent of multi_enabled within kerchunk. Presumably, that starts with coming up with a means of detecting the presence of multi-field grib messages from the eccodes API, and then inserting it accordingly.
  2. Multi-field support may have unintended consequences to random access, and per code comments, may not be thread safe. Since kerchunk offers concurrent, asynchronous fetching of remote data, my concern is that enabling multi_field support somehow could violate that contract.

I feel pretty confident in being able to come up with a solution, but not confident that it'll be the correct solution. I guess the only way to know for sure is to try something in a fork of the repository. So, my question becomes, do the developers/maintainers have any input before diving head first into a naive solution?

TAdeJong commented 4 months ago

As another potentially helpful datapoint: I tried using kerchunk on ERA5 single level grib files and encounter the same issue (downloaded from ECMWF (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels) to our network share, but still painfully slow to load, so I thought I would give kerchunk a try).

Here, the grib file contains 4 wind variables, u100, v100, u10 and v10, of which only u100 is picked up by kerchunk.

Furthermore, the file contains a time dimension of 24 hours, which is recognized by cfgrib, but not by kerchunk: 'time/.zarray': '{"chunks":[],"compressor":null,"dtype":"<i8","fill_value":null,"filters":null,"order":"C","shape":[],"zarr_format":2}',. This might be related to https://github.com/fsspec/kerchunk/issues/150, although here time is missing instead of steps as this is a reanalysis product.
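
A quick way to spot this symptom in a reference set is to look for arrays whose `.zarray` metadata declares an empty shape. This is a minimal sketch assuming `refs` is the `"refs"` dictionary of one scan_grib result; `scalar_arrays` is a hypothetical helper and the `u100` entry is made up for contrast.

```python
import json


def scalar_arrays(refs: dict) -> list:
    """Names of arrays whose .zarray metadata declares an empty (scalar) shape."""
    names = []
    for key, value in refs.items():
        if key.endswith("/.zarray") and isinstance(value, str):
            if json.loads(value)["shape"] == []:
                names.append(key.split("/")[0])
    return names


refs = {
    # Copied from the kerchunk output quoted above.
    "time/.zarray": '{"chunks":[],"compressor":null,"dtype":"<i8","fill_value":null,'
                    '"filters":null,"order":"C","shape":[],"zarr_format":2}',
    # Made-up entry, only here to show a non-scalar array.
    "u100/.zarray": '{"chunks":[721,1440],"shape":[721,1440],"zarr_format":2}',
}
print(scalar_arrays(refs))
```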

I would love to help, but unfortunately, I am also not a grib expert. Also happy to open a separate issue if that helps.

martindurant commented 4 months ago

I am also not a grib expert

Is anyone? :)

c-fjord commented 1 month ago

@TAdeJong I am currently facing the same issue where Kerchunk is not able to "detect" the 24 different hours within the file. Have you found a solution to the problem?

martindurant commented 1 month ago

@mpiannucci - have you had any chance to work with sub-messages?

mpiannucci commented 1 month ago

I am pretty sure that the Copernicus climate store exports single level data in a grib1 file. When I downloaded a 24 hour file with u100, v100, u10, v10, gribberish couldn't even read it, because the file doesn't contain all the grib2 sections when scanning through. I may be wrong though, I haven't worked with euro data much.

maresb commented 1 month ago

I am also struggling with ECMWF data from the new CDS beta API due to a missing time dimension. In hopes of getting any "grib expert" a head-start on this, here's a minimal example with a nice tiny 4MB GRIB, also available via Google Drive.

import fsspec
import xarray as xr
from kerchunk.grib2 import GribToZarr

DOWNLOADED_GRIB = "/tmp/downloaded.grib"

def download_grib():
    """You don't actually need to run this. Instead download the grib to the
    DOWNLOADED_GRIB location. This is included for completeness.
    """
    import os

    import cdsapi

    # os.environ["CDSAPI_KEY"] = "xxxxxxxx"

    dataset = "reanalysis-era5-single-levels"
    request = {
        "product_type": ["reanalysis"],
        "variable": ["2m_temperature"],
        "year": ["2024"],
        "month": ["01"],
        "day": ["01", "02"],
        "time": ["00:00"],
        "data_format": "grib",
        "area": [90, 0, -90, 360],
    }
    c = cdsapi.Client(url="https://cds-beta.climate.copernicus.eu/api")
    c.retrieve(dataset, request, DOWNLOADED_GRIB)

# download_grib()

ds0 = xr.open_dataset(DOWNLOADED_GRIB, engine="cfgrib")
print("Loaded with xarray:")
print(ds0)
print()

refs_for_messages = GribToZarr(DOWNLOADED_GRIB).translate()
assert len(refs_for_messages) == 1
ref = refs_for_messages[0]

fs = fsspec.filesystem("reference", fo=ref)
m = fs.get_mapper("")
ds = xr.open_zarr(m, consolidated=False).load()
print("Loaded with kerchunk:")
print(ds)

Loaded with xarray:
<xarray.Dataset> Size: 8MB
Dimensions:     (time: 2, latitude: 721, longitude: 1440)
Coordinates:
    number      int64 8B ...
  * time        (time) datetime64[ns] 16B 2024-01-01 2024-01-02
    step        timedelta64[ns] 8B ...
    surface     float64 8B ...
  * latitude    (latitude) float64 6kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * longitude   (longitude) float64 12kB 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
    valid_time  (time) datetime64[ns] 16B ...
Data variables:
    t2m         (time, latitude, longitude) float32 8MB ...
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2024-07-31T13:59 GRIB to CDM+CF via cfgrib-0.9.1...

Loaded with kerchunk:
<xarray.Dataset> Size: 8MB
Dimensions:     (latitude: 721, longitude: 1440)
Coordinates:
  * latitude    (latitude) float64 6kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
  * longitude   (longitude) float64 12kB 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
    number      int64 8B 0
    step        timedelta64[ns] 8B 00:00:00
    surface     float64 8B 0.0
    time        datetime64[ns] 8B 2024-01-01
    valid_time  datetime64[ns] 8B 2024-01-01
Data variables:
    t2m         (latitude, longitude) float64 8MB 246.6 246.6 ... 248.0 248.0
Attributes:
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_edition:            1
    GRIB_subCentre:          0
    institution:             European Centre for Medium-Range Weather Forecasts

Note how for the same file when loaded with xarray, time is a dimension of length 2, while with kerchunk it's a coordinate of length 1.

martindurant commented 1 month ago

@mpiannucci , interestingly, gribberish doesn't parse the given file at all, but gives

ValueError: No valid GRIB messages found

This might be an interesting test case for you. I'd say there's a better chance, in the long run, of getting the sub-messages out of gribberish than eccodes (which requires the global "multi" state for reading this).

keltonhalbert commented 1 month ago

Well, I found some time to try and give this a good, long wrestle and I have come to the conclusion that the problem exists within eccodes.codes_new_from_message. See this issue I opened with eccodes-python for more detail.

In short, eccodes.codes_grib_multi_support_on() does not appear to have any influence on eccodes.codes_new_from_message, meaning that streaming bytes with multi-field enabled grib2 messages may not be possible. I am hoping I am either 1) incredibly wrong, and the folks at ECMWF can clarify how to handle this with eccodes, or 2) maybe this can be fixed/addressed and make everyone happy. However, I'm just not sure where to go from here.

Some progress that I made is that scan_grib can now at least recognize the presence of the v10 arrays within a multi-field message, and can be tested in my personal fork. This was only achievable by using the fsspec file handler and passing it to eccodes.codes_new_from_file. The catch is, however, that when decoding/reading the arrays from grib2, the u10 and v10 arrays are identical because Kerchunk doesn't know where the v10 array starts. Presumably, this is because things get handled by codecs.py/GRIBCodec, which expects to be given a grib2 message as a buffer (just like the current version of scan_grib).

It appears to me that changing GRIBCodec to take a fsspec file is undesirable for multiple reasons, especially since byte streaming is kind of the whole point... but I'm at a loss for what else to do if the problem can't be remedied from eccodes side. I'm certainly out of my depth and grasping at straws to make sense of things, so if anyone with more knowledge and insight into the awfulness of the grib2 world can provide a means of encoding the appropriate array offset for the multi-field arrays, please please PLEASE speak up!

Moving forward, I see a few options...

  1. I'm wrong and ECMWF clarifies how to handle multi-field messages from byte streams
  2. The issue lies within eccodes and ECMWF chooses to address it on their end
  3. ECMWF throws their hands up in the air and says sorry, leaving it up to Kerchunk to figure out how to handle things.

If we get stuck with option 3, we need to figure out how to preserve what is effectively byte streaming, while tricking eccodes into thinking it's getting a file. Unfortunately, you cannot just pass eccodes a BytesIO object to achieve the desired behavior. I don't know the core of fsspec very well, but I know it was intended to handle remote file streaming, so perhaps the GRIBCodec class needs to be refactored to use fsspec rather than bytes? Input from the maintainers is appreciated...

Edit: One more idea: if we knew how to brute-force the byte offsets to read the appropriate array, that could work. Unfortunately, there are no grib2 fields/IDs that broadcast the presence of multi-field messages, at least not in the eccodes API, and certainly not encoded in the file metadata as far as I can tell.

martindurant commented 1 month ago

so perhaps the GRIBCodec class needs to be refactored to use fsspec rather than bytes?

From what you have linked, the C code in eccodes wants to use a file descriptor, i.e., real local open file. That means, fsspec can't do it (except to copy the bytes to a local temporary location, RAMdisk or something). How strange that they should have a completely different way to handle a bytes buffer versus a file!
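
The "copy the bytes to a local temporary location" workaround could look roughly like this. It is a stdlib-only sketch: `as_local_file` is a hypothetical helper, the payload is placeholder bytes rather than a real GRIB message, and the point where the handle would be passed to eccodes is left out. Passing `tmpdir="/dev/shm"` on Linux would be one way to keep the copy in memory.

```python
import contextlib
import os
import tempfile
from typing import IO, Iterator, Optional


@contextlib.contextmanager
def as_local_file(data: bytes, tmpdir: Optional[str] = None) -> Iterator[IO[bytes]]:
    """Write bytes to a temporary file and yield it opened for binary reading."""
    fd, path = tempfile.mkstemp(suffix=".grib2", dir=tmpdir)
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(data)
        with open(path, "rb") as handle:
            # A real file object with a descriptor, unlike io.BytesIO.
            yield handle
    finally:
        os.remove(path)


with as_local_file(b"GRIB....7777") as f:
    has_descriptor = f.fileno() >= 0
    content = f.read()
print(has_descriptor, len(content))
```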

Unfortunately, there are no grib2 fields/IDs that broadcast the presence of multi-field messages

You said you had code to detect this? I don't see from the commit. But if knowing beforehand is enough, we can store the fact in the init parameters for the grib codec.

keltonhalbert commented 1 month ago

@martindurant Sorry for the confusion regarding the commit - that code doesn't explicitly detect the presence of multi-field messages, but by relying on eccodes.codes_new_from_file with the fsspec handler in the version of scan_grib in my commit, there is a way to brute-force the detection.

In short, if you keep track of the mid and offset variables with something like mid_last and offset_last, you can detect a multi-message entry because the mid will remain the same between loop iterations, but offset will change. Hopefully that makes some sense? That's the only way I could figure out to detect it so far.
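
As a rough illustration of that bookkeeping, here is the comparison on its own, with no eccodes calls. `flag_multi_field` is a hypothetical helper and the (mid, offset) pairs are made up; in the fork they would come from the scan_grib loop.

```python
from typing import Iterable, List, Tuple


def flag_multi_field(entries: Iterable[Tuple[int, int]]) -> List[bool]:
    """Flag entries whose mid repeats the previous one while the offset changed."""
    flags = []
    mid_last = offset_last = None
    for mid, offset in entries:
        flags.append(mid == mid_last and offset != offset_last)
        mid_last, offset_last = mid, offset
    return flags


# Made-up (mid, offset) pairs; the second entry mimics a sub-message.
flags = flag_multi_field([(101, 0), (101, 70814), (102, 141628)])
print(flags)
```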

That said, clearly eccodes.codes_new_from_file is able to detect multi-field grib messages on its own. Perhaps digging into that logic could help provide a means of doing the same for scan_grib and GRIBCodec?

martindurant commented 1 month ago

It sounds like, we can find multi cases, then, but only if we copy messages to disk first. That's totally doable and what scan_grib did do once upon a time. For decoding at read time, well it would work but be pretty annoying! We could advise users to ensure that their temporary storage is memory-based.

In short, if you keep track of the mid and offset variables with something like mid_last and offset_last, you can detect a multi-message entry because the mid will remain the same between loop iterations, but offset will change. Hopefully that makes some sense? That's the only way I could figure out to detect it so far.

Interesting! Would you mind showing this for an example multi file? I wonder if with the two offsets we have enough information to make two non-multi messages for eccodes at runtime.