earth-mover / icechunk

Open-source, cloud-native transactional tensor storage engine
https://icechunk.io
Apache License 2.0

Invalid Datatype ('>f8') when trying to convert kerchunk reference to icechunk reference. #367

Closed jbusecke closed 2 weeks ago

jbusecke commented 3 weeks ago

Hey folks,

I am trying to prepare some examples of how to virtualize CMIP6 netcdf files (together with @norlandrhagen) and ran into the issue below.

I was trying to do the following. It seems to work fine with netCDF4 files, but I am running into an issue with a particular netCDF-3 dataset:

urls = [
 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_195001010000-195412311800.nc',
 'http://esgf3.dkrz.de/thredds/fileServer/cmip6/CMIP/BCC/BCC-CSM2-MR/historical/r1i1p1f1/6hrLev/ps/gn/v20181127/ps_6hrLev_BCC-CSM2-MR_historical_r1i1p1f1_gn_195501010000-195912311800.nc',
]

I ran the following on the latest pangeo-notebook image (2024.10.26):

import dask
import xarray as xr
from virtualizarr import open_virtual_dataset
from dask.distributed import Client

client = Client(n_workers=16)
client

def _process_single_file(filename, reader_options):
    vds = open_virtual_dataset(filename,  indexes={}, reader_options=reader_options)
    return vds

def generate_kerchunk_refs_with_delayed(url_lst, reader_options=None):
    delayed_results = [
        dask.delayed(_process_single_file)(filename, reader_options)
        for filename in url_lst
    ]

    # compute the delayed objects
    results = dask.compute(*delayed_results)

    # concat virtual datasets
    combined_vds = xr.concat(list(results), dim="time", coords="minimal", compat="override")
    return combined_vds

vds = generate_kerchunk_refs_with_delayed(urls)

vds.virtualize.to_kerchunk('my_little_ref.parquet', format='parquet')

I then installed git+https://github.com/zarr-developers/VirtualiZarr and git+https://github.com/zarr-developers/VirtualiZarr@RT_kerchunk_bug and did the following:

import xarray as xr 
from virtualizarr import open_virtual_dataset
from icechunk import IcechunkStore, StorageConfig, StoreConfig, VirtualRefConfig, S3Credentials

vds = open_virtual_dataset('my_little_ref.parquet', filetype='kerchunk', indexes={})

storage = StorageConfig.filesystem('my_little_ref.icechunk')

store = IcechunkStore.create(
    storage=storage,
    config=StoreConfig(
        virtual_ref_config=VirtualRefConfig.s3_anonymous(region='us-east-2'),
    )
)
vds
<xarray.Dataset> Size: 20GB
Dimensions:    (time: 94900, bnds: 2, lat: 160, lon: 320)
Coordinates:
    lon        (lon) >f8 3kB ManifestArray<shape=(320,), dtype=>f8, chunks=(3...
    time       (time) >f8 759kB ManifestArray<shape=(94900,), dtype=>f8, chun...
    lat        (lat) >f8 1kB ManifestArray<shape=(160,), dtype=>f8, chunks=(1...
Dimensions without coordinates: bnds
Data variables:
    time_bnds  (time, bnds) >f8 2MB ManifestArray<shape=(94900, 2), dtype=>f8...
    lat_bnds   (time, lat, bnds) >f8 243MB ManifestArray<shape=(94900, 160, 2...
    lon_bnds   (time, lon, bnds) >f8 486MB ManifestArray<shape=(94900, 320, 2...
    ps         (time, lat, lon) >f4 19GB ManifestArray<shape=(94900, 160, 320...
Attributes: (12/49)
    Conventions:            CF-1.7 CMIP-6.2
    activity_id:            CMIP
    branch_method:          Standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  2289.0
    cmor_version:           3.3.2
    ...                     ...
    table_id:               6hrLev
    table_info:             Creation Date:(30 July 2018) MD5:e53ff52009d0b97d...
    title:                  BCC-CSM2-MR output prepared for CMIP6
    tracking_id:            hdl:21.14100/9f9682b7-829e-429b-a4be-758bcb445bb4
    variable_id:            ps
    variant_label:          r1i1p1f1

But when I run

vds.virtualize.to_icechunk(store)

I am getting the following error (full trace below):

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/metadata/v3.py:627, in DataType.parse(cls, dtype)
    625     data_type = DataType.from_numpy(dtype)
    626 except KeyError as e:
--> 627     raise ValueError(f"Invalid V3 data_type: {dtype}") from e
    628 return data_type

ValueError: Invalid V3 data_type: >f8

This seems to originate here and has something to do with the endianness of the data.
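As far as I can tell, the mismatch is that Zarr V3 names data types without a byte order ("float64" rather than '<f8'/'>f8'), while netCDF-3 stores values big-endian, so the `dtype.str` lookup fails. A minimal numpy-only sketch of the mismatch (no zarr or virtualizarr needed):

```python
import numpy as np

# netCDF-3 stores numbers big-endian, so kerchunk/VirtualiZarr carry a
# big-endian dtype through to the Zarr V3 metadata layer.
big = np.dtype('>f8')
print(big.str)  # '>f8' -- there is no endianness-qualified V3 type name for this

# Zarr V3 type names ("float64", "int32", ...) carry no byte order, so only
# the native-endian form of the dtype can be mapped onto them.
native = big.newbyteorder('=')
print(native == np.dtype('float64'))  # True
```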

Just wanted to flag this here, and was wondering if there is a workaround. It would be amazing to have both netCDF4 and netCDF-3 files working as a demonstration.

```
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/metadata/v3.py:625, in DataType.parse(cls, dtype)
    624 try:
--> 625     data_type = DataType.from_numpy(dtype)
    626 except KeyError as e:

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/metadata/v3.py:607, in DataType.from_numpy(cls, dtype)
    590 dtype_to_data_type = {
    591     "|b1": "bool",
    592     "bool": "bool",
   (...)
    605     "
--> 607 return DataType[dtype_to_data_type[dtype.str]]

KeyError: '>f8'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[1], line 16
      7 storage = StorageConfig.filesystem('../refs_new/ref_http_nc3.icechunk')
      9 store = IcechunkStore.create(
     10     storage=storage,
     11     config=StoreConfig(
     12         virtual_ref_config=VirtualRefConfig.s3_anonymous(region='us-east-2'),
     13     )
     14 )
---> 16 vds.virtualize.to_icechunk(store)

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/accessor.py:58, in VirtualiZarrDatasetAccessor.to_icechunk(self, store)
     47 """
     48 Write an xarray dataset to an Icechunk store.
     49 (...)
     54 store: IcechunkStore
     55 """
     56 from virtualizarr.writers.icechunk import dataset_to_icechunk
---> 58 dataset_to_icechunk(self.ds, store)

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/writers/icechunk.py:61, in dataset_to_icechunk(ds, store)
     54 root_group = Group.from_store(store=store)
     56 # TODO this is Frozen, the API for setting attributes must be something else
     57 # root_group.attrs = ds.attrs
     58 # for k, v in ds.attrs.items():
     59 #     root_group.attrs[k] = encode_zarr_attr_value(v)
---> 61 return write_variables_to_icechunk_group(
     62     ds.variables,
     63     ds.attrs,
     64     store=store,
     65     group=root_group,
     66 )

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/writers/icechunk.py:94, in write_variables_to_icechunk_group(variables, attrs, store, group)
     92 # Then finish by writing the virtual variables to the same group
     93 for name, var in virtual_variables.items():
---> 94     write_virtual_variable_to_icechunk(
     95         store=store,
     96         group=group,
     97         name=name,
     98         var=var,
     99     )

File /srv/conda/envs/notebook/lib/python3.12/site-packages/virtualizarr/writers/icechunk.py:133, in write_virtual_variable_to_icechunk(store, group, name, var)
    130 zarray = ma.zarray
    132 # creates array if it doesn't already exist
--> 133 arr = group.require_array(
    134     name=name,
    135     shape=zarray.shape,
    136     chunk_shape=zarray.chunks,
    137     dtype=encode_dtype(zarray.dtype),
    138     codecs=zarray._v3_codec_pipeline(),
    139     dimension_names=var.dims,
    140     fill_value=zarray.fill_value,
    141     # TODO fill_value?
    142 )
    144 # TODO it would be nice if we could assign directly to the .attrs property
    145 for k, v in var.attrs.items():

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/group.py:1705, in Group.require_array(self, name, **kwargs)
   1683 def require_array(self, name: str, **kwargs: Any) -> Array:
   1684     """Obtain an array, creating if it doesn't exist.
   1685
   1686     (...)
   1703     a : Array
   1704     """
-> 1705     return Array(self._sync(self._async_group.require_array(name, **kwargs)))

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/sync.py:185, in SyncMixin._sync(self, coroutine)
    182 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
    183     # TODO: refactor this to to take *args and **kwargs and pass those to the method
    184     # this should allow us to better type the sync wrapper
--> 185     return sync(
    186         coroutine,
    187         timeout=config.get("async.timeout"),
    188     )

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/sync.py:141, in sync(coro, loop, timeout)
    138 return_result = next(iter(finished)).result()
    140 if isinstance(return_result, BaseException):
--> 141     raise return_result
    142 else:
    143     return return_result

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/sync.py:100, in _runner(coro)
     95 """
     96 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
     97 exception, the exception will be returned.
     98 """
     99 try:
--> 100     return await coro
    101 except Exception as ex:
    102     return ex

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/group.py:1064, in AsyncGroup.require_array(self, name, shape, dtype, exact, **kwargs)
   1062         raise TypeError(f"Incompatible dtype ({ds.dtype} vs {dtype})")
   1063 except KeyError:
-> 1064     ds = await self.create_array(name, shape=shape, dtype=dtype, **kwargs)
   1066 return ds

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/group.py:935, in AsyncGroup.create_array(self, name, shape, dtype, fill_value, attributes, chunk_shape, chunk_key_encoding, codecs, dimension_names, chunks, dimension_separator, order, filters, compressor, exists_ok, data)
    866 async def create_array(
    867     self,
    868     name: str,
   (...)
    892     data: npt.ArrayLike | None = None,
    893 ) -> AsyncArray[ArrayV2Metadata] | AsyncArray[ArrayV3Metadata]:
    894     """
    895     Create a Zarr array within this AsyncGroup.
    896     This method lightly wraps AsyncArray.create.
   (...)
    933
    934     """
--> 935     return await AsyncArray.create(
    936         self.store_path / name,
    937         shape=shape,
    938         dtype=dtype,
    939         chunk_shape=chunk_shape,
    940         fill_value=fill_value,
    941         chunk_key_encoding=chunk_key_encoding,
    942         codecs=codecs,
    943         dimension_names=dimension_names,
    944         attributes=attributes,
    945         chunks=chunks,
    946         dimension_separator=dimension_separator,
    947         order=order,
    948         filters=filters,
    949         compressor=compressor,
    950         exists_ok=exists_ok,
    951         zarr_format=self.metadata.zarr_format,
    952         data=data,
    953     )

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/array.py:487, in AsyncArray.create(cls, store, shape, dtype, zarr_format, fill_value, attributes, chunk_shape, chunk_key_encoding, codecs, dimension_names, chunks, dimension_separator, order, filters, compressor, exists_ok, data)
    483 if compressor is not None:
    484     raise ValueError(
    485         "compressor cannot be used for arrays with version 3. Use bytes-to-bytes codecs instead."
    486     )
--> 487 result = await cls._create_v3(
    488     store_path,
    489     shape=shape,
    490     dtype=dtype_parsed,
    491     chunk_shape=_chunks,
    492     fill_value=fill_value,
    493     chunk_key_encoding=chunk_key_encoding,
    494     codecs=codecs,
    495     dimension_names=dimension_names,
    496     attributes=attributes,
    497     exists_ok=exists_ok,
    498 )
    499 elif zarr_format == 2:
    500     if dtype is str or dtype == "str":
    501         # another special case: zarr v2 added the vlen-utf8 codec

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/array.py:581, in AsyncArray._create_v3(cls, store_path, shape, dtype, chunk_shape, fill_value, chunk_key_encoding, codecs, dimension_names, attributes, exists_ok)
    574 if isinstance(chunk_key_encoding, tuple):
    575     chunk_key_encoding = (
    576         V2ChunkKeyEncoding(separator=chunk_key_encoding[1])
    577         if chunk_key_encoding[0] == "v2"
    578         else DefaultChunkKeyEncoding(separator=chunk_key_encoding[1])
    579     )
--> 581 metadata = ArrayV3Metadata(
    582     shape=shape,
    583     data_type=dtype,
    584     chunk_grid=RegularChunkGrid(chunk_shape=chunk_shape),
    585     chunk_key_encoding=chunk_key_encoding,
    586     fill_value=fill_value,
    587     codecs=codecs,
    588     dimension_names=tuple(dimension_names) if dimension_names else None,
    589     attributes=attributes or {},
    590 )
    592 array = cls(metadata=metadata, store_path=store_path)
    593 await array._save_metadata(metadata, ensure_parents=True)

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/metadata/v3.py:228, in ArrayV3Metadata.__init__(self, shape, data_type, chunk_grid, chunk_key_encoding, fill_value, codecs, attributes, dimension_names, storage_transformers)
    224 """
    225 Because the class is a frozen dataclass, we set attributes using object.__setattr__
    226 """
    227 shape_parsed = parse_shapelike(shape)
--> 228 data_type_parsed = DataType.parse(data_type)
    229 chunk_grid_parsed = ChunkGrid.from_dict(chunk_grid)
    230 chunk_key_encoding_parsed = ChunkKeyEncoding.from_dict(chunk_key_encoding)

File /srv/conda/envs/notebook/lib/python3.12/site-packages/zarr/core/metadata/v3.py:627, in DataType.parse(cls, dtype)
    625     data_type = DataType.from_numpy(dtype)
    626 except KeyError as e:
--> 627     raise ValueError(f"Invalid V3 data_type: {dtype}") from e
    628 return data_type

ValueError: Invalid V3 data_type: >f8
```
rabernat commented 3 weeks ago

Thanks Julius. This is a known issue with Zarr V3 (https://github.com/zarr-developers/zarr-python/issues/2324) and is not Icechunk specific.
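Until that lands in zarr-python, one partial fallback (a hedged sketch, not icechunk-specific): for any variable you are willing to materialize rather than keep as a virtual reference, big-endian values can be converted to native byte order without changing any values, which sidesteps the V3 dtype restriction for that variable.

```python
import numpy as np

# If a variable is actually loaded into memory (e.g. via `loadable_variables`)
# rather than kept as a virtual chunk reference, its big-endian values can be
# byte-swapped to the native order before writing.
a = np.arange(3, dtype='>f8')            # big-endian, as decoded from netCDF-3
b = a.astype(a.dtype.newbyteorder('='))  # same values, native byte order

print(b.dtype == np.dtype('float64'))  # True
print((a == b).all())                  # True: values are unchanged
```

Note this only helps for variables you materialize; purely virtual references still point at the original big-endian bytes on disk, so the real fix has to come from zarr-python.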

jbusecke commented 2 weeks ago

Ah thanks, I was not quite sure where to raise this. Off to hunt for another dataset in the CMIP catalog, weee.