ecmwf / anemoi-datasets

Apache License 2.0

Access to GCP Storage via its HTTPS URL #61

Open CSyl opened 1 month ago

CSyl commented 1 month ago

What happened?

When trying to source a zarr from GCP storage by running anemoi-datasets create config.yaml test.zarr, the following error occurs:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
2024-09-30 13:15:18 ERROR 
💣 Expecting value: line 1 column 1 (char 0)
2024-09-30 13:15:18 ERROR 💣 Exiting

I am able to obtain data from S3 storage (via S3 or HTTPS URLs) and from the ECMWF object store (via HTTPS). However, accessing an object in GCP Storage (e.g. https://console.cloud.google.com/storage/browser/gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr) leads to a JSON error. Would it be possible to add a feature that supports sourcing a zarr object from a GCP storage location?

What are the steps to reproduce the bug?

1) Created a configuration file (saved as test_gcp_zarr_httpsurl.yaml) as follows:

dates:
  start: 2021-12-31T09:00:00
  end: 2021-12-31T22:00:00
  frequency: 1h

input:
  xarray-zarr:
    url: "https://console.cloud.google.com/storage/browser/gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr"
    param: [2m_temperature,
    10m_u_component_of_wind,
    geopotential,
    10m_v_component_of_wind,
    surface_pressure]

2) Executed: anemoi-datasets create test_gcp_zarr_httpsurl.yaml test_gcp.zarr and obtained the following error:

2024-09-30 13:14:59 INFO Task init((),{}) starting
2024-09-30 13:15:00 INFO Setting flatten_grid=True in config
2024-09-30 13:15:00 INFO Setting ensemble_dimension=2 in config
2024-09-30 13:15:00 INFO Setting flatten_grid=True in config
2024-09-30 13:15:00 INFO Setting ensemble_dimension=2 in config
2024-09-30 13:15:00 INFO {'start': datetime.datetime(2021, 12, 31, 9, 0), 'end': datetime.datetime(2021, 12, 31, 22, 0), 'frequency': '1h', 'group_by': 'monthly'}
2024-09-30 13:15:00 INFO Groups(dates=1)
2024-09-30 13:15:00 INFO FunctionAction: url=https://console.cloud.google.com/storage/browser/gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr param=['2m_temperature', '10m_u_component_of_wind', 'geopotential', '10m_v_component_of_wind', 'surface_pressure'] 
2024-09-30 13:15:06 INFO Minimal input for 'init' step (using only the first date) :
2024-09-30 13:15:06 INFO xarray-zarr(['2021-12-31T09:00:00'])
2024-09-30 13:15:06 INFO Config loaded ok:
2024-09-30 13:15:06 INFO Found 14 datetimes.
2024-09-30 13:15:06 INFO Dates: Found 14 datetimes, in 1 groups: 
2024-09-30 13:15:06 INFO Missing dates: 0
2024-09-30 13:15:18 ERROR Error in execute
Traceback (most recent call last):
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 590, in datasource
    return _tidy(self.action.function(FunctionContext(self), self.dates, *args, **kwargs))
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/xarray_zarr.py", line 15, in execute
    return load_many("🇿", context, dates, url, *args, **kwargs)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/xarray/__init__.py", line 77, in load_many
    result.append(load_one(emoji, context, dates, path, **kwargs))
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/xarray/__init__.py", line 47, in load_one
    data = xr.open_zarr(name_to_zarr_store(dataset), **options)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1103, in open_zarr
    ds = open_dataset(
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/api.py", line 611, in open_dataset
    backend_ds = backend.open_dataset(
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1173, in open_dataset
    store = ZarrStore.open_group(
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/zarr.py", line 483, in open_group
    zarr_group, consolidate_on_close, close_store_on_close = _get_open_params(
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1335, in _get_open_params
    zarr_group = zarr.open_consolidated(store, **open_kwargs)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/zarr/convenience.py", line 1360, in open_consolidated
    meta_store = ConsolidatedStoreClass(store, metadata_key=metadata_key)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/zarr/storage.py", line 3046, in __init__
    meta = json_loads(self.store[metadata_key])
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/zarr/util.py", line 76, in json_loads
    return json.loads(ensure_text(s, "utf-8"))
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/utils/cli.py", line 135, in cli_main
    cmd.run(args)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/commands/create.py", line 64, in run
    self.serial_create(args)
  File "..../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/commands/create.py", line 74, in serial_create
    task("init", options)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/commands/create.py", line 29, in task
    result = c.run()
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/__init__.py", line 355, in run
    return self._run()
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/__init__.py", line 375, in _run
    variables = self.minimal_input.variables
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 484, in variables
    self.build_coords()
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 435, in build_coords
    from_data = self.get_cube().user_coords
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 230, in get_cube
    ds = self.datasource
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 90, in wrapper
    result = method(self, *args, **kwargs)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/template.py", line 26, in wrapper
    result = method(self, *args, **kwargs)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/trace.py", line 56, in wrapper
    result = method(self, *args, **kwargs)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 590, in datasource
    return _tidy(self.action.function(FunctionContext(self), self.dates, *args, **kwargs))
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/xarray_zarr.py", line 15, in execute
    return load_many("🇿", context, dates, url, *args, **kwargs)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/xarray/__init__.py", line 77, in load_many
    result.append(load_one(emoji, context, dates, path, **kwargs))
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/xarray/__init__.py", line 47, in load_one
    data = xr.open_zarr(name_to_zarr_store(dataset), **options)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1103, in open_zarr
    ds = open_dataset(
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/api.py", line 611, in open_dataset
    backend_ds = backend.open_dataset(
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1173, in open_dataset
    store = ZarrStore.open_group(
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/zarr.py", line 483, in open_group
    zarr_group, consolidate_on_close, close_store_on_close = _get_open_params(
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/xarray/backends/zarr.py", line 1335, in _get_open_params
    zarr_group = zarr.open_consolidated(store, **open_kwargs)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/zarr/convenience.py", line 1360, in open_consolidated
    meta_store = ConsolidatedStoreClass(store, metadata_key=metadata_key)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/zarr/storage.py", line 3046, in __init__
    meta = json_loads(self.store[metadata_key])
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/site-packages/zarr/util.py", line 76, in json_loads
    return json.loads(ensure_text(s, "utf-8"))
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File ".../miniconda3/envs/anemoi_test/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
**json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)**
2024-09-30 13:15:18 ERROR 
💣 Expecting value: line 1 column 1 (char 0)
2024-09-30 13:15:18 ERROR 💣 Exiting

Version

0.5.0

Platform (OS and architecture)

Linux

Relevant log output

No response

Accompanying data

No response

Organisation

No response

b8raoult commented 1 month ago

This is not a problem with anemoi-datasets. This will also fail:

import xarray as xr

xr.open_zarr('https://console.cloud.google.com/storage/browser/gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr')

The URL is not correct. The correct URL is gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr
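The console.cloud.google.com address is the Cloud Console web page for browsing the bucket, so a zarr reader asking it for metadata presumably receives something other than JSON, which would explain the JSONDecodeError. A minimal sketch of turning such a browser URL into the gs:// form is shown here; the helper name console_url_to_gs is hypothetical and is not part of anemoi-datasets or xarray.

```python
from urllib.parse import unquote, urlparse

# Host and path prefix used by the Cloud Console bucket browser,
# as seen in the URL quoted in this issue.
CONSOLE_HOST = "console.cloud.google.com"
CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs(url: str) -> str:
    """Convert a Cloud Console browser URL to a gs:// URL.

    Hypothetical helper: URLs that do not look like Console browser
    URLs are returned unchanged.
    """
    parsed = urlparse(url)
    if parsed.netloc != CONSOLE_HOST or not parsed.path.startswith(CONSOLE_PREFIX):
        return url
    # What follows the prefix is "<bucket>/<object path>".
    object_path = unquote(parsed.path[len(CONSOLE_PREFIX):]).rstrip("/")
    return f"gs://{object_path}"


url = (
    "https://console.cloud.google.com/storage/browser/"
    "gcp-public-data-arco-era5/ar/"
    "1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr"
)
print(console_url_to_gs(url))
```

The printed value is the gs:// URL given above, which can then be used as the url of the xarray-zarr input.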

CSyl commented 1 month ago

Hi @b8raoult, thank you for your response. Much appreciated! Yes, I agree: with xr.open_zarr() I am able to open the zarr using the URL you mentioned above (gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr). However, when running the following Python script, which calls functions from the anemoi-datasets develop source code, I get the following:

1) Python script executed:

from anemoi.datasets.data import add_dataset_path, open_dataset
add_dataset_path("gs://gcp-public-data-arco-era5/ar/")
ds = open_dataset("1959-2022-1h-360x181_equiangular_with_poles_conservative")

Error message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/ai_pipeline/lib/python3.10/site-packages/zarr/hierarchy.py:520, in Group.__getattr__(self, item)
    519 try:
--> 520     return self.__getitem__(item)
    521 except KeyError:

File ~/miniconda3/envs/ai_pipeline/lib/python3.10/site-packages/zarr/hierarchy.py:500, in Group.__getitem__(self, item)
    499 else:
--> 500     raise KeyError(item)

KeyError: 'data'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
Cell In[1], line 6
      3 add_dataset_path("gs://gcp-public-data-arco-era5/ar/")
      5 # Opening entire dataset w/out filter.
----> 6 ds = open_dataset("1959-2022-1h-360x181_equiangular_with_poles_conservative")
      7 ds

File .../src/anemoi/datasets/data/__init__.py:29, in open_dataset(*args, **kwargs)
     28 def open_dataset(*args, **kwargs):
---> 29     ds = _open_dataset(*args, **kwargs)
     30     ds = ds.mutate()
     31     ds.arguments = {"args": args, "kwargs": kwargs}

File .../src/anemoi/datasets/data/misc.py:267, in _open_dataset(*args, **kwargs)
    265 sets = []
    266 for a in args:
--> 267     sets.append(_open(a))
    269 if "xy" in kwargs:
    270     from .xy import xy_factory

File .../src/anemoi/datasets/data/misc.py:180, in _open(a)
    177     return Zarr(a).mutate()
    179 if isinstance(a, str):
--> 180     return Zarr(zarr_lookup(a)).mutate()
    182 if isinstance(a, PurePath):
    183     return _open(str(a)).mutate()

File .../src/anemoi/datasets/data/stores.py:167, in Zarr.__init__(self, path)
    164     self.z = open_zarr(self.path)
    166 # This seems to speed up the reading of the data a lot
--> 167 self.data = self.z.data
    168 self.missing = set()

File ~/miniconda3/envs/ai_pipeline/lib/python3.10/site-packages/zarr/hierarchy.py:522, in Group.__getattr__(self, item)
    520     return self.__getitem__(item)
    521 except KeyError:
--> 522     raise AttributeError

AttributeError: 

The above error does not occur if I open an object from an S3 bucket or the ECMWF object store (e.g. https://object-store.os-api.cci1.ecmwf.int/ml-examples). For example, when sourcing from the ECMWF object store:

from anemoi.datasets.data import add_dataset_path, open_dataset
add_dataset_path("https://object-store.os-api.cci1.ecmwf.int/ml-examples/")
ds = open_dataset("an-oper-2023-2023-2p5-6h-v1")

As a result, the zarr located in ECMWF's object store loads without the error that I got when trying to source from Google Cloud Storage.
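The traceback above ends at self.data = self.z.data in stores.py, which suggests that open_dataset expects a top-level member called data in the Zarr group it opens, and the KeyError: 'data' suggests the GCP store has no such member. A rough sketch of that check follows; the function name has_data_member is hypothetical, and plain dictionaries stand in for zarr groups so the snippet runs without network access. The variable names in the stand-ins are taken from the YAML file in this issue.

```python
def has_data_member(group) -> bool:
    """Return True if a mapping-like group has a top-level 'data' member.

    Hypothetical helper mirroring the attribute access that fails in
    the traceback (self.z.data).
    """
    return "data" in group


# Stand-in for a store organised by variable name, like the one in the YAML file.
per_variable_store = {"2m_temperature": None, "geopotential": None}

# Stand-in for a store with the top-level member the traceback looks for.
store_with_data = {"data": None}

print(has_data_member(per_variable_store))  # False
print(has_data_member(store_with_data))     # True
```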

b8raoult commented 1 month ago

Yes, this is the zarr you put in the YAML file, as the url of the xarray-zarr input:

dates:
  start: 2021-12-31T09:00:00
  end: 2021-12-31T22:00:00
  frequency: 1h

input:
  xarray-zarr:
    url: "gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr"
    param: [2m_temperature,
    10m_u_component_of_wind,
    geopotential,
    10m_v_component_of_wind,
    surface_pressure]

CSyl commented 1 month ago

Hi @b8raoult, I have tried setting up the YAML file (gcp-gsurl-sample-zarr.yaml) as follows:

dates:
  start: 2021-12-31T09:00:00
  end: 2021-12-31T10:00:00
  frequency: 1h

input:
  xarray-zarr:
    url: "gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr"
    param: [2m_temperature,
    10m_u_component_of_wind,
    geopotential,
    10m_v_component_of_wind,
    surface_pressure]

In this case, I then ran the latest release of anemoi-datasets (v0.5.6) via:

anemoi-datasets create gcp-gsurl-sample-zarr.yaml test.zarr

and the following error also occurs:

2024-10-04 10:02:01 INFO 🎬 Task init((),{}) starting
2024-10-04 10:02:02 INFO Setting flatten_grid=True in config
2024-10-04 10:02:02 INFO Setting ensemble_dimension=2 in config
2024-10-04 10:02:02 INFO Setting flatten_grid=True in config
2024-10-04 10:02:02 INFO Setting ensemble_dimension=2 in config
2024-10-04 10:02:02 INFO {'start': datetime.datetime(2021, 12, 31, 9, 0), 'end': datetime.datetime(2021, 12, 31, 10, 0), 'frequency': '1h', 'group_by': 'monthly'}
2024-10-04 10:02:02 INFO Groups(dates=1,<anemoi.datasets.dates.StartEndDates object at 0x7fea1befa6b0>)
2024-10-04 10:02:02 INFO FunctionAction: url=gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr param=['2m_temperature', '10m_u_component_of_wind', 'geopotential', '10m_v_component_of_wind', 'surface_pressure'] 
2024-10-04 10:02:04 INFO Groups: Groups(dates=1,<anemoi.datasets.dates.StartEndDates object at 0x7fea1befa6b0>)
2024-10-04 10:02:07 INFO Minimal input for 'init' step (using only the first date) : GroupOfDates(dates=['2021-12-31T09:00:00'])
2024-10-04 10:02:07 INFO xarray-zarr(GroupOfDates(dates=['2021-12-31T09:00:00']))
2024-10-04 10:02:07 INFO Config loaded ok:
2024-10-04 10:02:07 INFO Found 2 datetimes.
2024-10-04 10:02:07 INFO Dates: Found 2 datetimes, in 1 groups: 
2024-10-04 10:02:07 INFO Missing dates: 0
2024-10-04 10:02:17 WARNING Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
2024-10-04 10:02:21 WARNING Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
2024-10-04 10:02:26 WARNING Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
2024-10-04 10:02:26 WARNING Authentication failed using Compute Engine authentication due to unavailable metadata server.
2024-10-04 10:02:26 WARNING Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fea18e92a40>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
2024-10-04 10:02:27 WARNING Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fea18e93850>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
2024-10-04 10:02:29 WARNING Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fea18e93c40>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
2024-10-04 10:02:33 WARNING Compute Engine Metadata server unavailable on attempt 4 of 5. Reason:

b8raoult commented 1 month ago

That's OK. I got that warning message as well. It will eventually finish.

CSyl commented 1 month ago

Hi @b8raoult, is there an intermediate step required to get around connecting to the metadata server? At the moment, when the latest release of anemoi-datasets (v0.5.6) is run with the aforementioned configuration file via:

anemoi-datasets create gcp-gsurl-sample-zarr.yaml test.zarr

the framework hangs on the series of "WARNING Compute Engine Metadata server unavailable on attempt" messages and has not progressed after 1 hour of waiting. What gets generated is a zarr, test.zarr, with an empty _build folder.
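The warnings come from the Google authentication libraries looking for credentials on a Compute Engine metadata server. For a public bucket, one possible workaround at the xarray level is to request anonymous access, which gcsfs supports through token="anon". The sketch below is only an assumption about what may help: whether the xarray-zarr source in anemoi-datasets forwards such options is not confirmed in this thread, and the function names are hypothetical.

```python
def anonymous_gcs_options() -> dict:
    """Storage options asking gcsfs for anonymous (credential-free) access."""
    # token="anon" tells gcsfs not to look for Google credentials at all,
    # which should avoid the metadata-server lookup for public buckets.
    return {"token": "anon"}


def open_public_zarr(url: str):
    """Open a public gs:// zarr with xarray using anonymous access.

    Hypothetical helper; requires xarray, zarr and gcsfs to be installed.
    """
    import xarray as xr  # imported lazily so the module loads without xarray

    return xr.open_zarr(url, storage_options=anonymous_gcs_options())


# Example call (needs network access, so it is left commented out):
# ds = open_public_zarr(
#     "gs://gcp-public-data-arco-era5/ar/"
#     "1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr"
# )
```

If this opens the store promptly outside of anemoi-datasets, it would point to the credential lookup, rather than the data transfer itself, as the cause of the delay.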