Open CSyl opened 1 month ago
This is not a problem with anemoi-datasets. This will also fail:
import xarray as xr
xr.open_zarr('https://console.cloud.google.com/storage/browser/gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr')
The URL is not correct. The correct URL is gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr
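To illustrate the difference between the two URL forms, here is a minimal sketch of a helper that turns a Cloud Console browser link into the gs:// form shown above. The function name console_url_to_gs is hypothetical (it is not part of xarray or anemoi-datasets), and it assumes the console link always has the form https://console.cloud.google.com/storage/browser/<bucket>/<path>:

```python
from urllib.parse import urlparse

# Path prefix used by the Cloud Console storage browser links quoted in this thread.
CONSOLE_PREFIX = "/storage/browser/"


def console_url_to_gs(url: str) -> str:
    """Convert a Cloud Console browser URL to a gs:// URL (hypothetical helper)."""
    parsed = urlparse(url)

    # Already in the form that xr.open_zarr() accepts: return unchanged.
    if parsed.scheme == "gs":
        return url

    is_console = parsed.netloc == "console.cloud.google.com"
    if not is_console or not parsed.path.startswith(CONSOLE_PREFIX):
        raise ValueError(f"Not a Cloud Console storage browser URL: {url}")

    # Everything after the prefix is "<bucket>/<object path>".
    return "gs://" + parsed.path[len(CONSOLE_PREFIX):]
```

The console link is a web page for browsing the bucket, not the store itself, which is why only the gs:// form can be opened as a zarr.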
Hi @b8raoult, thank you for your response. Much appreciated! Yes, I agree that xr.open_zarr() can open the zarr with the URL you mentioned above (gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr). However, when I run the following Python script, which calls functions from the anemoi-datasets develop source code, I get the following:
1) Python script executed:
from anemoi.datasets.data import add_dataset_path, open_dataset
add_dataset_path("gs://gcp-public-data-arco-era5/ar/")
ds = open_dataset("1959-2022-1h-360x181_equiangular_with_poles_conservative")
Error message:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/miniconda3/envs/ai_pipeline/lib/python3.10/site-packages/zarr/hierarchy.py:520, in Group.__getattr__(self, item)
519 try:
--> 520 return self.__getitem__(item)
521 except KeyError:
File ~/miniconda3/envs/ai_pipeline/lib/python3.10/site-packages/zarr/hierarchy.py:500, in Group.__getitem__(self, item)
499 else:
--> 500 raise KeyError(item)
KeyError: 'data'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
Cell In[1], line 6
3 add_dataset_path("gs://gcp-public-data-arco-era5/ar/")
5 # Opening entire dataset w/out filter.
----> 6 ds = open_dataset("1959-2022-1h-360x181_equiangular_with_poles_conservative")
7 ds
File .../src/anemoi/datasets/data/__init__.py:29, in open_dataset(*args, **kwargs)
28 def open_dataset(*args, **kwargs):
---> 29 ds = _open_dataset(*args, **kwargs)
30 ds = ds.mutate()
31 ds.arguments = {"args": args, "kwargs": kwargs}
File .../src/anemoi/datasets/data/misc.py:267, in _open_dataset(*args, **kwargs)
265 sets = []
266 for a in args:
--> 267 sets.append(_open(a))
269 if "xy" in kwargs:
270 from .xy import xy_factory
File .../src/anemoi/datasets/data/misc.py:180, in _open(a)
177 return Zarr(a).mutate()
179 if isinstance(a, str):
--> 180 return Zarr(zarr_lookup(a)).mutate()
182 if isinstance(a, PurePath):
183 return _open(str(a)).mutate()
File .../src/anemoi/datasets/data/stores.py:167, in Zarr.__init__(self, path)
164 self.z = open_zarr(self.path)
166 # This seems to speed up the reading of the data a lot
--> 167 self.data = self.z.data
168 self.missing = set()
File ~/miniconda3/envs/ai_pipeline/lib/python3.10/site-packages/zarr/hierarchy.py:522, in Group.__getattr__(self, item)
520 return self.__getitem__(item)
521 except KeyError:
--> 522 raise AttributeError
AttributeError:
The above error does not occur if I open an object from an S3 bucket or the ECMWF object store (e.g. https://object-store.os-api.cci1.ecmwf.int/ml-examples). For example, when sourcing from the ECMWF object store:
from anemoi.datasets.data import add_dataset_path, open_dataset
add_dataset_path("https://object-store.os-api.cci1.ecmwf.int/ml-examples/")
ds = open_dataset("an-oper-2023-2023-2p5-6h-v1")
As a result, the zarr located in ECMWF's object store loads without the error that I got when sourcing from Google Cloud Storage.
Yes, this is the zarr you put in the YAML file.
dates:
start: 2021-12-31T09:00:00
end: 2021-12-31T22:00:00
frequency: 1h
input:
xarray-zarr:
url: "gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr"
param: [2m_temperature,
10m_u_component_of_wind,
geopotential,
10m_v_component_of_wind,
surface_pressure]
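For context on the earlier traceback: it fails at self.z.data in Zarr.__init__, with KeyError: 'data', so open_dataset() appears to expect a zarr whose root group contains a member named data. A rough pre-check along those lines might look like the sketch below. The helper name looks_like_anemoi_dataset is hypothetical, and the dicts are only stand-ins for opened zarr groups (the ARCO-like keys are taken from the param list above):

```python
def looks_like_anemoi_dataset(group) -> bool:
    """Return True if the root group has a top-level 'data' member.

    'group' can be any container supporting the 'in' operator; the
    assumption here is that an opened zarr group behaves that way.
    """
    return "data" in group


# Stand-ins for opened zarr groups, for illustration only.
anemoi_like = {"data": None}
arco_like = {"2m_temperature": None, "geopotential": None, "surface_pressure": None}

print(looks_like_anemoi_dataset(anemoi_like))  # True
print(looks_like_anemoi_dataset(arco_like))    # False
```

A store that fails this check would be used as an input to anemoi-datasets create, as in the YAML above, rather than being passed to open_dataset() directly.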
Hi @b8raoult, I have tried setting up the YAML file (gcp-gsurl-sample-zarr.yaml) as such:
dates:
start: 2021-12-31T09:00:00
end: 2021-12-31T10:00:00
frequency: 1h
input:
xarray-zarr:
url: "gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr"
param: [2m_temperature,
10m_u_component_of_wind,
geopotential,
10m_v_component_of_wind,
surface_pressure]
In this case, I then ran the latest release of anemoi-datasets (v0.5.6) via:
anemoi-datasets create gcp-gsurl-sample-zarr.yaml test.zarr
and the following error also occurs:
2024-10-04 10:02:01 INFO 💬 Task init((),{}) starting
2024-10-04 10:02:02 INFO Setting flatten_grid=True in config
2024-10-04 10:02:02 INFO Setting ensemble_dimension=2 in config
2024-10-04 10:02:02 INFO Setting flatten_grid=True in config
2024-10-04 10:02:02 INFO Setting ensemble_dimension=2 in config
2024-10-04 10:02:02 INFO {'start': datetime.datetime(2021, 12, 31, 9, 0), 'end': datetime.datetime(2021, 12, 31, 10, 0), 'frequency': '1h', 'group_by': 'monthly'}
2024-10-04 10:02:02 INFO Groups(dates=1,<anemoi.datasets.dates.StartEndDates object at 0x7fea1befa6b0>)
2024-10-04 10:02:02 INFO FunctionAction: url=gs://gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr param=['2m_temperature', '10m_u_component_of_wind', 'geopotential', '10m_v_component_of_wind', 'surface_pressure']
2024-10-04 10:02:04 INFO Groups: Groups(dates=1,<anemoi.datasets.dates.StartEndDates object at 0x7fea1befa6b0>)
2024-10-04 10:02:07 INFO Minimal input for 'init' step (using only the first date) : GroupOfDates(dates=['2021-12-31T09:00:00'])
2024-10-04 10:02:07 INFO xarray-zarr(GroupOfDates(dates=['2021-12-31T09:00:00']))
2024-10-04 10:02:07 INFO Config loaded ok:
2024-10-04 10:02:07 INFO Found 2 datetimes.
2024-10-04 10:02:07 INFO Dates: Found 2 datetimes, in 1 groups:
2024-10-04 10:02:07 INFO Missing dates: 0
2024-10-04 10:02:17 WARNING Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
2024-10-04 10:02:21 WARNING Compute Engine Metadata server unavailable on attempt 2 of 3. Reason: timed out
2024-10-04 10:02:26 WARNING Compute Engine Metadata server unavailable on attempt 3 of 3. Reason: timed out
2024-10-04 10:02:26 WARNING Authentication failed using Compute Engine authentication due to unavailable metadata server.
2024-10-04 10:02:26 WARNING Compute Engine Metadata server unavailable on attempt 1 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fea18e92a40>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
2024-10-04 10:02:27 WARNING Compute Engine Metadata server unavailable on attempt 2 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fea18e93850>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
2024-10-04 10:02:29 WARNING Compute Engine Metadata server unavailable on attempt 3 of 5. Reason: HTTPConnectionPool(host='metadata.google.internal', port=80): Max retries exceeded with url: /computeMetadata/v1/instance/service-accounts/default/?recursive=true (Caused by NameResolutionError("<urllib3.connection.HTTPConnection object at 0x7fea18e93c40>: Failed to resolve 'metadata.google.internal' ([Errno -2] Name or service not known)"))
2024-10-04 10:02:33 WARNING Compute Engine Metadata server unavailable on attempt 4 of 5. Reason:
That's OK. I got that warning message as well. It will eventually finish.
Hi @b8raoult, is there an intermediate step required to get around connecting to the metadata server? At the moment, when the latest release of anemoi-datasets (v0.5.6) is run with the aforementioned configuration file via:
anemoi-datasets create gcp-gsurl-sample-zarr.yaml test.zarr
The framework hangs at the series of "WARNING Compute Engine Metadata server unavailable on attempt" messages and does not progress after 1 hour of waiting. What gets generated is a zarr, test.zarr, with an empty _build folder.
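One thing that may be worth checking outside anemoi-datasets: the warnings come from the Google authentication libraries looking for Compute Engine credentials, and for a public bucket that lookup can be skipped by asking gcsfs for anonymous access. The sketch below assumes xarray, zarr and gcsfs are installed; the helper names are hypothetical, and whether the xarray-zarr source in anemoi-datasets forwards such options is not established in this thread:

```python
def anonymous_gcs_options() -> dict:
    """Storage options asking gcsfs for anonymous (unauthenticated) access."""
    return {"token": "anon"}


def open_public_gcs_zarr(url: str):
    """Open a public gs:// zarr without any credential lookup (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require xarray.
    import xarray as xr

    # storage_options is passed through to the gcsfs filesystem backend.
    return xr.open_zarr(url, storage_options=anonymous_gcs_options())


# Example call (needs network access, so it is left commented out):
# ds = open_public_gcs_zarr(
#     "gs://gcp-public-data-arco-era5/ar/"
#     "1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr"
# )
```

If this opens quickly while the create command still hangs, that would suggest the delay is in the credential lookup rather than in reading the data.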
What happened?
When trying to source a zarr from GCP storage by executing
anemoi-datasets create config.yaml test.zarr
the following error occurs:
I am able to obtain data from S3 storage (via S3 or HTTPS URLs) and from the ECMWF object store (via HTTPS). However, accessing an object in GCP Storage (e.g. https://console.cloud.google.com/storage/browser/gcp-public-data-arco-era5/ar/1959-2022-1h-360x181_equiangular_with_poles_conservative.zarr) leads to a JSON error. Would it be possible to add a feature that accommodates sourcing a zarr object from a GCP storage location?
What are the steps to reproduce the bug?
1) Created a configuration file (saved as test_gcp_zarr_httpsurl.yaml) as such:
2) Executed:
anemoi-datasets create test_gcp_zarr_httpsurl.yaml test_gcp.zarr
and obtained the following error:
Version
0.5.0
Platform (OS and architecture)
Linux
Relevant log output
No response
Accompanying data
No response
Organisation
No response