ecmwf / anemoi-datasets

Apache License 2.0

Problem building dataset from NetCDF files #7

Open jbanomedina opened 4 months ago

jbanomedina commented 4 months ago

What happened?

My goal is to build a dataset from NetCDF files using the anemoi-datasets library. However, I get an error when using NetCDF files as the source. I have tried both version 0.4.0 (installed using pip) and the develop branch (installed by cloning the repository). I was able to build a dataset from a GRIB file successfully, but for my project the data is in NetCDF format.

What are the steps to reproduce the bug?

The code needed to reproduce this error is the following. 1) First, I download a sample NetCDF file from the CDS using a Python script.

import cdsapi

# Request parameters (named `variables` to avoid shadowing the built-in `vars`)
variables = ['10m_u_component_of_wind', '10m_v_component_of_wind']
year = 2013

# Download two days of 6-hourly ERA5 single-level data as NetCDF
c = cdsapi.Client()
c.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'format': 'netcdf',
        'variable': variables,
        'year': year,
        'month': [
            '01',
        ],
        'day': [
            '01', '02',
        ],
        'time': [
            '00:00', '06:00', '12:00', '18:00',
        ],
    },
    './sample.nc')

2) Second, I point to this sample in the recipe.yaml file.

dates:
  start: 2013-01-01T00:00:00
  end: 2013-01-01T06:00:00
  frequency: 6h
input:
  netcdf:
    path: ./sample.nc
    param: [u10, v10]  # I also tried [10u, 10v]
    levtype: sfc

3) Type this in the command line:

anemoi-datasets create recipe.yaml dataset.zarr
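
For reference, the `dates` block of the recipe describes the datetimes that the tool will look for in the source. Here is a minimal sketch of that expansion in plain Python, assuming an inclusive end date and a fixed step. The helper `expand_dates` is hypothetical and is not part of anemoi-datasets.

```python
from datetime import datetime, timedelta


def expand_dates(start, end, frequency_hours):
    """Return every datetime from start to end (inclusive) at a fixed step."""
    step = timedelta(hours=frequency_hours)
    dates = []
    current = start
    while current <= end:
        dates.append(current)
        current += step
    return dates


# Values taken from the recipe above: 2013-01-01T00:00 to 06:00, every 6h.
dates = expand_dates(datetime(2013, 1, 1, 0), datetime(2013, 1, 1, 6), 6)
print([d.isoformat() for d in dates])
```

Under these assumptions the recipe covers two datetimes (00:00 and 06:00 on 2013-01-01), which is consistent with the "Found 2 datetimes" line in the log output.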

Version

v0.4.0

Platform (OS and architecture)

Linux exp-18-17 4.18.0-513.24.1.el8_9.x86_64 #1 SMP Thu Apr 4 18:13:02 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Relevant log output

Setting flatten_grid=True in config
Setting ensemble_dimension=2 in config
Setting flatten_grid=True in config
Setting ensemble_dimension=2 in config
2024-07-16 14:42:59 INFO {'start': datetime.datetime(2013, 1, 1, 0, 0, tzinfo=datetime.timezone.utc), 'end': datetime.datetime(2013, 1, 1, 6, 0, tzinfo=datetime.timezone.utc), 'frequency': '6h', 'group_by': 'monthly'}
2024-07-16 14:42:59 INFO <anemoi.datasets.dates.groups.Groups object at 0x155147fbcee0>
2024-07-16 14:42:59 INFO ✅ INPUT_BUILDER
2024-07-16 14:42:59 INFO FunctionAction: path=./sample.nc param=['u10', 'v10'] levtype=sfc 
2024-07-16 14:42:59 INFO FunctionAction: path=./sample.nc param=['u10', 'v10'] levtype=sfc 
2024-07-16 14:42:59 INFO Minimal input (using only the first date) :
2024-07-16 14:42:59 INFO netcdf(['2013-01-01T00:00:00'])
Config loaded ok:
2024-07-16 14:42:59 INFO {'config_path': '/expanse/nfs/cw3e/cwp167/projects/test-attribution/recipe.yaml', 'dates': {'start': datetime.datetime(2013, 1, 1, 0, 0, tzinfo=datetime.timezone.utc), 'end': datetime.datetime(2013, 1, 1, 6, 0, tzinfo=datetime.timezone.utc), 'frequency': '6h', 'group_by': 'monthly'}, 'input': {'netcdf': {'path': './sample.nc', 'param': ['u10', 'v10'], 'levtype': 'sfc'}}, 'dataset_status': 'experimental', 'description': 'No description provided.', 'licence': 'unknown', 'attribution': 'unknown', 'build': {'group_by': 'monthly', 'use_grib_paramid': False, 'variable_naming': 'default'}, 'output': {'order_by': {'valid_datetime': 'ascending', 'param_level': 'ascending', 'number': 'ascending'}, 'remapping': {'param_level': '{param}_{levelist}'}, 'statistics': 'param_level', 'chunking': {'dates': 1, 'ensembles': 1}, 'dtype': 'float32', 'flatten_grid': True, 'ensemble_dimension': 2}, 'statistics': {}, 'reading_chunks': None}
Found 2 datetimes.
2024-07-16 14:42:59 INFO Dates: Found 2 datetimes, in 1 groups: 
2024-07-16 14:42:59 INFO Missing dates: 0
Found 2 datetimes 2.
2024-07-16 14:43:00 INFO Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-07-16 14:43:00 INFO Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
2024-07-16 14:43:00 INFO NumExpr defaulting to 8 threads.
2024-07-16 14:43:00 ERROR Error in execute
Traceback (most recent call last):
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 433, in datasource
    return self.action.function(FunctionContext(self), self.dates, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 72, in execute
    return load_netcdfs("📁", "path", context, dates, path, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 66, in load_netcdfs
    check(what, ds, given_paths, valid_datetime=dates, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 40, in check
    raise ValueError(f"Expected {count} fields, got {len(ds)} (kwargs={kwargs}, {what}s={paths})")
ValueError: Expected 2 fields, got 0 (kwargs={'valid_datetime': ['2013-01-01T00:00:00'], 'param': ['u10', 'v10'], 'levtype': 'sfc'}, paths=['./sample.nc'])
Traceback (most recent call last):
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/utils/cli.py", line 128, in cli_main
    cmd.run(args)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/commands/create.py", line 30, in run
    c.create()
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/__init__.py", line 153, in create
    self.init()
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/__init__.py", line 50, in init
    obj.initialise(check_name=check_name)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/loaders.py", line 271, in initialise
    variables = self.minimal_input.variables
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 227, in variables
    return self._coords.variables
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 190, in variables
    self._build_coords
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 143, in _build_coords
    from_data = self.owner.get_cube().user_coords
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 350, in get_cube
    ds = self.datasource
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/functools.py", line 981, in __get__
    val = self.func(instance)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 81, in wrapper
    result = method(self, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/template.py", line 82, in wrapper
    result = method(self, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/template.py", line 42, in wrapper
    result = method(self, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/input.py", line 433, in datasource
    return self.action.function(FunctionContext(self), self.dates, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 72, in execute
    return load_netcdfs("📁", "path", context, dates, path, *args, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 66, in load_netcdfs
    check(what, ds, given_paths, valid_datetime=dates, **kwargs)
  File "/expanse/nfs/cw3e/cwp167/envs/nwm-anemoi/lib/python3.10/site-packages/anemoi/datasets/create/functions/sources/netcdf.py", line 40, in check
    raise ValueError(f"Expected {count} fields, got {len(ds)} (kwargs={kwargs}, {what}s={paths})")
ValueError: Expected 2 fields, got 0 (kwargs={'valid_datetime': ['2013-01-01T00:00:00'], 'param': ['u10', 'v10'], 'levtype': 'sfc'}, paths=['./sample.nc'])
2024-07-16 14:43:00 ERROR 
💣 Expected 2 fields, got 0 (kwargs={'valid_datetime': ['2013-01-01T00:00:00'], 'param': ['u10', 'v10'], 'levtype': 'sfc'}, paths=['./sample.nc'])
2024-07-16 14:43:00 ERROR 💣 Exiting
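
The message "Expected 2 fields, got 0" suggests that the source works out how many fields the request should produce and compares that with what was actually selected from the file. Below is a rough sketch of such a check, assuming the expected count is the product of the lengths of the list-valued selection keys (1 date x 2 params here). Both functions are illustrative and are not copied from `netcdf.py`.

```python
def expected_field_count(**selection):
    """Multiply the lengths of all list- or tuple-valued selection keys."""
    count = 1
    for value in selection.values():
        if isinstance(value, (list, tuple)):
            count *= len(value)
    return count


def check_selection(fields, **selection):
    """Raise if the number of selected fields differs from the expected count."""
    count = expected_field_count(**selection)
    if len(fields) != count:
        raise ValueError(
            f"Expected {count} fields, got {len(fields)} (kwargs={selection})"
        )


# Selection copied from the error message above.
selection = {
    "valid_datetime": ["2013-01-01T00:00:00"],
    "param": ["u10", "v10"],
    "levtype": "sfc",
}
print(expected_field_count(**selection))

try:
    check_selection([], **selection)  # an empty selection, as in the log
except ValueError as error:
    print(error)
```

If that assumption holds, getting 0 fields would mean that nothing in `sample.nc` matched the requested combination of `param`, `levtype` and date, rather than that the file could not be read.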

Accompanying data

No response

Organisation

No response

b8raoult commented 1 month ago

Sorry about the delay. We have done a lot of work on NetCDF. Can you try again with the latest version? Also, if you plan to use data from the CDS, I suggest that you download it in GRIB: this avoids an unnecessary conversion and will be faster.

jbanomedina commented 1 month ago

Thank you very much for working on this, and for developing this amazing tool. I tried again with the latest version, and the previous problem is solved. I am now getting the error below with ERA5, but it does not seem critical: the resulting .zarr file looks fine, and I am able to open it with Python using the anemoi-datasets library. Could this error come from the fact that ERA5 is not a forecast and therefore does not contain the attribute "forecast_reference_time"?

anemoi-datasets create recipe-era5-test.yaml ${workdir}/data/era5/era5_${yearInit}-01-01.zarr
2024-10-14 13:56:02 INFO Task init((),{}) starting
2024-10-14 13:56:08 INFO Setting flatten_grid=True in config
2024-10-14 13:56:08 INFO Setting ensemble_dimension=2 in config
2024-10-14 13:56:08 INFO Setting flatten_grid=True in config
2024-10-14 13:56:08 INFO Setting ensemble_dimension=2 in config
2024-10-14 13:56:08 INFO {'start': datetime.datetime(2013, 1, 1, 0, 0), 'end': datetime.datetime(2013, 1, 1, 18, 0), 'frequency': '6h', 'group_by': 'monthly'}
2024-10-14 13:56:08 INFO Groups(dates=1)
2024-10-14 13:56:08 INFO FunctionAction: path=./era5_2013-01-01.nc param=['10u'] 
2024-10-14 13:56:11 INFO Minimal input for 'init' step (using only the first date) :
2024-10-14 13:56:11 INFO netcdf(['2013-01-01T00:00:00'])
2024-10-14 13:56:11 INFO Config loaded ok:
2024-10-14 13:56:11 INFO Found 4 datetimes.
2024-10-14 13:56:11 INFO Dates: Found 4 datetimes, in 1 groups: 
2024-10-14 13:56:11 INFO Missing dates: 0
2024-10-14 13:57:22 INFO Found 1 variables : 10u.
2024-10-14 13:57:22 INFO Found 1 ensembles : 0.
2024-10-14 13:57:22 INFO gridpoints size: [1038240, 1038240]
2024-10-14 13:57:22 INFO resolution=None
2024-10-14 13:57:22 INFO total_shape = [4, 1, 1, 1038240]
2024-10-14 13:57:22 INFO chunks=(1, 1, 1, 1038240)
2024-10-14 13:57:22 INFO Creating Dataset './era5_2013-01-01.zarr', with total_shape=[4, 1, 1, 1038240], chunks=(1, 1, 1, 1038240) and dtype='float32'
2024-10-14 13:57:22 ERROR Error in retrieving metadata (cannot build data request info) for XArrayMetadata({'variable': '10u', 'time': '0000', 'date': '20130101', 'step': 0, 'valid_datetime': '2013-01-01T00:00:00'})
Traceback (most recent call last):
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/anemoi/datasets/create/input.py", line 111, in _data_request
    date = field.datetime()["valid_time"]
           ^^^^^^^^^^^^^^^^
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/earthkit/data/core/fieldlist.py", line 512, in datetime
    return self._metadata.datetime()
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/earthkit/data/core/metadata.py", line 312, in datetime
    "base_time": self._base_datetime(),
                 ^^^^^^^^^^^^^^^^^^^^^
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/anemoi/datasets/create/functions/sources/xarray/metadata.py", line 84, in _base_datetime
    return self._field.forecast_reference_time
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "./envs/nwm-anemoi/lib/python3.12/site-packages/anemoi/datasets/create/functions/sources/xarray/field.py", line 106, in forecast_reference_time
    return self.owner.forecast_reference_time
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Variable' object has no attribute 'forecast_reference_time'
2024-10-14 13:57:22 WARNING Dataset name error: the dataset name 'era5_2013-01-01' does not follow naming convention. Does not match ^(\w+)-([\w-]+)-(\w+)-(\w+)-(\d\d\d\d)-(\d\d\d\d)-(\d+h)-v(\d+)-?([a-zA-Z0-9-]+)?$
2024-10-14 13:57:24 INFO Number of years 0 < 10, leaving out 20%. end=np.datetime64('2013-01-01T12:00:00')
2024-10-14 13:57:24 INFO Will compute statistics from 2013-01-01T00:00:00 to 2013-01-01T12:00:00
2024-10-14 13:57:24 INFO Task load((),{}) starting
2024-10-14 13:57:24 INFO {'end': '2013-01-01T18:00:00', 'frequency': '6h', 'group_by': 'monthly', 'start': '2013-01-01T00:00:00'}
2024-10-14 13:57:24 INFO Groups(dates=1)
2024-10-14 13:57:24 INFO FunctionAction: param=['10u'] path=./era5_2013-01-01.nc 
Loading 3/4: 100%|████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  5.68it/s]
2024-10-14 13:57:28 INFO Name               : /data
Type               : zarr.core.Array
Data type          : float32
Shape              : (4, 1, 1, 1038240)
Chunk shape        : (1, 1, 1, 1038240)
Order              : C
Read-only          : True
Compressor         : Blosc(cname='lz4', clevel=5, shuffle=SHUFFLE, blocksize=0)
Store type         : zarr.storage.DirectoryStore
No. bytes          : 16611840 (15.8M)
No. bytes stored   : 13288828 (12.7M)
Storage ratio      : 1.3
Chunks initialized : 4/4

2024-10-14 13:57:28 INFO Task finalise((),{}) starting
2024-10-14 13:57:28 INFO Variables minimum maximum mean stdev has_nans
10u -21.56 22.45 -0.37 5.77 0.00
2024-10-14 13:57:28 INFO Wrote statistics in ./era5_2013-01-01.zarr
Computing size of ./era5_2013-01-01.zarr: 16it [00:00, 4772.02it/s]
2024-10-14 13:57:28 INFO Total size: 12.7 MiB
2024-10-14 13:57:28 INFO Total number of files: 62
2024-10-14 13:57:28 INFO Task patch((),{}) starting
2024-10-14 13:57:28 INFO ✅ Remove _create_yaml_config
2024-10-14 13:57:28 INFO Dataset changed by patch
2024-10-14 13:57:28 INFO Task init_additions((),{}) starting
2024-10-14 13:57:28 WARNING No delta found in kwargs, no addtions will be computed.
2024-10-14 13:57:28 INFO Task run_additions((),{}) starting
2024-10-14 13:57:28 WARNING No delta found in kwargs, no addtions will be computed.
2024-10-14 13:57:28 INFO Task finalise_additions((),{}) starting
2024-10-14 13:57:28 WARNING No delta found in kwargs, no addtions will be computed.
Computing size of ./era5_2013-01-01.zarr: 16it [00:00, 10111.32it/s]
2024-10-14 13:57:28 INFO Total size: 12.7 MiB
2024-10-14 13:57:28 INFO Total number of files: 62
2024-10-14 13:57:28 INFO Task cleanup((),{}) starting
2024-10-14 13:57:28 INFO Task verify((),{}) starting
2024-10-14 13:57:28 INFO Verifying dataset at ./era5_2013-01-01.zarr
2024-10-14 13:57:28 INFO ./era5_2013-01-01.zarr
2024-10-14 13:57:28 INFO Create completed in 1 minute 25 seconds
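
The "Dataset name error" warning in the log above includes the regular expression that the name is checked against. A short sketch with Python's `re` module shows why `era5_2013-01-01` fails. The second name is a made-up example that merely satisfies the pattern; it is not an official dataset name.

```python
import re

# Pattern copied from the warning in the log above.
NAME_PATTERN = re.compile(
    r"^(\w+)-([\w-]+)-(\w+)-(\w+)-(\d\d\d\d)-(\d\d\d\d)-(\d+h)-v(\d+)-?([a-zA-Z0-9-]+)?$"
)

names = [
    "era5_2013-01-01",  # name used in this run
    "test-era5-an-sfc-2013-2013-6h-v1",  # hypothetical name, for illustration only
]
for name in names:
    status = "matches" if NAME_PATTERN.match(name) else "does not match"
    print(f"{name}: {status}")
```

The name used in this run has no frequency part such as `6h` and no `-v<number>` suffix, so it cannot match. The warning did not stop the build: the log still ends with "Create completed".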