ecmwf-lab / ai-models-graphcast

Apache License 2.0

Missing files during download #4

Closed ringsaturn closed 11 months ago

ringsaturn commented 11 months ago

Command:

ai-models --download-assets --input cds --date 20231118 --time 0000 graphcast --assets ../assets/assets/

It appears that the code was trying to read a file that had just been deleted:

2023-11-24 12:11:42,777 WARNING CliMetLab cache: deleting /tmp/climetlab-ringsaturn/cds-retriever-8f35388d579d1132d0fa0b6314eceaabd4dce7944a8a06f242bc36ab1edfd1b8.cache (15.8 MiB, 0.6 second)

Then:

2023-11-24 12:13:47,773 INFO Creating training data: 2 minutes 23 seconds.
2023-11-24 12:13:47,773 INFO Creating input data (total): 2 minutes 23 seconds.
2023-11-24 12:13:47,773 ERROR [Errno 2] No such file or directory: '/tmp/climetlab-ringsaturn/cds-retriever-8f35388d579d1132d0fa0b6314eceaabd4dce7944a8a06f242bc36ab1edfd1b8.cache'
Traceback (most recent call last):
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/ai_models/__main__.py", line 264, in _main
    model.run()
  File "/home/ringsaturn/graphcast-operational/ai-models-graphcast/ai_models_graphcast/model.py", line 205, in run
    start_date=self.start_date,
               ^^^^^^^^^^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/functools.py", line 1001, in __get__
    val = self.func(instance)
          ^^^^^^^^^^^^^^^^^^^
  File "/home/ringsaturn/graphcast-operational/ai-models-graphcast/ai_models_graphcast/model.py", line 192, in start_date
    return self.all_fields.order_by(valid_datetime="descending")[0].datetime
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/core/index.py", line 210, in order_by
    indices = sorted(indices, key=functools.cmp_to_key(cmp))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/core/index.py", line 207, in cmp
    return order.compare_elements(self[i], self[j])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/core/index.py", line 87, in compare_elements
    n = v(a_metadata(k), b_metadata(k))
          ^^^^^^^^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/readers/grib/codes.py", line 503, in metadata
    date = self.metadata("validityDate")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/readers/grib/codes.py", line 522, in metadata
    return self[name]
           ~~~~^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/readers/grib/codes.py", line 528, in __getitem__
    proc = self.handle.get
           ^^^^^^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/readers/grib/codes.py", line 334, in handle
    self._handle = CodesReader.from_cache(self.path).at_offset(self._offset)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/readers/grib/codes.py", line 307, in from_cache
    return cache[path]
           ~~~~~^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/readers/grib/codes.py", line 280, in __getitem__
    c = self[key] = CodesReader(path)
                    ^^^^^^^^^^^^^^^^^
  File "/home/ringsaturn/miniconda3/envs/graphcast/lib/python3.11/site-packages/climetlab/readers/grib/codes.py", line 296, in __init__
    self.file = open(self.path, "rb")
                ^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/climetlab-ringsaturn/cds-retriever-8f35388d579d1132d0fa0b6314eceaabd4dce7944a8a06f242bc36ab1edfd1b8.cache'
2023-11-24 12:13:47,779 ERROR It is possible that some files requited by graphcast are missing.
2023-11-24 12:13:47,779 ERROR Rerun the command as:
2023-11-24 12:13:47,779 ERROR    /home/ringsaturn/miniconda3/envs/graphcast/bin/ai-models --download-assets --input cds --date 20231118 --time 0000 graphcast --assets ../assets/assets/
2023-11-24 12:13:47,779 INFO Total time: 2 minutes 26 seconds.

The full log is here: https://gist.github.com/ringsaturn/930fdd499b224f0e8778a6fd680808d3#file-errors-log-L558

floriankrb commented 11 months ago

2023-11-24 12:08:53,339 WARNING CliMetLab cache: trying to free 739.2 GiB
2023-11-24 12:08:53,339 WARNING Decaching files oldest than 2023-11-24T12:08:21.254908 (age: 32 seconds)
2023-11-24 12:08:53,352 WARNING CliMetLab cache: could not free 739.2 GiB

For some reason, CliMetLab could not free enough cache space and is deleting files as soon as it downloads them. This can happen for various reasons, perhaps because there is not enough space on your cache disk. The caching guide may help: https://climetlab.readthedocs.io/en/latest/guide/caching.html

Adding the versions of the packages and the platform would also help debugging.
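
A quick way to check the disk-space theory is to measure the disk that actually holds the cache. The snippet below is only a sketch using the Python standard library: `bytes_to_free` and `report` are hypothetical helper names, and the 90% usage limit is an assumption for illustration, not a value read from CliMetLab.

```python
import shutil
from pathlib import Path


def bytes_to_free(total: int, free: int, max_usage: float) -> int:
    """Return how many bytes must be freed so disk usage drops to max_usage.

    max_usage is a fraction of the disk size (0.9 means 90%). The limit is an
    assumed value for illustration, not CliMetLab's actual setting.
    """
    used = total - free
    allowed = int(total * max_usage)
    return max(0, used - allowed)


def report(cache_dir: str, max_usage: float = 0.9) -> str:
    """Describe the free space on the disk that holds cache_dir."""
    # resolve() follows symlinks, so a cache directory linked onto another
    # disk is measured on the disk that really stores the files.
    real = Path(cache_dir).resolve()
    usage = shutil.disk_usage(real)
    need = bytes_to_free(usage.total, usage.free, max_usage)
    gib = 1024 ** 3
    return (
        f"{real}: {usage.free / gib:.1f} GiB free, "
        f"{need / gib:.1f} GiB over the assumed limit"
    )


if __name__ == "__main__":
    # Replace "." with the cache directory you want to inspect.
    print(report("."))
```

If the amount over the limit is larger than the whole cache, deleting cached files alone can never get the disk back under the limit, which would match the "could not free" warning.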

ringsaturn commented 11 months ago

System info:

NAME="Ubuntu"
VERSION="18.04.6 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.6 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

I'm using Python 3.11 via Miniconda. For context, the most relevant packages are as follows:

climetlab==0.18.6
ai-models==0.2.15
-e git+https://github.com/ecmwf-lab/ai-models-graphcast@023e6537791cb9650e92b85251eb528f0948cc4e#egg=ai_models_graphcast

Or the more detailed freeze output:

```
absl-py==2.0.0
ai-models==0.2.15
-e git+https://github.com/ecmwf-lab/ai-models-graphcast@023e6537791cb9650e92b85251eb528f0948cc4e#egg=ai_models_graphcast
aliyun-python-sdk-core==2.13.36
aliyun-python-sdk-kms==2.16.2
anyio==4.1.0
apopy==0.1.5
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
attrs==23.1.0
Babel==2.13.1
beautifulsoup4==4.12.2
bleach==6.1.0
branca==0.7.0
Cartopy==0.22.0
cdsapi==0.6.1
certifi==2023.11.17
cffi==1.16.0
cfgrib==0.9.10.4
cftime==1.6.3
charset-normalizer==3.3.2
chex==0.1.85
click==8.1.7
climetlab==0.18.6
cloudpickle==3.0.0
comm==0.2.0
contourpy==1.2.0
crcmod==1.7
cryptography==41.0.5
cycler==0.12.1
dask==2023.11.0
debugpy==1.8.0
decorator==5.1.1
defusedxml==0.7.1
dm-haiku==0.0.11
dm-tree==0.1.8
earthkit-meteo==0.0.1
eccodes==1.6.1
ecmwf-api-client==1.6.3
ecmwf-opendata==0.2.0
ecmwflibs==0.5.7
entrypoints==0.4
etils==1.5.2
executing==2.0.1
fastjsonschema==2.19.0
filelock==3.13.1
findlibs==0.0.5
flax==0.7.5
fonttools==4.45.1
fqdn==1.5.1
fsspec==2023.10.0
GPUtil==1.4.0
h11==0.14.0
httpcore==0.17.3
httpx==0.24.1
idna==3.4
imageio==2.33.0
importlib-metadata==6.8.0
importlib-resources==6.1.1
ipykernel==6.27.0
ipython==8.17.2
ipywidgets==8.1.1
isoduration==20.11.0
jax==0.4.20
jaxlib==0.4.20
jedi==0.19.1
Jinja2==3.1.2
jmespath==0.10.0
jmp==0.0.4
jraph==0.0.6.dev0
json5==0.9.14
jsonpointer==2.4
jsonschema==4.20.0
jsonschema-specifications==2023.11.1
jupyter-events==0.9.0
jupyter-lsp==2.2.0
jupyter_client==8.6.0
jupyter_core==5.5.0
jupyter_server==2.10.1
jupyter_server_terminals==0.4.4
jupyterlab==4.0.9
jupyterlab-widgets==3.0.9
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.2
kiwisolver==1.4.5
locket==1.0.0
Magics==1.5.8
Markdown==3.5.1
markdown-it-py==3.0.0
MarkupSafe==2.1.3
matplotlib==3.8.2
matplotlib-inline==0.1.6
mdurl==0.1.2
mistune==3.0.2
ml-dtypes==0.3.1
msgpack==1.0.7
multiurl==0.2.3.2
nbclient==0.9.0
nbconvert==7.11.0
nbformat==5.9.2
nest-asyncio==1.5.8
netCDF4==1.6.5
notebook_shim==0.2.3
numpngw==0.1.3
numpy==1.26.2
opt-einsum==3.3.0
optax==0.1.7
orbax-checkpoint==0.4.3
oss2==2.18.2
overrides==7.4.0
packaging==23.2
pandas==2.1.3
pandocfilters==1.5.0
parso==0.8.3
partd==1.4.1
pdbufr==0.11.0
pexpect==4.8.0
Pillow==10.1.0
platformdirs==4.0.0
prometheus-client==0.19.0
prompt-toolkit==3.0.41
protobuf==4.25.1
psutil==5.9.6
ptyprocess==0.7.0
pure-eval==0.2.2
pycparser==2.21
pycryptodome==3.19.0
Pygments==2.17.2
pyodc==1.3.0
pyparsing==3.1.1
pyproj==3.6.1
pyshp==2.3.1
python-dateutil==2.8.2
python-json-logger==2.0.7
pytz==2023.3.post1
PyYAML==6.0.1
pyzmq==25.1.1
referencing==0.31.0
requests==2.31.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.7.0
rpds-py==0.13.1
Rtree==1.1.0
scipy==1.11.4
Send2Trash==1.8.2
shapely==2.0.2
six==1.16.0
sniffio==1.3.0
socksio==1.0.0
soupsieve==2.5
stack-data==0.6.3
tabulate==0.9.0
tensorstore==0.1.50
termcolor==2.3.0
terminado==0.18.0
tinycss2==1.2.1
toolz==0.12.0
tornado==6.3.3
tqdm==4.66.1
traitlets==5.13.0
trimesh==4.0.4
types-python-dateutil==2.8.19.14
typing_extensions==4.8.0
tzdata==2023.3
uri-template==1.3.0
urllib3==2.1.0
wcwidth==0.2.12
webcolors==1.13
webencodings==0.5.1
websocket-client==1.6.4
widgetsnbextension==4.0.9
xarray==2023.11.0
zipp==3.17.0
```

The tmp dir is a symlink to a folder on a mounted disk:

lrwxrwxrwx   1 ringsaturn    ringsaturn      44 Nov 24 11:45 climetlab-ringsaturn -> /mnt/data22/ringsaturn/climetlab-ringsaturn/

which has 32G of free space. The disk status is as follows:

Filesystem                                  Size  Used Avail Use% Mounted on
/dev/sdl1                                   7.3T  6.9T   32G 100% /mnt/data22

ringsaturn commented 11 months ago

I switched to another disk with more free space, and that solved it. It's not a bug in the project.
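
For anyone hitting the same thing, here is a minimal sketch of how one might pick a roomier location before moving the cache. It uses only the Python standard library; `roomiest_directory` is a hypothetical helper, and the candidate paths in the usage comment are placeholders. The chosen directory would then have to be configured as the cache location by following the caching guide linked earlier in this thread.

```python
import shutil
from pathlib import Path
from typing import Iterable, Optional


def roomiest_directory(
    candidates: Iterable[str], min_free_bytes: int = 0
) -> Optional[Path]:
    """Return the existing candidate directory with the most free space.

    Returns None when no candidate exists or none has at least
    min_free_bytes available.
    """
    best: Optional[Path] = None
    best_free = -1
    for candidate in candidates:
        # Follow symlinks so each candidate is measured on its real disk.
        path = Path(candidate).resolve()
        if not path.is_dir():
            continue
        free = shutil.disk_usage(path).free
        if free >= min_free_bytes and free > best_free:
            best, best_free = path, free
    return best


# Usage with placeholder paths, asking for at least 100 GiB free:
# roomiest_directory(["/tmp", "/mnt/some-other-disk"], min_free_bytes=100 * 1024**3)
```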