NVIDIA / earth2studio

Open-source deep-learning framework for exploring, building and deploying AI weather/climate workflows.
https://nvidia.github.io/earth2studio/
Apache License 2.0

🐛[BUG]: ARCO download timeout #95

Closed jleinonen closed 1 month ago

jleinonen commented 1 month ago

Version

Latest from Github

On which installation method(s) does this occur?

Source

Describe the issue

When trying to run inference with the example workflow from https://github.com/NVIDIA/earth2studio/issues/91#issuecomment-2229512188, I get the following error while the script is downloading data:

2024-07-17 05:28:22.207 | INFO     | earth2studio.run:ensemble:294 - Running ensemble inference!
2024-07-17 05:28:22.247 | INFO     | earth2studio.run:ensemble:302 - Inference device: cuda
2024-07-17 05:28:23.070 | DEBUG    | earth2studio.data.arco:fetch_array:200 - Fetching ARCO zarr array for variable: u10m at 2022-01-01T12:00:00
2024-07-17 05:28:24.931 | DEBUG    | earth2studio.data.arco:fetch_array:200 - Fetching ARCO zarr array for variable: v10m at 2022-01-01T12:00:00
<cut output for many variables>
2024-07-17 05:30:19.831 | DEBUG    | earth2studio.data.arco:fetch_array:200 - Fetching ARCO zarr array for variable: z300 at 2022-01-01T12:00:00
Fetching ARCO data:  49%|████████████████████████▋                         | 36/73 [01:57<02:04,  3.36s/it]Traceback (most recent call last):
  File "/root/earth2studio/earth2studio/data/arco.py", line 172, in create_data_array
    async for t, v, data in unordered_generator(  # type: ignore[misc,unused-ignore]
  File "/root/earth2studio/earth2studio/data/utils.py", line 251, in unordered_generator
    async for task in _limit_concurrency(func_map, limit):
  File "/root/earth2studio/earth2studio/data/utils.py", line 286, in _limit_concurrency
    done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 384, in wait
    return await _wait(fs, timeout, return_when, loop)
  File "/usr/lib/python3.10/asyncio/tasks.py", line 491, in _wait
    await waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/earth2studio-ensemble/ensemble-test.py", line 15, in <module>
    io = ensemble(
  File "/root/earth2studio/earth2studio/run.py", line 308, in ensemble
    x0, coords0 = fetch_data(
  File "/root/earth2studio/earth2studio/data/utils.py", line 70, in fetch_data
    da0 = source(adjust_times, variable)
  File "/root/earth2studio/earth2studio/data/arco.py", line 115, in __call__
    xr_array = asyncio.run(
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/lib/python3.10/asyncio/tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError

It looks like the ARCO data source hardcodes the timeout (https://github.com/NVIDIA/earth2studio/blob/cb1c2306467f013601fc606596a2a5da1de4fa5d/earth2studio/data/arco.py#L85), so maybe that is the issue? If so, making the timeout longer and/or user-configurable would probably solve the problem.
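
To illustrate the failure mode in the traceback, here is a minimal sketch, assuming a single `asyncio.wait_for` timeout wrapped around a concurrency-limited batch of downloads. The names `fake_fetch`, `fetch_all` and `run_with_timeout` are made up for this example and are not earth2studio functions; the real code uses its own limiter in `earth2studio/data/utils.py`.

```python
import asyncio


async def fake_fetch(i: int, delay: float) -> int:
    # Stand-in for one per-variable download; `delay` mimics network latency.
    await asyncio.sleep(delay)
    return i


async def fetch_all(n: int, delay: float, limit: int) -> list[int]:
    # Allow at most `limit` downloads in flight at once (simplified limiter).
    sem = asyncio.Semaphore(limit)

    async def guarded(i: int) -> int:
        async with sem:
            return await fake_fetch(i, delay)

    return await asyncio.gather(*(guarded(i) for i in range(n)))


def run_with_timeout(n: int, delay: float, limit: int, timeout: float) -> str:
    # One timeout covers the whole batch, so a slow connection can hit it
    # even though every individual download is succeeding.
    try:
        asyncio.run(asyncio.wait_for(fetch_all(n, delay, limit), timeout=timeout))
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"
```

With slow enough per-item fetches the whole call fails partway through, which matches the progress bar stopping at 36/73 above.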

NickGeneva commented 1 month ago

Yeah, I put in the fixed timeout to have a cap on the download. I figured 100 s would be long enough for people, but we can increase it to, say, 10 minutes. The idea was to push people away from pulling very large requests into memory in one go and toward splitting them up across multiple calls. But if you are timing out even for just 73 channels, then we can bump it up.
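
Splitting a request into multiple calls could look roughly like the sketch below. It is a generic helper, not part of earth2studio: `fetch_in_batches` and its `fetch` callable are hypothetical names, and how the per-batch results get combined afterwards is left to the caller.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def fetch_in_batches(
    fetch: Callable[[Sequence[str]], T],
    variables: Sequence[str],
    batch_size: int,
) -> list[T]:
    """Call `fetch` once per batch of variables and collect the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(variables), batch_size):
        # Each call only requests a slice of the full variable list,
        # so each call gets its own timeout budget.
        batch = variables[start : start + batch_size]
        results.append(fetch(batch))
    return results
```

Assuming the data source is called as `ds(time, variable)`, as the traceback suggests, `fetch` could be something like `lambda batch: ds(time, list(batch))`.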

Of course, in the meantime you can change it via:

from earth2studio.data import ARCO

ds = ARCO()
ds.async_timeout = 1200  # raise the download timeout (seconds)

NickGeneva commented 1 month ago

If this continues to be a bigger problem, I'll change this to use an environment variable instead, similar to the model packages, but I expect 10 minutes to be sufficient for the models we have.
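
For reference, an environment-variable override might be sketched like this. The variable name `EARTH2STUDIO_ARCO_TIMEOUT` and the helper `resolve_timeout` are hypothetical, chosen only for illustration; nothing in this thread says what the actual name would be.

```python
import os

# Hypothetical variable name, chosen for illustration only.
TIMEOUT_ENV_VAR = "EARTH2STUDIO_ARCO_TIMEOUT"
DEFAULT_TIMEOUT_S = 600  # 10 minutes, the value proposed in this thread


def resolve_timeout(default: int = DEFAULT_TIMEOUT_S) -> int:
    """Return the timeout in seconds, preferring the environment variable."""
    raw = os.environ.get(TIMEOUT_ENV_VAR)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = int(raw)
    except ValueError:
        raise ValueError(
            f"{TIMEOUT_ENV_VAR} must be a whole number of seconds, got {raw!r}"
        ) from None
    if value <= 0:
        raise ValueError(f"{TIMEOUT_ENV_VAR} must be positive, got {value}")
    return value
```

The result could then be assigned to `ds.async_timeout`, so a user on a slow connection can raise the cap without editing code.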

jleinonen commented 1 month ago

Yes, I got a timeout for one timestep and 73 channels. Thanks for the change, that should fix it for me given that I managed to download about 50% of the data before it timed out.