ecmwf / climetlab

Python package for easy access to weather and climate data
Apache License 2.0

`load_source` does not work in my system with `'url'` data source #47

Closed iago-pssjd closed 1 year ago

iago-pssjd commented 1 year ago

Updated

When I try to execute the following instruction

import climetlab as cml

ds = cml.load_source('url', 'https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib')

on my computer (OS: Debian 11), I get the following output / error message:

test.grib:   0%|          | 0.00/1.03k [00:00<?, ?B/s]
CliMetLab cache: trying to free 14.9 GiB
Deleting entry {
    "path": "/tmp/climetlab-iago/grib-index-213de367fa7e1865472d2aaf9a729f7894defcf8f4ced5a23622f24928f06e5d.json",
    "owner": "grib-index",
    "args": [
        "/tmp/climetlab-iago/url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib",
        1677685539.2853959,
        1677685539.2813954,
        1052,
        0
    ],
    "creation_date": "2023-03-01 16:45:40.403550",
    "flags": 0,
    "owner_data": null,
    "last_access": "2023-03-01 16:45:40.403550",
    "type": "file",
    "parent": null,
    "replaced": null,
    "extra": null,
    "expires": null,
    "accesses": 1,
    "size": 4
}
CliMetLab cache: deleting /tmp/climetlab-iago/grib-index-213de367fa7e1865472d2aaf9a729f7894defcf8f4ced5a23622f24928f06e5d.json (4)
CliMetLab cache: grib-index ["/tmp/climetlab-iago/url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib", 1677685539.2853959, 1677685539.2813954, 1052, 0]
CliMetLab cache: could not free 14.9 GiB
CliMetLab cache: trying to free 14.9 GiB
Deleting entry {
    "path": "/tmp/climetlab-iago/url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib",
    "owner": "url",
    "args": {
        "url": "https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib",
        "parts": null
    },
    "creation_date": "2023-03-01 16:47:48.942739",
    "flags": 0,
    "owner_data": {
        "connection": "keep-alive",
        "content-length": "1052",
        "cache-control": "max-age=300",
        "content-security-policy": "default-src 'none'; style-src 'unsafe-inline'; sandbox",
        "content-type": "application/octet-stream",
        "etag": "W/\"2bd5b56b1c0727c2971a7d94f9c3f22c13a72f1d78388827fc1261b2a9530e42\"",
        "strict-transport-security": "max-age=31536000",
        "x-content-type-options": "nosniff",
        "x-frame-options": "deny",
        "x-xss-protection": "1; mode=block",
        "x-github-request-id": "206A:0F3A:12C19C7:140943B:63FF7322",
        "accept-ranges": "bytes",
        "date": "Wed, 01 Mar 2023 15:47:49 GMT",
        "via": "1.1 varnish",
        "x-served-by": "cache-mad22020-MAD",
        "x-cache": "HIT",
        "x-cache-hits": "1",
        "x-timer": "S1677685669.193550,VS0,VE1",
        "vary": "Authorization,Accept-Encoding,Origin",
        "access-control-allow-origin": "*",
        "x-fastly-request-id": "3408d1c8e5cf268f976443336503d5442163d118",
        "expires": "Wed, 01 Mar 2023 15:52:49 GMT",
        "source-age": "131"
    },
    "last_access": "2023-03-01 16:47:48.942739",
    "type": "file",
    "parent": null,
    "replaced": null,
    "extra": null,
    "expires": null,
    "accesses": 1,
    "size": 1052
}
CliMetLab cache: deleting /tmp/climetlab-iago/url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib (1 KiB)
CliMetLab cache: url {"url": "https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib", "parts": null}
CliMetLab cache: could not free 14.9 GiB

And if I try

URL = "https://www.ncei.noaa.gov/data/international-best-track-archive-for-climate-stewardship-ibtracs/v04r00/access/csv/ibtracs.SP.list.v04r00.csv"
data = cml.load_source("url", URL)

then I get

ibtracs.SP.list.v04r00.csv:   0%|          | 0.00/33.2M [00:00<?, ?B/s]
CliMetLab cache: trying to free 14.9 GiB
Deleting entry {
    "path": "/tmp/climetlab-iago/grib-index-4a7b3dd2dd0a13c559af337be1026033c5f30b222383355c46c7fe2bb36a2b73.json",
    "owner": "grib-index",
    "args": [
        "/tmp/climetlab-iago/url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib",
        1677685669.5551624,
        1677685669.551163,
        1052,
        0
    ],
    "creation_date": "2023-03-01 16:47:49.630884",
    "flags": 0,
    "owner_data": null,
    "last_access": "2023-03-01 16:47:49.630884",
    "type": "file",
    "parent": null,
    "replaced": null,
    "extra": null,
    "expires": null,
    "accesses": 1,
    "size": 4
}
CliMetLab cache: deleting /tmp/climetlab-iago/grib-index-4a7b3dd2dd0a13c559af337be1026033c5f30b222383355c46c7fe2bb36a2b73.json (4)
CliMetLab cache: grib-index ["/tmp/climetlab-iago/url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib", 1677685669.5551624, 1677685669.551163, 1052, 0]
CliMetLab cache: could not free 14.9 GiB

Furthermore, every call to cml.load_source produces the message CliMetLab cache: could not free 14.9 GiB

What might be the issue?

Thank you!

iago-pssjd commented 1 year ago

I realize now that the output messages depend on previously run code (I have run these instructions several times). I cannot run it on the computer I am using right now, but I will try to produce cleaner outputs this afternoon.

iago-pssjd commented 1 year ago

cc @floriankrb (I see you are also involved in https://github.com/ecmwf-projects/mooc-machine-learning-weather-climate, which is where I came across this). I traced cml.load_source('url', 'https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib') both on my computer, where it does not work, and in Deepnote, where it works. Below I copy the first lines of each profile, where the behaviour already diverges; the diverging blocks are separated by blank lines:

Input

import climetlab as cml
import cProfile
cProfile.run("cml.load_source('url', 'https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib')", sort='cumulative')

Output

Deepnote (works)

         15827 function calls (15819 primitive calls) in 0.271 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.271    0.271 {built-in method builtins.exec}
        1    0.000    0.000    0.271    0.271 <string>:1(<module>)
        1    0.000    0.000    0.271    0.271 __init__.py:155(load_source)
        2    0.000    0.000    0.270    0.135 __init__.py:18(__call__)
        2    0.000    0.000    0.269    0.135 caching.py:620(cache_file)

        1    0.000    0.000    0.266    0.266 __init__.py:131(__call__)
        1    0.000    0.000    0.266    0.266 url.py:107(__init__)
        1    0.000    0.000    0.265    0.265 __init__.py:51(cache_file)

        1    0.000    0.000    0.262    0.262 url.py:175(out_of_date)
        1    0.000    0.000    0.262    0.262 http.py:141(out_of_date)
        1    0.000    0.000    0.261    0.261 http.py:62(headers)
        1    0.000    0.000    0.261    0.261 http.py:462(wrapped)
        1    0.000    0.000    0.261    0.261 api.py:88(head)
        1    0.000    0.000    0.261    0.261 api.py:14(request)
        1    0.000    0.000    0.259    0.259 sessions.py:500(request)
      2/1    0.000    0.000    0.258    0.258 sessions.py:671(send)
        2    0.000    0.000    0.256    0.128 adapters.py:436(send)
        2    0.000    0.000    0.255    0.127 connectionpool.py:522(urlopen)
        2    0.000    0.000    0.254    0.127 connectionpool.py:361(_make_request)
        2    0.000    0.000    0.194    0.097 client.py:1333(getresponse)
        2    0.000    0.000    0.194    0.097 client.py:313(begin)
       42    0.000    0.000    0.193    0.005 {method 'readline' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.193    0.097 client.py:280(_read_status)
        4    0.000    0.000    0.193    0.048 socket.py:690(readinto)
        4    0.000    0.000    0.193    0.048 ssl.py:1231(recv_into)
        4    0.000    0.000    0.193    0.048 ssl.py:1091(read)
        4    0.193    0.048    0.193    0.048 {method 'read' of '_ssl._SSLSocket' objects}
        1    0.000    0.000    0.169    0.169 sessions.py:723(<listcomp>)
      3/2    0.000    0.000    0.169    0.085 sessions.py:159(resolve_redirects)
        2    0.000    0.000    0.059    0.030 connectionpool.py:1034(_validate_conn)
        2    0.000    0.000    0.059    0.030 connection.py:356(connect)
        2    0.000    0.000    0.033    0.017 connection.py:161(_new_conn)
        2    0.000    0.000    0.033    0.017 connection.py:37(create_connection)

My computer (Debian 11) (fails)

CliMetLab cache: trying to free 14.9 GiB
CliMetLab cache: could not free 14.9 GiB
         262886 function calls (257495 primitive calls) in 5.904 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     89/1    0.000    0.000    5.904    5.904 {built-in method builtins.exec}
        1    0.000    0.000    5.904    5.904 <string>:1(<module>)
        1    0.000    0.000    5.904    5.904 __init__.py:155(load_source)
        2    0.000    0.000    5.682    2.841 __init__.py:18(__call__)
        2    0.000    0.000    5.678    2.839 caching.py:620(cache_file)

        4    0.000    0.000    5.100    1.275 caching.py:101(wrapped)
        5    0.000    0.000    5.099    1.020 threading.py:280(wait)
        4    0.000    0.000    5.099    1.275 caching.py:139(result)
      107    5.099    0.048    5.099    0.048 {method 'acquire' of '_thread.lock' objects}
CliMetLab cache: trying to free 14.9 GiB

        1    0.000    0.000    5.040    5.040 __init__.py:131(__call__)
        1    0.000    0.000    4.986    4.986 url.py:107(__init__)
        1    0.000    0.000    4.985    4.985 __init__.py:51(cache_file)

        1    0.000    0.000    0.863    0.863 file.py:40(mutate)
        1    0.000    0.000    0.863    0.863 file.py:70(_reader)
        1    0.000    0.000    0.863    0.863 __init__.py:115(reader)
        1    0.000    0.000    0.844    0.844 __init__.py:14(reader)
        1    0.000    0.000    0.696    0.696 reader.py:21(__init__)
        1    0.000    0.000    0.696    0.696 index.py:356(__init__)
        1    0.000    0.000    0.695    0.695 caching.py:702(auxiliary_cache_file)
        1    0.000    0.000    0.576    0.576 url.py:161(download)
        1    0.000    0.000    0.576    0.576 base.py:109(download)
        2    0.000    0.000    0.566    0.283 http.py:462(wrapped)
        2    0.000    0.000    0.566    0.283 api.py:16(request)
Deleting entry {
    "path": "/tmp/climetlab-iago/url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib",
    "owner": "url",
    "args": {
        "url": "https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib",
        "parts": null
    },
    "creation_date": "2023-03-02 09:46:00.308167",
    "flags": 0,
    "owner_data": {
        "connection": "keep-alive",
        "content-length": "1052",
        "cache-control": "max-age=300",
        "content-security-policy": "default-src 'none'; style-src 'unsafe-inline'; sandbox",
        "content-type": "application/octet-stream",
        "etag": "W/\"2bd5b56b1c0727c2971a7d94f9c3f22c13a72f1d78388827fc1261b2a9530e42\"",
        "strict-transport-security": "max-age=31536000",
        "x-content-type-options": "nosniff",
        "x-frame-options": "deny",
        "x-xss-protection": "1; mode=block",
        "x-github-request-id": "17FE:1218:1E5A1E4:2082EA5:640045C8",
        "accept-ranges": "bytes",
        "date": "Thu, 02 Mar 2023 08:46:01 GMT",
        "via": "1.1 varnish",
        "x-served-by": "cache-mad22078-MAD",
        "x-cache": "HIT",
        "x-cache-hits": "1",
        "x-timer": "S1677746762.543576,VS0,VE1",
        "vary": "Authorization,Accept-Encoding,Origin",
        "access-control-allow-origin": "*",
        "x-fastly-request-id": "47f21b8841a9a58e4a862c424aecee6504733313",
        "expires": "Thu, 02 Mar 2023 08:51:01 GMT",
        "source-age": "68"
    },
    "last_access": "2023-03-02 09:46:00.308167",
    "type": "file",
    "parent": null,
    "replaced": null,
    "extra": null,
    "expires": null,
    "accesses": 1,
    "size": 1052
}
        2    0.000    0.000    0.561    0.280 sessions.py:470(request)
      4/2    0.000    0.000    0.556    0.278 sessions.py:626(send)
        4    0.000    0.000    0.544    0.136 adapters.py:394(send)
        4    0.000    0.000    0.538    0.134 connectionpool.py:522(urlopen)
        4    0.000    0.000    0.534    0.133 connectionpool.py:361(_make_request)
        4    0.000    0.000    0.389    0.097 connectionpool.py:1034(_validate_conn)
        4    0.001    0.000    0.389    0.097 connection.py:356(connect)
        1    0.000    0.000    0.294    0.294 http.py:249(estimate_size)
        3    0.000    0.000    0.294    0.098 http.py:62(headers)
CliMetLab cache: deleting /tmp/climetlab-iago/url-15280dbd4547333ede9ffec63d6959450329b9c003a148969685679b82657cba.grib (1 KiB)
        1    0.000    0.000    0.293    0.293 api.py:92(head)
CliMetLab cache: url {"url": "https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib", "parts": null}
        1    0.000    0.000    0.279    0.279 http.py:119(transfer)
        1    0.000    0.000    0.273    0.273 http.py:286(make_stream)
        1    0.000    0.000    0.273    0.273 http.py:212(issue_request)
        1    0.000    0.000    0.273    0.273 api.py:64(get)
        2    0.000    0.000    0.271    0.135 __init__.py:13(<module>)
        4    0.000    0.000    0.217    0.054 ssl_.py:355(ssl_wrap_socket)
    90/21    0.001    0.000    0.172    0.008 <frozen importlib._bootstrap>:1002(_find_and_load)
    90/21    0.000    0.000    0.171    0.008 <frozen importlib._bootstrap>:967(_find_and_load_unlocked)
    89/21    0.001    0.000    0.167    0.008 <frozen importlib._bootstrap>:659(_load_unlocked)
    86/21    0.000    0.000    0.165    0.008 <frozen importlib._bootstrap_external>:784(exec_module)
    99/21    0.000    0.000    0.159    0.008 <frozen importlib._bootstrap>:220(_call_with_frames_removed)
        4    0.000    0.000    0.158    0.039 connection.py:161(_new_conn)
        4    0.000    0.000    0.158    0.039 connection.py:37(create_connection)

iago-pssjd commented 1 year ago

Update:

I solved the issue by increasing maximum-cache-disk-usage. However, the description of that setting reads:

"Disk usage threshold after which CliMetLab expires older cached entries (% of the full disk capacity)."

"When CliMetLab cache disk usage goes above this limit, CliMetLab triggers its cache cleaning mechanism before downloading additional data."

So is the underlying issue that CliMetLab is not able to expire older cache entries (CliMetLab cache: could not free 14.9 GiB)?
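
For readers who prefer to apply this workaround from Python rather than from the command line, here is a minimal sketch. It assumes that climetlab.settings.set accepts the maximum-cache-disk-usage key with a percentage string such as "95%"; that has not been verified against every CliMetLab version. The helper names as_percent_setting and raise_cache_threshold are made up for illustration.

```python
def as_percent_setting(percent: int) -> str:
    """Format an integer such as 95 as the string '95%'."""
    if not 0 < percent <= 100:
        raise ValueError(f"percent must be in 1..100, got {percent}")
    return f"{percent}%"


def raise_cache_threshold(percent: int) -> None:
    """Set CliMetLab's disk-usage threshold (assumed settings API)."""
    # Imported lazily so that as_percent_setting works without climetlab installed.
    import climetlab as cml

    cml.settings.set("maximum-cache-disk-usage", as_percent_setting(percent))


# Example (requires climetlab to be installed):
# raise_cache_threshold(95)
```

The value should be higher than the current disk usage, otherwise the cache cleaning is presumably triggered again on the next download.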

floriankrb commented 1 year ago

Yes, it looks like there is some issue cleaning the cache. Perhaps you updated to a more recent version of climetlab, or updated some of its dependencies? To solve this, you can run $ climetlab cache and try to find and delete the 14.9 GiB entry.

If nothing works, climetlab decache --all will clean the cache completely.

If even this fails, you could delete the cache folder directly: $ climetlab settings cache-directory will give you the cache directory (it seems to be /tmp/climetlab-iago in your case). Then delete the folder manually (with rm).
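
Before deleting the folder by hand, it may help to check what is actually taking up space in it. The following is a small standard-library sketch, not part of CliMetLab; the function name largest_files is hypothetical, and the path is the one reported in this thread.

```python
import os


def largest_files(directory: str, top: int = 10) -> list[tuple[int, str]]:
    """Return up to `top` (size_in_bytes, path) pairs, largest first."""
    sizes = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(sizes, reverse=True)[:top]


if __name__ == "__main__":
    cache_dir = "/tmp/climetlab-iago"  # cache directory reported in this thread
    for size, path in largest_files(cache_dir):
        print(f"{size / 2**20:10.2f} MiB  {path}")
```

If the listing shows only small files, the cache itself is probably not what is filling the disk.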

iago-pssjd commented 1 year ago

@floriankrb

Thanks for your answer. I did try what you suggest: I removed the cache completely before executing cml.load_source('url', 'https://github.com/ecmwf/climetlab/raw/main/docs/examples/test.grib'), and the output was the one shown above in my first comment (at that point my computer's disk usage was above the default maximum-cache-disk-usage = 90%).

To get it working I had to replace maximum-cache-disk-usage with a percentage higher than my current disk usage.
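
A rough way to see why this helps is sketched below. It assumes, based only on the setting's description ("% of the full disk capacity"), that the threshold is compared against the usage of the whole disk rather than the size of the cache alone; the exact computation inside CliMetLab is not confirmed here. Under that assumption, a nearly empty cache cannot free the excess, which would be consistent with the "could not free 14.9 GiB" message. The function names are hypothetical.

```python
import shutil


def disk_usage_percent(path: str) -> float:
    """Percentage of the disk holding `path` that is currently used."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def bytes_to_free(total: int, used: int, threshold_percent: float) -> int:
    """Bytes that must be freed for usage to drop to the threshold (0 if already below)."""
    allowed = total * threshold_percent / 100
    return max(0, int(used - allowed))


if __name__ == "__main__":
    path = "."  # replace with the cache directory, e.g. the one reported in this thread
    usage = shutil.disk_usage(path)
    print(f"disk usage: {disk_usage_percent(path):.1f}%")
    excess = bytes_to_free(usage.total, usage.used, 90)  # 90% is the default from this thread
    print(f"excess over a 90% threshold: {excess / 2**30:.1f} GiB")
```

If the reported excess is larger than the total size of the cache directory, clearing the cache cannot bring usage under the threshold; either space has to be freed elsewhere on the disk or the threshold has to be raised.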

On the other hand, I ran into this issue while trying https://github.com/ecmwf-projects/mooc-machine-learning-weather-climate/blob/main/tier_2/data_handling/01-accessing-data.ipynb. Thankfully, notebooks 2 and 3 of the same series gave me a better understanding of these issues and led me to the solution.