chaidiscovery / chai-lab

Chai-1, SOTA model for biomolecular structure prediction
https://www.chaidiscovery.com

Error Loading Model in Offline Docker Container Despite Pre-Downloaded Weights #92

Open Nicholas-Freitas opened 1 hour ago

Nicholas-Freitas commented 1 hour ago

Hello all,

I'm running Chai-1 in a Docker container on a node with no internet access (a security requirement on my university's compute cluster). I've pre-downloaded the model weights and the ESM embedding weights into a directory inside the container, which I point to with the CHAI_DOWNLOADS_DIR environment variable.

When I run the predict_structure.py example in the Docker container with network access, it runs to completion and doesn't appear to download anything (I've checked the files in the download directory). However, if I launch the container with no network connection using the --network=none flag, I get the error pasted in full below.

The error suggests it's unable to find the ESM model, despite it being downloaded in the default location, $CHAI_DOWNLOADS_DIR/esm/models--facebook--esm2_t36_3B_UR50D. Could this be because esm.py is expecting the model to be saved under $CHAI_DOWNLOADS_DIR/esm/facebook/esm2_t36_3B_UR50D, which is different from the actual download location?

https://github.com/chaidiscovery/chai-lab/blob/306f53c7f45a6dd082aabd0d82a36d64fcf51f82/chai_lab/data/dataset/embeddings/esm.py#L57
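For context on the directory naming: the `models--facebook--esm2_t36_3B_UR50D` folder matches the usual Hugging Face Hub cache layout, where the repo id has `/` replaced by `--`, is prefixed with `models--`, and holds the actual files under a `snapshots/` subfolder. The sketch below is a small check for that layout; the helper names are hypothetical and not part of chai-lab, and it assumes the cache was written by a reasonably recent `huggingface_hub`.

```python
from pathlib import Path


def hub_cache_dir_name(repo_id: str) -> str:
    """Folder name the Hugging Face Hub cache uses for a model repo id."""
    # "facebook/esm2_t36_3B_UR50D" becomes "models--facebook--esm2_t36_3B_UR50D"
    return "--".join(["models", *repo_id.split("/")])


def find_cached_snapshots(cache_dir: str, repo_id: str) -> list[Path]:
    """Return snapshot folders found for repo_id under cache_dir (possibly empty)."""
    snapshots = Path(cache_dir) / hub_cache_dir_name(repo_id) / "snapshots"
    if not snapshots.is_dir():
        return []
    return sorted(p for p in snapshots.iterdir() if p.is_dir())


if __name__ == "__main__":
    import os

    # Assumption: ESM weights live under $CHAI_DOWNLOADS_DIR/esm, as described above.
    esm_cache = os.path.join(os.environ.get("CHAI_DOWNLOADS_DIR", "."), "esm")
    for snapshot in find_cached_snapshots(esm_cache, "facebook/esm2_t36_3B_UR50D"):
        print(snapshot, sorted(f.name for f in snapshot.iterdir()))
```

If this lists a snapshot containing the weight files, the download location itself is probably not the problem.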

Here's the full error message, thanks for the help!


/usr/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/urllib3/connection.py", line 196, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/socket.py", line 961, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
socket.gaierror: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/urllib3/connectionpool.py", line 789, in urlopen
    response = self._make_request(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/urllib3/connectionpool.py", line 490, in _make_request
    raise new_e
  File "/usr/lib/python3.11/site-packages/urllib3/connectionpool.py", line 466, in _make_request
    self._validate_conn(conn)
  File "/usr/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1095, in _validate_conn
    conn.connect()
  File "/usr/lib/python3.11/site-packages/urllib3/connection.py", line 615, in connect
    self.sock = sock = self._new_conn()
                       ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/urllib3/connection.py", line 203, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7fd8e1236b10>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.11/site-packages/requests/adapters.py", line 667, in send
    resp = conn.urlopen(
           ^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/urllib3/connectionpool.py", line 843, in urlopen
    retries = retries.increment(
              ^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/urllib3/util/retry.py", line 519, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /facebook/esm2_t36_3B_UR50D/resolve/main/model.safetensors.index.json (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fd8e1236b10>: Failed to resolve 'huggingface.co' ([Errno -3] Temporary failure in name resolution)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/example_run.py", line 27, in <module>
    output_cif_paths = run_inference(
                       ^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/chai_lab/chai1.py", line 296, in run_inference
    embedding_context = get_esm_embedding_context(chains, device=device)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/jaxtyping/_decorator.py", line 522, in wrapped_fn
    return wrapped_fn_impl(args, kwargs, bound, memos)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/jaxtyping/_decorator.py", line 449, in wrapped_fn_impl
    out = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/chai_lab/data/dataset/embeddings/esm.py", line 84, in get_esm_embedding_context
    protein_seq2emb_context = _get_esm_contexts_for_sequences(
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/chai_lab/data/dataset/embeddings/esm.py", line 63, in _get_esm_contexts_for_sequences
    with esm_model(model_name=model_name, device=device) as model:
  File "/usr/lib/python3.11/contextlib.py", line 137, in __enter__
    return next(self.gen)
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/chai_lab/data/dataset/embeddings/esm.py", line 38, in esm_model
    EsmModel.from_pretrained(model_name, cache_dir=esm_cache_folder)
  File "/usr/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3616, in from_pretrained
    if not has_file(pretrained_model_name_or_path, safe_weights_name, **has_file_kwargs):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/transformers/utils/hub.py", line 655, in has_file
    response = get_session().head(
               ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/requests/sessions.py", line 624, in head
    return self.request("HEAD", url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/huggingface_hub/utils/_http.py", line 93, in send
    return super().send(request, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/requests/adapters.py", line 700, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: (MaxRetryError('HTTPSConnectionPool(host=\'huggingface.co\', port=443): Max retries exceeded with url: /facebook/esm2_t36_3B_UR50D/resolve/main/model.safetensors.index.json (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7fd8e1236b10>: Failed to resolve \'huggingface.co\' ([Errno -3] Temporary failure in name resolution)"))'), '(Request ID: c19518a7-d773-417a-91eb-f5829148d925)')

arogozhnikov commented 1 hour ago

Hi Nicholas,

Thanks for the detailed issue report.

If I read your logs correctly, Hugging Face first tries to establish a connection to confirm your download, because the failure happens during this check:

    if not has_file(pretrained_model_name_or_path, safe_weights_name, **has_file_kwargs):
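One thing that may be worth trying is to ask the Hugging Face libraries not to contact the Hub at all. The sketch below is an assumption-laden illustration, not a confirmed fix: `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are documented environment variables and `local_files_only=True` is a documented `from_pretrained` argument, but whether they bypass this particular check depends on the installed `transformers` version. The function names here are hypothetical, and chai-lab's own `esm.py` does not currently pass `local_files_only`.

```python
import os
from collections.abc import MutableMapping


def enable_hf_offline_mode(env: MutableMapping[str, str] = os.environ) -> None:
    """Set the offline switches read by huggingface_hub and transformers.

    These are read when the libraries are imported, so call this (or export
    the variables in the container) before importing transformers.
    """
    env["HF_HUB_OFFLINE"] = "1"
    env["TRANSFORMERS_OFFLINE"] = "1"


def load_esm_offline(model_name: str, cache_dir: str):
    """Load an ESM model strictly from the local cache (hypothetical helper)."""
    enable_hf_offline_mode()
    # Imported lazily so the environment variables above are already set.
    from transformers import EsmModel

    return EsmModel.from_pretrained(
        model_name,
        cache_dir=cache_dir,
        local_files_only=True,  # never fall back to a network request
    )
```

The same variables can also be set when starting the container, for example by passing them as environment variables alongside `--network=none`, which avoids touching any Python code.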