coiled / feedback

A place to provide Coiled feedback

Cluster pulling software environment image with wrong hash #98

Closed jrbourbeau closed 3 years ago

jrbourbeau commented 3 years ago

Some users reported offline that specifying an AWS region when creating a software environment and a cluster doesn't work as expected. Here's a minimal example:

import coiled

# Create a software environment in AWS us-east-1 region
coiled.create_software_environment(
    name="test-region",
    pip=["dask", "distributed"],
    backend_options={"region": "us-east-1"},
)

# Create a cluster in AWS us-east-1 region
cluster = coiled.Cluster(
    software="test-region",
    backend_options={"region": "us-east-1"},
)

The initial software environment creation step works as expected, and in AWS ECR I can see the corresponding image in us-east-1. Note that internally each software environment image gets a unique tag associated with it. For this particular image the tag is da5defc0-a00a-4c6b-ac04-ae6f8dccc573, which matches what I see in ECR in us-east-1.

Initial software environment creation output:

```
Updating software environment...
Solving conda environment...
Conda environment solved!
Building Docker image (this takes a few minutes)
STEP 1: FROM coiled/default:sha-af843e5
STEP 2: COPY environment.yml environment.yml
--> Using cache 633b2a92fa6419f503fabd7677086c3d302dc151908f71770a6ef8d699dc2ce8
--> 633b2a92fa6
STEP 3: RUN conda env update -n base -f environment.yml && rm environment.yml && conda clean --all -y && echo "conda activate base" >> ~/.bashrc
--> Using cache 96355762aff760e2053d74e3c3a49fdc6ff78d13bb926fd12ae47bb2db049590
--> 96355762aff
STEP 4: SHELL ["conda", "run", "-n", "base", "/bin/bash", "-c"]
--> Using cache fb5bd66b676f8adf00645bdfe8556edb33b5331e3e16b66e9355f634669d1d24
--> fb5bd66b676
STEP 5: COPY requirements.txt requirements.txt
--> 51969745cbb
STEP 6: RUN pip install -r requirements.txt && rm requirements.txt
Collecting dask
Downloading dask-2020.12.0-py3-none-any.whl (884 kB)
Collecting distributed
Downloading distributed-2020.12.0-py3-none-any.whl (669 kB)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.8/site-packages (from distributed->-r requirements.txt (line 2)) (51.1.2.post20210112)
Collecting click>=6.6
Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting cloudpickle>=1.5.0
Downloading cloudpickle-1.6.0-py3-none-any.whl (23 kB)
Collecting msgpack>=0.6.0
Downloading msgpack-1.0.2-cp38-cp38-manylinux1_x86_64.whl (302 kB)
Collecting psutil>=5.0
Downloading psutil-5.8.0-cp38-cp38-manylinux2010_x86_64.whl (296 kB)
Collecting sortedcontainers!=2.0.0,!=2.0.1
Downloading sortedcontainers-2.3.0-py2.py3-none-any.whl (29 kB)
Collecting tblib>=1.6.0
Downloading tblib-1.7.0-py2.py3-none-any.whl (12 kB)
Collecting toolz>=0.8.2
Downloading toolz-0.11.1-py3-none-any.whl (55 kB)
Collecting tornado>=6.0.3
Downloading tornado-6.1-cp38-cp38-manylinux2010_x86_64.whl (427 kB)
Collecting zict>=0.1.3
Downloading zict-2.0.0-py3-none-any.whl (10 kB)
Collecting heapdict
Downloading HeapDict-1.0.1-py3-none-any.whl (3.9 kB)
Collecting pyyaml
Downloading PyYAML-5.3.1.tar.gz (269 kB)
Building wheels for collected packages: pyyaml
Building wheel for pyyaml (setup.py): started
Building wheel for pyyaml (setup.py): finished with status 'done'
Created wheel for pyyaml: filename=PyYAML-5.3.1-cp38-cp38-linux_x86_64.whl size=44618 sha256=c561f874e7d2d6b89e2f36283050fa29d949d2f31d7bd9c84ce40ed170f71c60
Stored in directory: /root/.cache/pip/wheels/13/90/db/290ab3a34f2ef0b5a0f89235dc2d40fea83e77de84ed2dc05c
Successfully built pyyaml
Installing collected packages: pyyaml, heapdict, zict, tornado, toolz, tblib, sortedcontainers, psutil, msgpack, dask, cloudpickle, click, distributed
Successfully installed click-7.1.2 cloudpickle-1.6.0 dask-2020.12.0 distributed-2020.12.0 heapdict-1.0.1 msgpack-1.0.2 psutil-5.8.0 pyyaml-5.3.1 sortedcontainers-2.3.0 tblib-1.7.0 toolz-0.11.1 tornado-6.1 zict-2.0.0
STEP 7: COMMIT da5defc0-a00a-4c6b-ac04-ae6f8dccc573
--> 5faad3abc31
5faad3abc31529a0ebd653cfe6fd6172b284c73459c761773284ebd3a1efada9
Docker build succeeded: da5defc0-a00a-4c6b-ac04-ae6f8dccc573
Uploading image
Getting image source signatures
Copying blob sha256:fd6130bf640d70726907154d1b535aebac02b53812df010f061f99d0b463779c
Copying blob sha256:42851133c65c2a965b74a272b1737184e3a0b40c2b02ea29a6a9e9dc45d43971
Copying blob sha256:f2cb0ecef392f2a630fa1205b874ab2e2aedf96de04d0b8838e4e728e28142da
Copying blob sha256:875120aa853cf59c6c5bc24af9f448a55f9b64db0bab58c9ee18f8a92ed8ac33
Copying blob sha256:fcd8d39597dd39d0c68670479e4d240fa9dba04a1246587384df9e1aa31b17d4
Copying blob sha256:33f493021a41ddb00dc033aaead880574243c055812e29d5ac785a34c4928648
Copying blob sha256:995d7fa2ff2cc9010b9aa7de17dd529d67b92a9d952eb2e0fd8e65b352ad7ed8
Copying blob sha256:a6ad50d7634d107ba35b7397f87a12a6061b7d600e59374a5a93399a5b28940b
Copying config sha256:5faad3abc31529a0ebd653cfe6fd6172b284c73459c761773284ebd3a1efada9
Writing manifest to image destination
Storing signatures
Finished updating environment
```

However, when we get to the cluster creation process, Coiled is not able to find the image for the requested software environment and instead rebuilds the software environment.

Cluster creation output:

```
Creating Cluster. This takes about a minute ...Checking environment images
Software environment not found, rebuilding.
Building Docker image (this takes a few minutes)
STEP 1: FROM coiled/default:sha-af843e5
STEP 2: COPY environment.yml environment.yml
--> b7361a2c57e
STEP 3: RUN conda env update -n base -f environment.yml && rm environment.yml && conda clean --all -y && echo "conda activate base" >> ~/.bashrc
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Downloading and Extracting Packages
ruamel_yaml-0.15.87 | 259 KB | ########## | 100%
python-3.8.5 | 49.3 MB | ########## | 100%
tqdm-4.42.1 | 56 KB | ########## | 100%
certifi-2020.12.5 | 141 KB | ########## | 100%
conda-4.9.2 | 2.9 MB | ########## | 100%
libffi-3.3 | 50 KB | ########## | 100%
pysocks-1.7.1 | 31 KB | ########## | 100%
six-1.15.0 | 27 KB | ########## | 100%
pyopenssl-20.0.1 | 49 KB | ########## | 100%
ncurses-6.2 | 817 KB | ########## | 100%
pycparser-2.20 | 94 KB | ########## | 100%
urllib3-1.26.2 | 105 KB | ########## | 100%
chardet-4.0.0 | 194 KB | ########## | 100%
readline-8.0 | 356 KB | ########## | 100%
conda-package-handli | 886 KB | ########## | 100%
brotlipy-0.7.0 | 323 KB | ########## | 100%
idna-2.10 | 50 KB | ########## | 100%
requests-2.25.1 | 52 KB | ########## | 100%
libedit-3.1.20191231 | 116 KB | ########## | 100%
cryptography-3.3.1 | 566 KB | ########## | 100%
cffi-1.14.4 | 226 KB | ########## | 100%
xz-5.2.5 | 341 KB | ########## | 100%
sqlite-3.33.0 | 1.1 MB | ########## | 100%
openssl-1.1.1i | 2.5 MB | ########## | 100%
setuptools-51.1.2 | 742 KB | ########## | 100%
wheel-0.36.2 | 33 KB | ########## | 100%
pip-20.3.3 | 1.8 MB | ########## | 100%
tk-8.6.10 | 3.0 MB | ########## | 100%
pycosat-0.6.3 | 82 KB | ########## | 100%
ca-certificates-2020 | 121 KB | ########## | 100%
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
#
# To activate this environment, use
#
#     $ conda activate base
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Cache location: /opt/conda/pkgs
Will remove the following tarballs:

/opt/conda/pkgs
---------------
ruamel_yaml-0.15.87-py38h7b6447c_0.conda 259 KB
python-3.8.5-h7579374_1.conda 49.3 MB
tqdm-4.42.1-py_0.conda 56 KB
certifi-2020.12.5-py38h06a4308_0.conda 141 KB
conda-4.9.2-py38h06a4308_0.conda 2.9 MB
libffi-3.3-he6710b0_2.conda 50 KB
pysocks-1.7.1-py38h06a4308_0.conda 31 KB
six-1.15.0-py38h06a4308_0.conda 27 KB
pyopenssl-20.0.1-pyhd3eb1b0_1.conda 49 KB
ncurses-6.2-he6710b0_1.conda 817 KB
pycparser-2.20-py_2.conda 94 KB
urllib3-1.26.2-pyhd3eb1b0_0.conda 105 KB
chardet-4.0.0-py38h06a4308_1003.conda 194 KB
readline-8.0-h7b6447c_0.conda 356 KB
conda-package-handling-1.7.2-py38h03888b9_0.conda 886 KB
brotlipy-0.7.0-py38h27cfd23_1003.conda 323 KB
idna-2.10-py_0.conda 50 KB
requests-2.25.1-pyhd3eb1b0_0.conda 52 KB
libedit-3.1.20191231-h14c3975_1.conda 116 KB
cryptography-3.3.1-py38h3c74f83_0.conda 566 KB
cffi-1.14.4-py38h261ae71_0.conda 226 KB
xz-5.2.5-h7b6447c_0.conda 341 KB
sqlite-3.33.0-h62c20be_0.conda 1.1 MB
openssl-1.1.1i-h27cfd23_0.conda 2.5 MB
setuptools-51.1.2-py38h06a4308_4.conda 742 KB
wheel-0.36.2-pyhd3eb1b0_0.conda 33 KB
pip-20.3.3-py38h06a4308_0.conda 1.8 MB
tk-8.6.10-hbc83047_0.conda 3.0 MB
pycosat-0.6.3-py38h7b6447c_1.conda 82 KB
ca-certificates-2020.12.8-h06a4308_0.conda 121 KB
---------------------------------------------------
Total: 66.1 MB

Removed ruamel_yaml-0.15.87-py38h7b6447c_0.conda
Removed python-3.8.5-h7579374_1.conda
Removed tqdm-4.42.1-py_0.conda
Removed certifi-2020.12.5-py38h06a4308_0.conda
Removed conda-4.9.2-py38h06a4308_0.conda
Removed libffi-3.3-he6710b0_2.conda
Removed pysocks-1.7.1-py38h06a4308_0.conda
Removed six-1.15.0-py38h06a4308_0.conda
Removed pyopenssl-20.0.1-pyhd3eb1b0_1.conda
Removed ncurses-6.2-he6710b0_1.conda
Removed pycparser-2.20-py_2.conda
Removed urllib3-1.26.2-pyhd3eb1b0_0.conda
Removed chardet-4.0.0-py38h06a4308_1003.conda
Removed readline-8.0-h7b6447c_0.conda
Removed conda-package-handling-1.7.2-py38h03888b9_0.conda
Removed brotlipy-0.7.0-py38h27cfd23_1003.conda
Removed idna-2.10-py_0.conda
Removed requests-2.25.1-pyhd3eb1b0_0.conda
Removed libedit-3.1.20191231-h14c3975_1.conda
Removed cryptography-3.3.1-py38h3c74f83_0.conda
Removed cffi-1.14.4-py38h261ae71_0.conda
Removed xz-5.2.5-h7b6447c_0.conda
Removed sqlite-3.33.0-h62c20be_0.conda
Removed openssl-1.1.1i-h27cfd23_0.conda
Removed setuptools-51.1.2-py38h06a4308_4.conda
Removed wheel-0.36.2-pyhd3eb1b0_0.conda
Removed pip-20.3.3-py38h06a4308_0.conda
Removed tk-8.6.10-hbc83047_0.conda
Removed pycosat-0.6.3-py38h7b6447c_1.conda
Removed ca-certificates-2020.12.8-h06a4308_0.conda
WARNING: /root/.conda/pkgs does not exist
Cache location:
There are no unused packages to remove
--> 0f0a24ac37a
STEP 4: SHELL ["conda", "run", "-n", "base", "/bin/bash", "-c"]
--> 0abc8927c4c
STEP 5: COPY requirements.txt requirements.txt
--> ca0d0588d9a
STEP 6: RUN pip install -r requirements.txt && rm requirements.txt
Collecting dask
Downloading dask-2020.12.0-py3-none-any.whl (884 kB)
Collecting distributed
Downloading distributed-2020.12.0-py3-none-any.whl (669 kB)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.8/site-packages (from distributed->-r requirements.txt (line 2)) (51.1.2.post20210112)
Collecting click>=6.6
Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting cloudpickle>=1.5.0
Downloading cloudpickle-1.6.0-py3-none-any.whl (23 kB)
Collecting msgpack>=0.6.0
Downloading msgpack-1.0.2-cp38-cp38-manylinux1_x86_64.whl (302 kB)
Collecting psutil>=5.0
Downloading psutil-5.8.0-cp38-cp38-manylinux2010_x86_64.whl (296 kB)
Collecting sortedcontainers!=2.0.0,!=2.0.1
Downloading sortedcontainers-2.3.0-py2.py3-none-any.whl (29 kB)
Collecting tblib>=1.6.0
Downloading tblib-1.7.0-py2.py3-none-any.whl (12 kB)
Collecting toolz>=0.8.2
Downloading toolz-0.11.1-py3-none-any.whl (55 kB)
Collecting tornado>=6.0.3
Downloading tornado-6.1-cp38-cp38-manylinux2010_x86_64.whl (427 kB)
Collecting zict>=0.1.3
Downloading zict-2.0.0-py3-none-any.whl (10 kB)
Collecting heapdict
Downloading HeapDict-1.0.1-py3-none-any.whl (3.9 kB)
Collecting pyyaml
Downloading PyYAML-5.3.1.tar.gz (269 kB)
Building wheels for collected packages: pyyaml
Building wheel for pyyaml (setup.py): started
Building wheel for pyyaml (setup.py): finished with status 'done'
Created wheel for pyyaml: filename=PyYAML-5.3.1-cp38-cp38-linux_x86_64.whl size=44618 sha256=eb43f80222895c1622443e94611b5d0c7ebea9beb221fe9fd919b7c7fad6c1ec
Stored in directory: /root/.cache/pip/wheels/13/90/db/290ab3a34f2ef0b5a0f89235dc2d40fea83e77de84ed2dc05c
Successfully built pyyaml
Installing collected packages: pyyaml, heapdict, zict, tornado, toolz, tblib, sortedcontainers, psutil, msgpack, dask, cloudpickle, click, distributed
Successfully installed click-7.1.2 cloudpickle-1.6.0 dask-2020.12.0 distributed-2020.12.0 heapdict-1.0.1 msgpack-1.0.2 psutil-5.8.0 pyyaml-5.3.1 sortedcontainers-2.3.0 tblib-1.7.0 toolz-0.11.1 tornado-6.1 zict-2.0.0
STEP 7: COMMIT b6de8c97-b4bc-4c71-931a-40d8c7f1a3bc
--> af9b5517008
af9b55170085e86605d65b0e9817afd5dd9952065e842cb7b52e3088bc3d0ea6
Completed short name "coiled/default" with unqualified-search registries (origin: /etc/containers/registries.conf)
Getting image source signatures
Copying blob sha256:9c388eb6d33c40662539172f0d9a357287bd1cd171692ca5c08db2886bc527c3
Copying blob sha256:b91f1f6726b6c56b24216f14b6048fe20b111850c4f99c286f7c96bc15f59016
Copying blob sha256:68ced04f60ab5c7a5f1d0b0b4e7572c5a4c8cce44866513d30d9df1a15277d6b
Copying blob sha256:96cf53b3a9dd496f4c91ab86eeadca2c8a31210c2e5c2a82badbb0dcf8c8f76b
Copying config sha256:5240001adf05380912c5d6fb27b70ac234e8e26aceb938cc7b99e6af8f3ebc40
Writing manifest to image destination
Storing signatures
Docker build succeeded: b6de8c97-b4bc-4c71-931a-40d8c7f1a3bc
Uploading image
Getting image source signatures
Copying blob sha256:6576ca3a39c3bf2b3b904f04c7c10c3472cee2cd0c9f18b18f9022920e4ac5d5
Copying blob sha256:42851133c65c2a965b74a272b1737184e3a0b40c2b02ea29a6a9e9dc45d43971
Copying blob sha256:f2cb0ecef392f2a630fa1205b874ab2e2aedf96de04d0b8838e4e728e28142da
Copying blob sha256:875120aa853cf59c6c5bc24af9f448a55f9b64db0bab58c9ee18f8a92ed8ac33
Copying blob sha256:fcd8d39597dd39d0c68670479e4d240fa9dba04a1246587384df9e1aa31b17d4
Copying blob sha256:ac04535705d30e14d533b743604d88dc5565857e0f76b23cb3e8ffae30a2f41e
Copying blob sha256:5f177fb901dd4e1376702d6aa5fc016b921b71bcb3db446ef68a59429fc6fa8b
Copying blob sha256:02b244f8b383765f4ca4a615c4eaf8f688f5f84279608fd9e3d6b7508830a4f7
Copying config sha256:af9b55170085e86605d65b0e9817afd5dd9952065e842cb7b52e3088bc3d0ea6
Writing manifest to image destination
Storing signatures
```

Inspecting ECR again, I found that the newly built image was actually stored in us-east-2 (our default region), not us-east-1. It also has a different tag than the previous image: b6de8c97-b4bc-4c71-931a-40d8c7f1a3bc.

Ultimately there was an error in the cluster creation process and

ValueError: Unable to get security info, cluster status is unexpectedly STOPPED

was raised in the user Python session.

Digging a bit deeper, it turns out that the cluster scheduler and worker tasks were launched in us-east-1 and attempted to pull their container image from our ECR in us-east-1, but they used the tag of the image stored in us-east-2. This mismatch resulted in a CannotPullContainerError for the scheduler and worker tasks.
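
For anyone who wants to check for the same mismatch in their own account, here is a rough sketch of how to see which regional ECR registry actually holds a given tag. It assumes a boto3 ECR client per region and uses the `describe_images` call; the helper names and the repository name in the usage comment are placeholders, not anything Coiled exposes.

```python
def image_tag_exists(ecr_client, repository, tag):
    """Return True if `tag` exists in `repository` in the client's region.

    `ecr_client` is expected to behave like boto3.client("ecr", region_name=...).
    Other errors (for example a missing repository) are left to propagate.
    """
    try:
        ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
    except ecr_client.exceptions.ImageNotFoundException:
        return False
    return True


def regions_with_tag(clients_by_region, repository, tag):
    """Map each region name to whether the tag is present in that region."""
    return {
        region: image_tag_exists(client, repository, tag)
        for region, client in clients_by_region.items()
    }


# Usage sketch (needs boto3 and AWS credentials; "my-repository" is a placeholder):
#
#   import boto3
#   clients = {
#       region: boto3.client("ecr", region_name=region)
#       for region in ("us-east-1", "us-east-2")
#   }
#   regions_with_tag(clients, "my-repository", "b6de8c97-b4bc-4c71-931a-40d8c7f1a3bc")
```

In the situation described above, the rebuilt tag would be expected to show up only under us-east-2, while the tasks in us-east-1 try to pull it from the us-east-1 registry.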

dantheman39 commented 3 years ago

@marcosmoyano if you wouldn't mind, which PR did you say should fix this?

marcosmoyano commented 3 years ago

@dantheman39 https://github.com/coiled/cloud/pull/1427

marcosmoyano commented 3 years ago

@dantheman39 / @jrbourbeau I created a cluster configuration which is an extra step, but I could verify that the fix works. Closing. Please re-open if needed.

jrbourbeau commented 3 years ago

Re-opening as a signal to users that this is still an issue they may encounter (though we've fixed it internally and will be pushing out that fix soon).

jrbourbeau commented 3 years ago

Just confirmed this issue has been resolved with the new coiled 0.0.33 release. Thanks all!
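
If you want to confirm that your local install is at or past that release, something like the following should do. It is only a sketch: the helper names are made up for illustration, and the version parsing is deliberately crude (it reads leading numeric components only, so pre-release suffixes are ignored).

```python
from importlib.metadata import PackageNotFoundError, version

# Release mentioned above as containing the fix.
FIXED_IN = (0, 0, 33)


def parse_release(text):
    """Turn '0.0.33' into (0, 0, 33), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def has_region_fix(installed):
    """Return True if the version string is 0.0.33 or newer."""
    return parse_release(installed) >= FIXED_IN


def installed_coiled_has_fix():
    """Check the locally installed coiled package; None if it is not installed."""
    try:
        return has_region_fix(version("coiled"))
    except PackageNotFoundError:
        return None
```

For stricter comparisons (pre-releases, post-releases), the `packaging` library's version parser is a more robust choice than this hand-rolled tuple.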