Closed jrbourbeau closed 3 years ago
@marcosmoyano if you wouldn't mind, which PR did you say should fix this?
@dantheman39 https://github.com/coiled/cloud/pull/1427
@dantheman39 / @jrbourbeau I created a cluster configuration which is an extra step, but I could verify that the fix works. Closing. Please re-open if needed.
Re-opening as a signal to users that this is still an issue they may encounter (though we've fixed it internally and will be pushing out that fix soon)
Just confirmed this issue has been resolved with the new coiled 0.0.33 release. Thanks all!
Offline, some users reported an issue where specifying an AWS region when creating a software environment and cluster doesn't work as expected. Here's a minimal example:
The initial software environment creation step works as expected and, going to AWS ECR, I can see the corresponding image in us-east-1. Note that internally each software environment image gets a unique tag associated with it. For this particular image it's `da5defc0-a00a-4c6b-ac04-ae6f8dccc573`, and this matches up with what I see in ECR in us-east-1.

Initial software environment creation output:
```
Updating software environment...
Solving conda environment...
Conda environment solved!
Building Docker image (this takes a few minutes)
STEP 1: FROM coiled/default:sha-af843e5
STEP 2: COPY environment.yml environment.yml
--> Using cache 633b2a92fa6419f503fabd7677086c3d302dc151908f71770a6ef8d699dc2ce8
--> 633b2a92fa6
STEP 3: RUN conda env update -n base -f environment.yml && rm environment.yml && conda clean --all -y && echo "conda activate base" >> ~/.bashrc
--> Using cache 96355762aff760e2053d74e3c3a49fdc6ff78d13bb926fd12ae47bb2db049590
--> 96355762aff
STEP 4: SHELL ["conda", "run", "-n", "base", "/bin/bash", "-c"]
--> Using cache fb5bd66b676f8adf00645bdfe8556edb33b5331e3e16b66e9355f634669d1d24
--> fb5bd66b676
STEP 5: COPY requirements.txt requirements.txt
--> 51969745cbb
STEP 6: RUN pip install -r requirements.txt && rm requirements.txt
Collecting dask
Downloading dask-2020.12.0-py3-none-any.whl (884 kB)
Collecting distributed
Downloading distributed-2020.12.0-py3-none-any.whl (669 kB)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.8/site-packages (from distributed->-r requirements.txt (line 2)) (51.1.2.post20210112)
Collecting click>=6.6
Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting cloudpickle>=1.5.0
Downloading cloudpickle-1.6.0-py3-none-any.whl (23 kB)
Collecting msgpack>=0.6.0
Downloading msgpack-1.0.2-cp38-cp38-manylinux1_x86_64.whl (302 kB)
Collecting psutil>=5.0
Downloading psutil-5.8.0-cp38-cp38-manylinux2010_x86_64.whl (296 kB)
Collecting sortedcontainers!=2.0.0,!=2.0.1
Downloading sortedcontainers-2.3.0-py2.py3-none-any.whl (29 kB)
Collecting tblib>=1.6.0
Downloading tblib-1.7.0-py2.py3-none-any.whl (12 kB)
Collecting toolz>=0.8.2
Downloading toolz-0.11.1-py3-none-any.whl (55 kB)
Collecting tornado>=6.0.3
Downloading tornado-6.1-cp38-cp38-manylinux2010_x86_64.whl (427 kB)
Collecting zict>=0.1.3
Downloading zict-2.0.0-py3-none-any.whl (10 kB)
Collecting heapdict
Downloading HeapDict-1.0.1-py3-none-any.whl (3.9 kB)
Collecting pyyaml
Downloading PyYAML-5.3.1.tar.gz (269 kB)
Building wheels for collected packages: pyyaml
Building wheel for pyyaml (setup.py): started
Building wheel for pyyaml (setup.py): finished with status 'done'
Created wheel for pyyaml: filename=PyYAML-5.3.1-cp38-cp38-linux_x86_64.whl size=44618 sha256=c561f874e7d2d6b89e2f36283050fa29d949d2f31d7bd9c84ce40ed170f71c60
Stored in directory: /root/.cache/pip/wheels/13/90/db/290ab3a34f2ef0b5a0f89235dc2d40fea83e77de84ed2dc05c
Successfully built pyyaml
Installing collected packages: pyyaml, heapdict, zict, tornado, toolz, tblib, sortedcontainers, psutil, msgpack, dask, cloudpickle, click, distributed
Successfully installed click-7.1.2 cloudpickle-1.6.0 dask-2020.12.0 distributed-2020.12.0 heapdict-1.0.1 msgpack-1.0.2 psutil-5.8.0 pyyaml-5.3.1 sortedcontainers-2.3.0 tblib-1.7.0 toolz-0.11.1 tornado-6.1 zict-2.0.0
STEP 7: COMMIT da5defc0-a00a-4c6b-ac04-ae6f8dccc573
--> 5faad3abc31
5faad3abc31529a0ebd653cfe6fd6172b284c73459c761773284ebd3a1efada9
Docker build succeeded: da5defc0-a00a-4c6b-ac04-ae6f8dccc573
Uploading image
Getting image source signatures
Copying blob sha256:fd6130bf640d70726907154d1b535aebac02b53812df010f061f99d0b463779c
Copying blob sha256:42851133c65c2a965b74a272b1737184e3a0b40c2b02ea29a6a9e9dc45d43971
Copying blob sha256:f2cb0ecef392f2a630fa1205b874ab2e2aedf96de04d0b8838e4e728e28142da
Copying blob sha256:875120aa853cf59c6c5bc24af9f448a55f9b64db0bab58c9ee18f8a92ed8ac33
Copying blob sha256:fcd8d39597dd39d0c68670479e4d240fa9dba04a1246587384df9e1aa31b17d4
Copying blob sha256:33f493021a41ddb00dc033aaead880574243c055812e29d5ac785a34c4928648
Copying blob sha256:995d7fa2ff2cc9010b9aa7de17dd529d67b92a9d952eb2e0fd8e65b352ad7ed8
Copying blob sha256:a6ad50d7634d107ba35b7397f87a12a6061b7d600e59374a5a93399a5b28940b
Copying config sha256:5faad3abc31529a0ebd653cfe6fd6172b284c73459c761773284ebd3a1efada9
Writing manifest to image destination
Storing signatures
Finished updating environment
```

However, when we get to the cluster creation process, Coiled is not able to find the image for the requested software environment and instead rebuilds the software environment.
Cluster creation output:
```
Creating Cluster. This takes about a minute ...
Checking environment images
Software environment not found, rebuilding.
Building Docker image (this takes a few minutes)
STEP 1: FROM coiled/default:sha-af843e5
STEP 2: COPY environment.yml environment.yml
--> b7361a2c57e
STEP 3: RUN conda env update -n base -f environment.yml && rm environment.yml && conda clean --all -y && echo "conda activate base" >> ~/.bashrc
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
Downloading and Extracting Packages
ruamel_yaml-0.15.87 | 259 KB | ########## | 100%
python-3.8.5 | 49.3 MB | ########## | 100%
tqdm-4.42.1 | 56 KB | ########## | 100%
certifi-2020.12.5 | 141 KB | ########## | 100%
conda-4.9.2 | 2.9 MB | ########## | 100%
libffi-3.3 | 50 KB | ########## | 100%
pysocks-1.7.1 | 31 KB | ########## | 100%
six-1.15.0 | 27 KB | ########## | 100%
pyopenssl-20.0.1 | 49 KB | ########## | 100%
ncurses-6.2 | 817 KB | ########## | 100%
pycparser-2.20 | 94 KB | ########## | 100%
urllib3-1.26.2 | 105 KB | ########## | 100%
chardet-4.0.0 | 194 KB | ########## | 100%
readline-8.0 | 356 KB | ########## | 100%
conda-package-handli | 886 KB | ########## | 100%
brotlipy-0.7.0 | 323 KB | ########## | 100%
idna-2.10 | 50 KB | ########## | 100%
requests-2.25.1 | 52 KB | ########## | 100%
libedit-3.1.20191231 | 116 KB | ########## | 100%
cryptography-3.3.1 | 566 KB | ########## | 100%
cffi-1.14.4 | 226 KB | ########## | 100%
xz-5.2.5 | 341 KB | ########## | 100%
sqlite-3.33.0 | 1.1 MB | ########## | 100%
openssl-1.1.1i | 2.5 MB | ########## | 100%
setuptools-51.1.2 | 742 KB | ########## | 100%
wheel-0.36.2 | 33 KB | ########## | 100%
pip-20.3.3 | 1.8 MB | ########## | 100%
tk-8.6.10 | 3.0 MB | ########## | 100%
pycosat-0.6.3 | 82 KB | ########## | 100%
ca-certificates-2020 | 121 KB | ########## | 100%
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
#
# To activate this environment, use
#
#     $ conda activate base
#
# To deactivate an active environment, use
#
#     $ conda deactivate

Cache location: /opt/conda/pkgs
Will remove the following tarballs:

/opt/conda/pkgs
---------------
ruamel_yaml-0.15.87-py38h7b6447c_0.conda 259 KB
python-3.8.5-h7579374_1.conda 49.3 MB
tqdm-4.42.1-py_0.conda 56 KB
certifi-2020.12.5-py38h06a4308_0.conda 141 KB
conda-4.9.2-py38h06a4308_0.conda 2.9 MB
libffi-3.3-he6710b0_2.conda 50 KB
pysocks-1.7.1-py38h06a4308_0.conda 31 KB
six-1.15.0-py38h06a4308_0.conda 27 KB
pyopenssl-20.0.1-pyhd3eb1b0_1.conda 49 KB
ncurses-6.2-he6710b0_1.conda 817 KB
pycparser-2.20-py_2.conda 94 KB
urllib3-1.26.2-pyhd3eb1b0_0.conda 105 KB
chardet-4.0.0-py38h06a4308_1003.conda 194 KB
readline-8.0-h7b6447c_0.conda 356 KB
conda-package-handling-1.7.2-py38h03888b9_0.conda 886 KB
brotlipy-0.7.0-py38h27cfd23_1003.conda 323 KB
idna-2.10-py_0.conda 50 KB
requests-2.25.1-pyhd3eb1b0_0.conda 52 KB
libedit-3.1.20191231-h14c3975_1.conda 116 KB
cryptography-3.3.1-py38h3c74f83_0.conda 566 KB
cffi-1.14.4-py38h261ae71_0.conda 226 KB
xz-5.2.5-h7b6447c_0.conda 341 KB
sqlite-3.33.0-h62c20be_0.conda 1.1 MB
openssl-1.1.1i-h27cfd23_0.conda 2.5 MB
setuptools-51.1.2-py38h06a4308_4.conda 742 KB
wheel-0.36.2-pyhd3eb1b0_0.conda 33 KB
pip-20.3.3-py38h06a4308_0.conda 1.8 MB
tk-8.6.10-hbc83047_0.conda 3.0 MB
pycosat-0.6.3-py38h7b6447c_1.conda 82 KB
ca-certificates-2020.12.8-h06a4308_0.conda 121 KB
---------------------------------------------------
Total: 66.1 MB

Removed ruamel_yaml-0.15.87-py38h7b6447c_0.conda
Removed python-3.8.5-h7579374_1.conda
Removed tqdm-4.42.1-py_0.conda
Removed certifi-2020.12.5-py38h06a4308_0.conda
Removed conda-4.9.2-py38h06a4308_0.conda
Removed libffi-3.3-he6710b0_2.conda
Removed pysocks-1.7.1-py38h06a4308_0.conda
Removed six-1.15.0-py38h06a4308_0.conda
Removed pyopenssl-20.0.1-pyhd3eb1b0_1.conda
Removed ncurses-6.2-he6710b0_1.conda
Removed pycparser-2.20-py_2.conda
Removed urllib3-1.26.2-pyhd3eb1b0_0.conda
Removed chardet-4.0.0-py38h06a4308_1003.conda
Removed readline-8.0-h7b6447c_0.conda
Removed conda-package-handling-1.7.2-py38h03888b9_0.conda
Removed brotlipy-0.7.0-py38h27cfd23_1003.conda
Removed idna-2.10-py_0.conda
Removed requests-2.25.1-pyhd3eb1b0_0.conda
Removed libedit-3.1.20191231-h14c3975_1.conda
Removed cryptography-3.3.1-py38h3c74f83_0.conda
Removed cffi-1.14.4-py38h261ae71_0.conda
Removed xz-5.2.5-h7b6447c_0.conda
Removed sqlite-3.33.0-h62c20be_0.conda
Removed openssl-1.1.1i-h27cfd23_0.conda
Removed setuptools-51.1.2-py38h06a4308_4.conda
Removed wheel-0.36.2-pyhd3eb1b0_0.conda
Removed pip-20.3.3-py38h06a4308_0.conda
Removed tk-8.6.10-hbc83047_0.conda
Removed pycosat-0.6.3-py38h7b6447c_1.conda
Removed ca-certificates-2020.12.8-h06a4308_0.conda
WARNING: /root/.conda/pkgs does not exist
Cache location:
There are no unused packages to remove
--> 0f0a24ac37a
STEP 4: SHELL ["conda", "run", "-n", "base", "/bin/bash", "-c"]
--> 0abc8927c4c
STEP 5: COPY requirements.txt requirements.txt
--> ca0d0588d9a
STEP 6: RUN pip install -r requirements.txt && rm requirements.txt
Collecting dask
Downloading dask-2020.12.0-py3-none-any.whl (884 kB)
Collecting distributed
Downloading distributed-2020.12.0-py3-none-any.whl (669 kB)
Requirement already satisfied: setuptools in /opt/conda/lib/python3.8/site-packages (from distributed->-r requirements.txt (line 2)) (51.1.2.post20210112)
Collecting click>=6.6
Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting cloudpickle>=1.5.0
Downloading cloudpickle-1.6.0-py3-none-any.whl (23 kB)
Collecting msgpack>=0.6.0
Downloading msgpack-1.0.2-cp38-cp38-manylinux1_x86_64.whl (302 kB)
Collecting psutil>=5.0
Downloading psutil-5.8.0-cp38-cp38-manylinux2010_x86_64.whl (296 kB)
Collecting sortedcontainers!=2.0.0,!=2.0.1
Downloading sortedcontainers-2.3.0-py2.py3-none-any.whl (29 kB)
Collecting tblib>=1.6.0
Downloading tblib-1.7.0-py2.py3-none-any.whl (12 kB)
Collecting toolz>=0.8.2
Downloading toolz-0.11.1-py3-none-any.whl (55 kB)
Collecting tornado>=6.0.3
Downloading tornado-6.1-cp38-cp38-manylinux2010_x86_64.whl (427 kB)
Collecting zict>=0.1.3
Downloading zict-2.0.0-py3-none-any.whl (10 kB)
Collecting heapdict
Downloading HeapDict-1.0.1-py3-none-any.whl (3.9 kB)
Collecting pyyaml
Downloading PyYAML-5.3.1.tar.gz (269 kB)
Building wheels for collected packages: pyyaml
Building wheel for pyyaml (setup.py): started
Building wheel for pyyaml (setup.py): finished with status 'done'
Created wheel for pyyaml: filename=PyYAML-5.3.1-cp38-cp38-linux_x86_64.whl size=44618 sha256=eb43f80222895c1622443e94611b5d0c7ebea9beb221fe9fd919b7c7fad6c1ec
Stored in directory: /root/.cache/pip/wheels/13/90/db/290ab3a34f2ef0b5a0f89235dc2d40fea83e77de84ed2dc05c
Successfully built pyyaml
Installing collected packages: pyyaml, heapdict, zict, tornado, toolz, tblib, sortedcontainers, psutil, msgpack, dask, cloudpickle, click, distributed
Successfully installed click-7.1.2 cloudpickle-1.6.0 dask-2020.12.0 distributed-2020.12.0 heapdict-1.0.1 msgpack-1.0.2 psutil-5.8.0 pyyaml-5.3.1 sortedcontainers-2.3.0 tblib-1.7.0 toolz-0.11.1 tornado-6.1 zict-2.0.0
STEP 7: COMMIT b6de8c97-b4bc-4c71-931a-40d8c7f1a3bc
--> af9b5517008
af9b55170085e86605d65b0e9817afd5dd9952065e842cb7b52e3088bc3d0ea6
Completed short name "coiled/default" with unqualified-search registries (origin: /etc/containers/registries.conf)
Getting image source signatures
Copying blob sha256:9c388eb6d33c40662539172f0d9a357287bd1cd171692ca5c08db2886bc527c3
Copying blob sha256:b91f1f6726b6c56b24216f14b6048fe20b111850c4f99c286f7c96bc15f59016
Copying blob sha256:68ced04f60ab5c7a5f1d0b0b4e7572c5a4c8cce44866513d30d9df1a15277d6b
Copying blob sha256:96cf53b3a9dd496f4c91ab86eeadca2c8a31210c2e5c2a82badbb0dcf8c8f76b
Copying config sha256:5240001adf05380912c5d6fb27b70ac234e8e26aceb938cc7b99e6af8f3ebc40
Writing manifest to image destination
Storing signatures
Docker build succeeded: b6de8c97-b4bc-4c71-931a-40d8c7f1a3bc
Uploading image
Getting image source signatures
Copying blob sha256:6576ca3a39c3bf2b3b904f04c7c10c3472cee2cd0c9f18b18f9022920e4ac5d5
Copying blob sha256:42851133c65c2a965b74a272b1737184e3a0b40c2b02ea29a6a9e9dc45d43971
Copying blob sha256:f2cb0ecef392f2a630fa1205b874ab2e2aedf96de04d0b8838e4e728e28142da
Copying blob sha256:875120aa853cf59c6c5bc24af9f448a55f9b64db0bab58c9ee18f8a92ed8ac33
Copying blob sha256:fcd8d39597dd39d0c68670479e4d240fa9dba04a1246587384df9e1aa31b17d4
Copying blob sha256:ac04535705d30e14d533b743604d88dc5565857e0f76b23cb3e8ffae30a2f41e
Copying blob sha256:5f177fb901dd4e1376702d6aa5fc016b921b71bcb3db446ef68a59429fc6fa8b
Copying blob sha256:02b244f8b383765f4ca4a615c4eaf8f688f5f84279608fd9e3d6b7508830a4f7
Copying config sha256:af9b55170085e86605d65b0e9817afd5dd9952065e842cb7b52e3088bc3d0ea6
Writing manifest to image destination
Storing signatures
```

Inspecting ECR again, I found that the new image that was created was actually stored in us-east-2 (our default region) not us-east-1. It was also tagged with a different tag than the previous image (the new image was tagged with
`b6de8c97-b4bc-4c71-931a-40d8c7f1a3bc`). Ultimately there was an error in the cluster creation process, and an exception was raised in the user's Python session.
Digging a bit deeper, it turns out that while the cluster scheduler and worker tasks were launched in us-east-1 and were attempting to pull their container image from our ECR in us-east-1, they were using the tag for the image in us-east-2. This mismatch resulted in a `CannotPullContainerError` for the scheduler and worker tasks.
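To make the mismatch concrete, here is a toy model of the failure in plain Python. It is only a sketch: `ToyRegistry`, `launch_task`, and the local `CannotPullContainerError` class are hypothetical stand-ins invented for illustration, not Coiled or AWS code. It assumes nothing beyond what is described above: one image tag exists per region, and a task pulls from the registry in the region where it runs.

```python
class CannotPullContainerError(Exception):
    """Hypothetical stand-in for the task error named in the report."""


class ToyRegistry:
    """Toy per-region image store: maps a region name to a set of tags."""

    def __init__(self):
        self.images = {}

    def push(self, region, tag):
        self.images.setdefault(region, set()).add(tag)

    def pull(self, region, tag):
        # A tag is only visible in the region it was pushed to.
        if tag not in self.images.get(region, set()):
            raise CannotPullContainerError(f"{tag} not found in {region}")
        return f"{region}/{tag}"


ORIGINAL_TAG = "da5defc0-a00a-4c6b-ac04-ae6f8dccc573"  # built in us-east-1
REBUILT_TAG = "b6de8c97-b4bc-4c71-931a-40d8c7f1a3bc"   # rebuilt in us-east-2

registry = ToyRegistry()
registry.push("us-east-1", ORIGINAL_TAG)  # initial software environment build
registry.push("us-east-2", REBUILT_TAG)   # rebuild landed in the default region


def launch_task(region, tag):
    """Simulate a scheduler/worker task pulling its image at startup."""
    try:
        return registry.pull(region, tag)
    except CannotPullContainerError as exc:
        return f"failed: {exc}"


# Tasks run in us-east-1 but were handed the us-east-2 tag, so the pull fails.
mismatched = launch_task("us-east-1", REBUILT_TAG)
# With region and tag in agreement, the pull succeeds.
matched = launch_task("us-east-1", ORIGINAL_TAG)
print(mismatched)
print(matched)
```

The point of the sketch is that the image lookup is effectively keyed on the (region, tag) pair, so the region has to be carried consistently through both the build step and the task launch.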