ageron / handson-ml3

A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.
Apache License 2.0

[BUG] Unable to run `docker compose` following the instructions: /opt/conda/envs/homl3/bin/jupyter directory missing #142

Open vasigorc opened 1 week ago

vasigorc commented 1 week ago

Describe the bug

I use a GPU-powered Linux laptop and I couldn't successfully run the `docker compose` scenario.

Here are my prerequisites:

# docker is installed
~ docker --version
Docker version 26.1.4, build 5650f9b

# so is the docker compose plugin
~ docker compose version
Docker Compose version v2.27.1

# nvidia container toolkit is installed
~ dpkg -l | grep nvidia-container-toolkit
ii  nvidia-container-toolkit                          1.12.1-0pop1~1679409890~22.04~5f4b1f2                             amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base                     1.12.1-0pop1~1679409890~22.04~5f4b1f2                             amd64        NVIDIA Container Toolkit Base

# and configured
~ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
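Incidentally, a quick way to sanity-check this file (a minimal sketch; `DAEMON_JSON` is just an override hook for testing, not a toolkit variable): a malformed daemon.json prevents dockerd from starting at all, so it is worth validating before restarting the daemon.

```shell
# Validate daemon.json syntax before restarting dockerd: a JSON syntax
# error here stops the Docker daemon from starting.
DAEMON_JSON="${DAEMON_JSON:-/etc/docker/daemon.json}"
if python3 -m json.tool "$DAEMON_JSON" > /dev/null 2>&1; then
    echo "daemon.json: valid JSON"
else
    echo "daemon.json: missing or invalid JSON"
fi
```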

# nvidia container toolkit sample workload working
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Unable to find image 'ubuntu:latest' locally
latest: Pulling from library/ubuntu
9c704ecd0c69: Pull complete
Digest: sha256:2e863c44b718727c860746568e1d54afd13b2fa71b160f5cd9058fc436217b30
Status: Downloaded newer image for ubuntu:latest
Thu Jun 20 02:37:30 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080 ...    Off |   00000000:02:00.0 Off |                  N/A |
| N/A   46C    P8              4W /  150W |     122MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

# ML-compatible GPU is available
nvidia-smi
Wed Jun 19 21:13:15 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67                 Driver Version: 550.67         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080 ...    Off |   00000000:02:00.0 Off |                  N/A |
| N/A   51C    P8              6W /  150W |     122MiB /  12282MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      3440      G   /usr/lib/xorg/Xorg                             18MiB |
|    0   N/A  N/A      9091    C+G   warp-terminal                                  91MiB |
+-----------------------------------------------------------------------------------------+

# made the required GPU related changes in `docker-compose.yml`
diff --git a/docker/docker-compose.yml b/docker/docker-compose.yml
index d8893d9..8ca7305 100644
--- a/docker/docker-compose.yml
+++ b/docker/docker-compose.yml
@@ -1,14 +1,16 @@
+# Copied from https://github.com/ageron/handson-ml3/blob/main/docker/docker-compose.yml
+# Modification instructions copied from https://github.com/ageron/handson-ml3/tree/main/docker#prerequisites-1
 version: "3"
 services:
   handson-ml3:
     build:
       context: ../
-      dockerfile: ./docker/Dockerfile #Dockerfile.gpu
+      dockerfile: ./docker/Dockerfile.gpu
       args:
         - username=devel
         - userid=1000
     container_name: handson-ml3
-    image: ageron/handson-ml3:latest #latest-gpu
+    image: ageron/handson-ml3:latest-gpu
     restart: unless-stopped
     logging:
       driver: json-file
@@ -20,8 +22,8 @@ services:
     volumes:
       - ../:/home/devel/handson-ml3
     command: /opt/conda/envs/homl3/bin/jupyter lab --ip='0.0.0.0' --port=8888 --no-browser
-    #deploy:
-    #  resources:
-    #    reservations:
-    #      devices:
-    #      - capabilities: [gpu]
+    deploy:
+     resources:
+       reservations:
+         devices:
+         - capabilities: [gpu]
\ No newline at end of file
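For reference, Compose's documented GPU reservation syntax can also name the driver and device count explicitly; the capabilities-only form above should be equivalent on a single-NVIDIA-GPU host, but the more explicit sketch (assumptions: `nvidia` driver, all GPUs) would read:

```yaml
# More explicit GPU reservation for docker-compose.yml (sketch):
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
```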

To Reproduce

  1. Use Pop!_OS or Ubuntu 22.04 LTS
  2. Install the prerequisites
  3. Download the handson-ml3 code repository
  4. Make the required changes
  5. Run `docker compose up` from the docker directory

Here is the output:

docker compose up
WARN[0000] /home/vasilegorcinschi/repos/handson-ml3/docker/docker-compose.yml: `version` is obsolete
Attaching to handson-ml3
Gracefully stopping... (press Ctrl+C again to force)
Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "/opt/conda/envs/homl3/bin/jupyter": stat /opt/conda/envs/homl3/bin/jupyter: no such file or directory: unknown

Expected behavior

The Docker container should start.


vasigorc commented 1 week ago

An investigation detail that could be useful: running the image and inspecting the container with bash, I don't see conda installed, which would explain the above error.

~ docker run -it --rm --runtime=nvidia --gpus all ageron/handson-ml3:latest-gpu /bin/bash

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/

You are running this container as user with ID 1000 and group 1000,
which should map to the ID and group for your user on the Docker host. Great!

/sbin/ldconfig.real: Can't create temporary cache file /etc/ld.so.cache~: Permission denied
devel@99e4901df358:~/handson-ml3$ conda env list
bash: conda: command not found

Not sure why conda is not installed: I pulled the Docker image (didn't build it locally).

FWIW, jupyter is installed inside the container at this path:

~ which jupyter
/usr/local/bin/jupyter
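Given that, a possible workaround (hypothetical, and it bypasses the homl3 conda env entirely, so package versions may differ from the book's) would be to point the compose `command:` at the jupyter that is actually on the PATH in the published latest-gpu image:

```yaml
# Workaround sketch for docker-compose.yml: use the jupyter found at
# /usr/local/bin instead of the missing /opt/conda/envs/homl3/bin/jupyter.
command: jupyter lab --ip='0.0.0.0' --port=8888 --no-browser
```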
vasigorc commented 6 days ago

Digging further, I hit a similar issue when building the image locally.

This PR fixes the issue: https://github.com/ageron/handson-ml3/pull/144

Here is a sample output:

docker compose up
WARN[0000] /home/vasilegorcinschi/repos/handson-ml3/docker/docker-compose.yml: `version` is obsolete
[+] Running 1/1
 ✔ Container handson-ml3  Created  0.1s
Attaching to handson-ml3
handson-ml3  | [I 2024-06-22 00:56:45.071 ServerApp] jupyter_lsp | extension was successfully linked.
handson-ml3  | [I 2024-06-22 00:56:45.074 ServerApp] jupyter_server_mathjax | extension was successfully linked.
handson-ml3  | [I 2024-06-22 00:56:45.076 ServerApp] jupyter_server_terminals | extension was successfully linked.
handson-ml3  | [I 2024-06-22 00:56:45.079 ServerApp] jupyterlab | extension was successfully linked.
handson-ml3  | [I 2024-06-22 00:56:45.079 ServerApp] nbdime | extension was successfully linked.
handson-ml3  | [I 2024-06-22 00:56:45.080 ServerApp] Writing Jupyter server cookie secret to /home/devel/.local/share/jupyter/runtime/jupyter_cookie_secret
handson-ml3  | [I 2024-06-22 00:56:45.592 ServerApp] notebook_shim | extension was successfully linked.
handson-ml3  | [I 2024-06-22 00:56:45.611 ServerApp] notebook_shim | extension was successfully loaded.
handson-ml3  | [I 2024-06-22 00:56:45.613 ServerApp] jupyter_lsp | extension was successfully loaded.
handson-ml3  | [I 2024-06-22 00:56:45.613 ServerApp] jupyter_server_mathjax | extension was successfully loaded.
handson-ml3  | [I 2024-06-22 00:56:45.613 ServerApp] jupyter_server_terminals | extension was successfully loaded.
handson-ml3  | [I 2024-06-22 00:56:45.615 LabApp] JupyterLab extension loaded from /opt/conda/envs/homl3/lib/python3.10/site-packages/jupyterlab
handson-ml3  | [I 2024-06-22 00:56:45.615 LabApp] JupyterLab application directory is /opt/conda/envs/homl3/share/jupyter/lab
handson-ml3  | [I 2024-06-22 00:56:45.615 LabApp] Extension Manager is 'pypi'.
handson-ml3  | [I 2024-06-22 00:56:45.617 ServerApp] jupyterlab | extension was successfully loaded.
handson-ml3  | [I 2024-06-22 00:56:45.709 ServerApp] nbdime | extension was successfully loaded.
handson-ml3  | [I 2024-06-22 00:56:45.709 ServerApp] Serving notebooks from local directory: /home/devel/handson-ml3
handson-ml3  | [I 2024-06-22 00:56:45.709 ServerApp] Jupyter Server 2.14.1 is running at:
handson-ml3  | [I 2024-06-22 00:56:45.709 ServerApp] http://2674095b7bd8:8888/lab?token=1d798602e6f6fc421f80273a15b3b12d10a1d39e050942e0
handson-ml3  | [I 2024-06-22 00:56:45.709 ServerApp]     http://127.0.0.1:8888/lab?token=1d798602e6f6fc421f80273a15b3b12d10a1d39e050942e0
handson-ml3  | [I 2024-06-22 00:56:45.709 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
handson-ml3  | [C 2024-06-22 00:56:45.711 ServerApp]
handson-ml3  |
handson-ml3  |     To access the server, open this file in a browser:
handson-ml3  |         file:///home/devel/.local/share/jupyter/runtime/jpserver-1-open.html
handson-ml3  |     Or copy and paste one of these URLs:
handson-ml3  |         http://2674095b7bd8:8888/lab?token=1d798602e6f6fc421f80273a15b3b12d10a1d39e050942e0
handson-ml3  |         http://127.0.0.1:8888/lab?token=1d798602e6f6fc421f80273a15b3b12d10a1d39e050942e0
handson-ml3  | [I 2024-06-22 00:56:45.725 ServerApp] Skipped non-installed server(s): bash-language-server, dockerfile-language-server-nodejs, javascript-typescript-langserver, jedi-language-server, julia-language-server, pyright, python-language-server, python-lsp-server, r-languageserver, sql-language-server, texlab, typescript-language-server, unified-language-server, vscode-css-languageserver-bin, vscode-html-languageserver-bin, vscode-json-languageserver-bin, yaml-language-server
handson-ml3  | [W 2024-06-22 00:57:02.777 LabApp] Could not determine jupyterlab build status without nodejs
handson-ml3  | [I 2024-06-22 00:57:29.637 ServerApp] Writing notebook-signing key to /home/devel/.local/share/jupyter/notebook_secret
handson-ml3  | [W 2024-06-22 00:57:29.637 ServerApp] Notebook 01_the_machine_learning_landscape.ipynb is not trusted
handson-ml3  | [I 2024-06-22 00:57:30.064 ServerApp] Kernel started: ab6ef08f-0a04-4020-bc1e-72a766350767
handson-ml3  | [I 2024-06-22 00:57:31.438 ServerApp] Connecting to kernel ab6ef08f-0a04-4020-bc1e-72a766350767.
handson-ml3  | [I 2024-06-22 00:57:31.451 ServerApp] Connecting to kernel ab6ef08f-0a04-4020-bc1e-72a766350767.
handson-ml3  | [I 2024-06-22 00:57:31.464 ServerApp] Connecting to kernel ab6ef08f-0a04-4020-bc1e-72a766350767.
handson-ml3  | [I 2024-06-22 00:57:37.034 ServerApp] Starting buffering for ab6ef08f-0a04-4020-bc1e-72a766350767:4a9249d7-0e12-40e7-87fb-071b24a4de19

@ageron I don't have access to link this issue to the PR or to assign you as a reviewer, but I'd appreciate your review (and probably a merge too, since only people with write access can merge).