[Open] jokokojote opened this issue 1 year ago
I was able to reproduce this problem on my local machine, which showed the same behavior. Using:
hello, the most common cause of this symptom is a problem with the master container connecting to the task (notebook/tensorboard) container to proxy the incoming request. this is often caused by firewalls or other networking setup issues.
OS: MacOS 13.5.2, Windows 11, Ubuntu
it's peculiar that you see the issue on three different OSes, and that you both see it. are you working with one shared deployment, or have you each deployed determined separately on all of these OSes?
can you tell me more about how you deployed determined / which guide you followed? (is it det deploy local or something else?)
if you happen to share a corporate firewall / proxy setup, I'd recommend temporarily disabling it to see if it helps.
In my case I tried it on my private machine: Docker Desktop was in use with "Use the WSL 2 based engine" enabled (so I did everything with sudo inside my WSL Ubuntu distribution). On my local network only the Windows firewall is active, with no additional anti-virus programs. The Windows firewall only asked for permission to allow Docker Desktop to access the internet, and so far no container has ever had problems.
I have no proxy activated.
I started the determined cluster using
det deploy local cluster-up
and also started an agent by running the docker container:
docker run \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v "$PWD"/agent.yaml:/etc/determined/agent.yaml \
  determinedai/determined-agent:VERSION
which just worked fine.
When I start "Jupyter Lab" using the UI, I see the container getting started and all pip libraries being downloaded, but it hangs once "Running" is printed on the screen, as shown by @jokokojote.
I'm root on my machine too.
Hello,
I tested it on different machines with different setups to isolate and understand the problem.
At first, I indeed tried it on an ubuntu machine inside a corporate network and ran determined using the master, agent and db docker containers directly (and passed proxy environment variables to the containers). The core functionalities like experiment initialization, (GPU) training, tuning, etc. worked like a charm - jupyter and tensorboard did not, yielding the same logs I added in the issue description. Indeed firewall or proxy settings could be the issue here, even though I do not understand why the agent itself worked and no errors were shown in the logs of tensorboard and jupyter.
Since jupyter and tensorboard did not work on this machine, I tried it on my corporate laptop (Mac), but outside of the corporate network, and set up determined just with det deploy local cluster-up --no-gpu. Same result: core functions worked w/o any problems, jupyter and tensorboard did not.
Then I asked @KevinHubert-Dev to try it at home on a private machine and private network, and he got the same results, as he described above.
It is highly unusual to see this happen on so many different setups. I'll need your help debugging it.
When you start a notebook, there'd be a "registering service" log line in the master logs, e.g.
INFO[2023-10-19T13:24:31-07:00] registering service: b8f0beb4-0c7a-4b70-b4a3-6bdcee294de1 (https://127.0.0.1:32903) component=proxy
you can docker exec -it <master container name> /bin/bash into the master container, and try to curl --insecure <service url>, e.g. curl --insecure https://127.0.0.1:32903 in this case. this should simulate what master does. if it works, that's weird; if it does not, we need to debug why. if you can't see this log line in the logs at all, please share the master logs.
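for example, a rough sketch (assuming the default container name from det deploy local and the example service URL above; substitute the URL from your own "registering service" log line):

# open a shell inside the master container
docker exec -it determined_determined-master_1 /bin/bash
# then, from inside that shell, try the registered service URL
curl --insecure https://127.0.0.1:32903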
I did what you suggested on my corporate laptop in my private network:
Start up with:
det deploy local cluster-up --no-gpu
Removing network determined_default
Creating network determined_default...
Creating determined_determined-db_1...
Waiting for determined_determined-db_1...
Creating determined_determined-master_1...
Waiting for master instance to be available....
Starting determined-agent-0
Master logs:
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"notebook_timeout":null,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""},"ssh":{"rsa_key_size":1024},"authz":{"type":"basic","fallback":"basic","rbac_ui_enabled":null,"_strict_ntsc_enabled":false,"workspace_creator_assign_role":{"enabled":true,"role_id":2},"strict_job_queue_control":false}},"checkpoint_storage":{"host_path":"/Users/fero/Library/Application Support/determined","propagation":null,"save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"storage_path":null,"type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null,"add_capabilities":null,"drop_capabilities":null,"devices":null,"bind_mounts":null,"work_dir":null,"slurm":{},"pbs":{},"kubernetes":null},"port":8080,"root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","otel_enabled":false,"otel_endpoint":"localhost:4317","segment_webui_key":"********","cluster_id":""},"enable_cors":false,"launch_error":true,"cluster_name":"","logging":{"type":"default"},"observability":{"enable_prometheus":false},"cache":{"cache_dir":"/var/cache/determined"},"webhooks":{"base_url":"","signing_key":"3dfc80a6eab1"},"feature_switches":[],"resource_manager":{"client_ca":"","default_aux_resource_pool":"default","default_compute_resource_pool":"default","no_default_resource_pools":false,"require_authentication":false,"scheduler":{"allow_heterogeneous_fits":false,"fitting_policy":"best","type":"fair_share"},"type":"agent"},"resource_pools":[{"pool_name":"default","description":"","provider":null,"max_aux_containers_per_agent":100,"task_container_defaults":null,"agent_reattach_enabled":false,"agent_reconnect_wait":"25s","kubernetes_namespace":""}],"__internal":{"audit_logging_enabled":false,"external_sessions":{"login_uri":"","logout_uri":"","jwt_key":""}}}
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] Determined master 0.26.1 (built with go1.21.0)
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] connecting to database determined-db:5432
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] running DB migrations from file:///usr/share/determined/master/static/migrations; this might take a while...
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] migrated from 0 to 20231006193809
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] DB migrations completed
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] deleting all snapshots for terminal state experiments
2023-10-20 11:07:56 INFO[2023-10-20T09:07:56Z] Generating a new CA certificate and key
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] Saved certificate and key to DB
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] Generating a new certificate and key for master
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] Saved certificate and key to DB
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] creating resource pool: default actor-local-addr=agentRM actor-system=master go-type=agentResourceManager
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] pool default using global scheduling config actor-local-addr=agentRM actor-system=master go-type=agentResourceManager
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] not enabling provisioner for resource pool: default actor-local-addr=default actor-system=master go-type=resourcePool resource-pool=default
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] scheduling next resource allocation aggregation in 14h53m2s at 2023-10-21 00:01:00 +0000 UTC actor-local-addr=allocation-aggregator actor-system=master go-type=allocationAggregator
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] telemetry reporting is enabled; run with --telemetry-enabled=false to disable component=telemetry
2023-10-20 11:07:57 INFO[2023-10-20T09:07:57Z] accepting incoming connections on port 8080
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] resource pool is empty; using default resource pool: default actor-local-addr=agents actor-system=master go-type=agents
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] agent connected ip: 172.18.0.1 resource pool: default slots: 1 actor-local-addr=determined-agent-0 actor-system=master go-type=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] adding device: cpu0 ( x 6 cores) on determined-agent-0 actor-local-addr=determined-agent-0 actor-system=master go-type=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] adding agent: determined-agent-0 actor-local-addr=default actor-system=master agent-id=determined-agent-0 go-type=resourcePool resource-pool=default
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] resources are requested by JupyterLab (duly-strong-piglet) (Allocation ID: c293a8c8-31c6-4d83-a4ac-70a40e5c057b.1) actor-local-addr=default actor-system=master allocation-id=c293a8c8-31c6-4d83-a4ac-70a40e5c057b.1 go-type=resourcePool resource-pool=default restore=false restoring=false
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] allocated resources to JupyterLab (duly-strong-piglet) actor-local-addr=default actor-system=master go-type=resourcePool resource-pool=default
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] 1 resources allocated job-id=2765d3da-4e08-4494-ae50-a0d359dff301 restore=false task-id=c293a8c8-31c6-4d83-a4ac-70a40e5c057b task-type=NOTEBOOK
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] starting container actor-local-addr=determined-agent-0 actor-system=master allocation-id=c293a8c8-31c6-4d83-a4ac-70a40e5c057b.1 container-id=3223bcc3-4c4b-4294-b95e-78268f622808 go-type=agent job-id=2765d3da-4e08-4494-ae50-a0d359dff301 slots=1 task-id=c293a8c8-31c6-4d83-a4ac-70a40e5c057b task-type=NOTEBOOK
2023-10-20 11:12:53 INFO[2023-10-20T09:12:53Z] registering service: c293a8c8-31c6-4d83-a4ac-70a40e5c057b (https://172.18.0.1:32768) component=proxy
2023-10-20 11:15:14 2023/10/20 09:15:14 http: proxy error: dial tcp 172.18.0.1:32768: connect: connection timed out
Curl inside master gets timeout:
curl --insecure https://172.18.0.1:32768
curl: (28) Failed to connect to 172.18.0.1 port 32768 after 130208 ms: Connection timed out
Jupyter container logs:
2023-10-20 11:12:55 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
2023-10-20 11:12:55 INFO: [31] root: detected 0 gpus (nvidia-smi not found)
2023-10-20 11:12:55 INFO: [31] root: rocm-smi not found
2023-10-20 11:12:55 INFO: [31] root: Running task container on agent_id=determined-agent-0, hostname=f9941fc1ee0b with visible GPUs []
2023-10-20 11:12:55 INFO: [31] root: detected 0 gpu processes (nvidia-smi not found)
2023-10-20 11:12:55 + test -f startup-hook.sh
2023-10-20 11:12:55 + set +x
2023-10-20 11:12:56 WARNING: [ServerApp] ServerApp.token config is deprecated in 2.0. Use IdentityProvider.token.
2023-10-20 11:12:56 INFO: [ServerApp] Package jupyterlab took 0.0000s to import
2023-10-20 11:12:56 INFO: [ServerApp] Package jupyter_archive took 0.0008s to import
2023-10-20 11:12:56 INFO: [ServerApp] Package jupyter_server_terminals took 0.0025s to import
2023-10-20 11:12:56 INFO: [ServerApp] Package nbclassic took 0.0000s to import
2023-10-20 11:12:56 WARNING: [ServerApp] A `_jupyter_server_extension_points` function was not found in nbclassic. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
2023-10-20 11:12:56 INFO: [ServerApp] Package notebook_shim took 0.0000s to import
2023-10-20 11:12:56 WARNING: [ServerApp] A `_jupyter_server_extension_points` function was not found in notebook_shim. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
2023-10-20 11:12:56 INFO: [ServerApp] jupyter_archive | extension was successfully linked.
2023-10-20 11:12:56 INFO: [ServerApp] jupyter_server_terminals | extension was successfully linked.
2023-10-20 11:12:56 INFO: [ServerApp] jupyterlab | extension was successfully linked.
2023-10-20 11:12:56 INFO: [ServerApp] nbclassic | extension was successfully linked.
2023-10-20 11:12:56 INFO: [ServerApp] Writing Jupyter server cookie secret to /run/determined/jupyter/runtime/jupyter_cookie_secret
2023-10-20 11:12:56 INFO: [ServerApp] notebook_shim | extension was successfully linked.
2023-10-20 11:12:56 WARNING: [ServerApp] All authentication is disabled. Anyone who can connect to this server will be able to run code.
2023-10-20 11:12:56 INFO: [ServerApp] notebook_shim | extension was successfully loaded.
2023-10-20 11:12:56 INFO: [ServerApp] jupyter_archive | extension was successfully loaded.
2023-10-20 11:12:56 INFO: [ServerApp] jupyter_server_terminals | extension was successfully loaded.
2023-10-20 11:12:56 INFO: [LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.8/site-packages/jupyterlab
2023-10-20 11:12:56 INFO: [LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
2023-10-20 11:12:56 INFO: [ServerApp] jupyterlab | extension was successfully loaded.
2023-10-20 11:12:56
2023-10-20 11:12:56 _ _ _ _
2023-10-20 11:12:56 | | | |_ __ __| |__ _| |_ ___
2023-10-20 11:12:56 | |_| | '_ \/ _` / _` | _/ -_)
2023-10-20 11:12:56 \___/| .__/\__,_\__,_|\__\___|
2023-10-20 11:12:56 |_|
2023-10-20 11:12:56
2023-10-20 11:12:56 Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.
2023-10-20 11:12:56
2023-10-20 11:12:56 https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html
2023-10-20 11:12:56
2023-10-20 11:12:56 Please note that updating to Notebook 7 might break some of your extensions.
2023-10-20 11:12:56
2023-10-20 11:12:56 INFO: [ServerApp] nbclassic | extension was successfully loaded.
2023-10-20 11:12:56 INFO: [ServerApp] Serving notebooks from local directory: /run/determined/workdir
2023-10-20 11:12:56 INFO: [ServerApp] Jupyter Server 2.7.0 is running at:
2023-10-20 11:12:56 INFO: [ServerApp] https://f9941fc1ee0b:3085/proxy/c293a8c8-31c6-4d83-a4ac-70a40e5c057b/lab
2023-10-20 11:12:56 INFO: [ServerApp] https://127.0.0.1:3085/proxy/c293a8c8-31c6-4d83-a4ac-70a40e5c057b/lab
2023-10-20 11:12:56 INFO: [ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Docker containers running:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f9941fc1ee0b determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-2b7e2a1 "/run/determined/jup…" 8 minutes ago Up 8 minutes 0.0.0.0:32768->3085/tcp infallible_grothendieck
034d5a640a92 determinedai/determined-agent:0.26.1 "/run/determined/wor…" 13 minutes ago Up 13 minutes determined-agent-0
415688c85e56 determinedai/determined-master:0.26.1 "/usr/bin/determined…" 13 minutes ago Up 13 minutes 0.0.0.0:8080->8080/tcp determined_determined-master_1
ca39ae6cde8a postgres:10.14 "docker-entrypoint.s…" 14 minutes ago Up 14 minutes (healthy) 5432/tcp determined_determined-db_1
Agent logs:
2023-10-20 11:08:07 WARN[2023-10-20T09:08:07Z] no configuration file at /etc/determined/agent.yaml, skipping
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] agent configuration: {"config_file":"","master_host":"host.docker.internal","master_port":8080,"agent_id":"determined-agent-0","artificial_slots":0,"slot_type":"auto","container_master_host":"","container_master_port":0,"label":"","resource_pool":"","api_enabled":false,"bind_ip":"0.0.0.0","bind_port":9090,"visible_gpus":"","tls":false,"cert_file":"","key_file":"","http_proxy":"","https_proxy":"","ftp_proxy":"","no_proxy":"","security":{"tls":{"enabled":false,"skip_verify":false,"master_cert":"","master_cert_name":"","client_cert":"","client_key":""}},"fluent":{"image":"","port":0,"container_name":""},"container_auto_remove_disabled":false,"agent_reconnect_attempts":5,"agent_reconnect_backoff":5,"hooks":{"on_connection_lost":null},"container_runtime":"docker","image_root":"","singularity_options":{"allow_network_creation":false},"podman_options":{"allow_network_creation":false},"debug":false}
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] starting main agent process
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] connecting to master component=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] connecting to master at: ws://host.docker.internal:8080/agents?id=determined-agent-0&version=0.26.1&resource_pool=&reconnect=false&hostname=docker-desktop component=agent
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] reading master set agent options message component=agent
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] running socket read loop component=websocket name=determined-agent-0 remote-addr="192.168.65.254:8080"
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] running socket write loop component=websocket name=determined-agent-0 remote-addr="192.168.65.254:8080"
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] detecting devices component=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] detected compute devices:
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] cpu0 ( x 6 cores)
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] setting up docker runtime component=agent
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] couldn't process ~/.docker/config.json can't read Docker config: open /root/.docker/config.json: no such file or directory component=docker-client
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] can't find any docker credential stores, continuing without them component=docker-client
2023-10-20 11:08:07 INFO[2023-10-20T09:08:07Z] can't find any auths in ~/.docker/config.json, continuing without them component=docker-client
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] setting up container manager component=agent
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] reattaching containers component=agent
2023-10-20 11:08:07 DEBU[2023-10-20T09:08:07Z] reattachContainers: expected survivors: [] component=container-manager
2023-10-20 11:08:07 DEBU[2023-10-20T09:08:07Z] reattachContainers: running containers: [] component=container-manager
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] iterating expected survivors and seeing if they were found component=container-manager
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] sending SIGKILL to running containers that were not reattached component=container-manager
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] writing agent started message component=agent
2023-10-20 11:08:07 TRAC[2023-10-20T09:08:07Z] watching for ws requests and system events component=agent
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] starting container 3223bcc3-4c4b-4294-b95e-78268f622808 component=container-manager
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] starting container launch component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] kicking off goroutine shim SIGKILL to cancellations, until we have launched component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] kicking off goroutine to launch the container component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] waiting for launch to complete component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 TRAC[2023-10-20T09:09:14Z] pulling image component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:09:14 INFO[2023-10-20T09:09:14Z] transitioning state from ASSIGNED to PULLING component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808 stop="<nil>"
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] creating container, copying files, etc component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:12:53 INFO[2023-10-20T09:12:53Z] transitioning state from PULLING to STARTING component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808 stop="<nil>"
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] starting container component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808 docker-id=f9941fc1ee0b0941ed492c3b8818dca67c92227d0d90a4bb75e20050f5b58306
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] signal-to-context shimmer exited component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] transitioning to running state component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
2023-10-20 11:12:53 INFO[2023-10-20T09:12:53Z] transitioning state from STARTING to RUNNING component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808 stop="<nil>"
2023-10-20 11:12:53 TRAC[2023-10-20T09:12:53Z] in monitoring loop component=container cproto-id=3223bcc3-4c4b-4294-b95e-78268f622808
As shown above, curl inside the master container times out. Do you have any insight into why this does not work?
Verbose mode did not yield any more information using curl:
# curl --insecure https://172.18.0.1:32768 -v
* Trying 172.18.0.1:32768...
* connect to 172.18.0.1 port 32768 failed: Connection timed out
* Failed to connect to 172.18.0.1 port 32768 after 128437 ms: Connection timed out
* Closing connection 0
curl: (28) Failed to connect to 172.18.0.1 port 32768 after 128437 ms: Connection timed out
I am not a docker expert, so maybe this is not relevant, but I was wondering why a localhost address (https://127.0.0.1:32903) was used in your example, while in my master logs 172.18.0.1 occurs. I suspected this to be linked to the determined_default network which is set up when running det deploy local cluster-up --no-gpu:
Removing network determined_default
**Creating network determined_default...**
Creating determined_determined-db_1...
Waiting for determined_determined-db_1...
Creating determined_determined-master_1...
Waiting for master instance to be available....
Starting determined-agent-0
Containers running after trying to run jupyter:
fero@BLN-FERO1OSX ~ % docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
581f2aafaaea determinedai/environments:py-3.8-pytorch-1.12-tf-2.11-cpu-2b7e2a1 "/run/determined/jup…" 14 minutes ago Up 14 minutes 0.0.0.0:32768->3134/tcp youthful_stonebraker
72d5c4237bde determinedai/determined-agent:0.26.1 "/run/determined/wor…" 19 minutes ago Up 19 minutes determined-agent-0
17ff592e096e determinedai/determined-master:0.26.1 "/usr/bin/determined…" 19 minutes ago Up 19 minutes 0.0.0.0:8080->8080/tcp determined_determined-master_1
23646c77853a postgres:10.14 "docker-entrypoint.s…" 20 minutes ago Up 20 minutes (healthy) 5432/tcp determined_determined-db_1
Inspecting the docker networks showed that only the db and master containers are in the determined_default network; I don't know if this is intended.
docker network inspect determined_default
[
{
"Name": "determined_default",
"Id": "7a594f724022b0f7da4ea03a1eec6afe9d60c73c9986427ba224f3fcd84562bc",
"Created": "2023-10-23T08:49:10.04962575Z",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.18.0.0/16",
"Gateway": "172.18.0.1"
}
]
},
"Internal": false,
"Attachable": true,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"17ff592e096e476ec25d14da167f017f45bcd3dec2d95a65136f3ca88dfb7196": {
"Name": "determined_determined-master_1",
"EndpointID": "652f56185779ad583c450d9194c0fc197ec9d614c62ceb2e70f93e78a18a837b",
"MacAddress": "02:42:ac:12:00:03",
"IPv4Address": "172.18.0.3/16",
"IPv6Address": ""
},
"23646c77853ad8521941175c6270f63961a90469875ec851befb498be75cf2cf": {
"Name": "determined_determined-db_1",
"EndpointID": "dc19e0e4d78e86f3c7e3a365618e19ca04c1c354bb78a9293e184ab679c586cb",
"MacAddress": "02:42:ac:12:00:02",
"IPv4Address": "172.18.0.2/16",
"IPv6Address": ""
}
},
"Options": {},
"Labels": {}
}
]
Agent is in host network mode:
fero@BLN-FERO1OSX ~ % docker network inspect host
[
{
"Name": "host",
"Id": "3f050a8b78973aafc4140e1e99e6c72d6a61f1a5c7653a64e00598379171be1f",
"Created": "2023-09-01T09:04:53.983683458Z",
"Scope": "local",
"Driver": "host",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": []
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"72d5c4237bdea4076bc5683fe54bc85c2b20e0005d20b525ca3c3415de5b1601": {
"Name": "determined-agent-0",
"EndpointID": "457d6e7cdbf24cde2480a51af4af186e8d91c59f766d67a6d2d2db79f588eee2",
"MacAddress": "",
"IPv4Address": "",
"IPv6Address": ""
}
},
"Options": {},
"Labels": {}
}
]
Jupyter container is in bridge mode:
fero@BLN-FERO1OSX ~ % docker network inspect bridge
[
{
"Name": "bridge",
"Id": "6e350adf502865d9a91cb8664b6912dd990799de33313ef2368f267e475164b2",
"Created": "2023-10-23T08:49:09.70243675Z",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.17.0.0/16",
"Gateway": "172.17.0.1"
}
]
},
"Internal": false,
"Attachable": false,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"581f2aafaaea9f2c3eb71428c7b4e8574a79cbe646923a0e4384c2a80d5d2c1e": {
"Name": "youthful_stonebraker",
"EndpointID": "85bc23b70833f03aa9c67135a4a255daff41b28702995304101c60b39843c2dd",
"MacAddress": "02:42:ac:11:00:02",
"IPv4Address": "172.17.0.2/16",
"IPv6Address": ""
}
},
"Options": {
"com.docker.network.bridge.default_bridge": "true",
"com.docker.network.bridge.enable_icc": "true",
"com.docker.network.bridge.enable_ip_masquerade": "true",
"com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
"com.docker.network.bridge.name": "docker0",
"com.docker.network.driver.mtu": "65535"
},
"Labels": {}
}
]
I was able to repro the issue with det deploy local on macos; it works fine on ubuntu. will investigate more.
as a temporary workaround, I can suggest installing the master and agent using linux packages or homebrew, which should address that problem by not having the master wrapped in docker.
@jokokojote did you do your last test on macos? or on ubuntu?
Last test was on macOS.
On ubuntu I started it with:
# Start Postgres container
docker run \
--name determined-db \
--network host \
-p 5432:5432 \
-v determined_db:/var/lib/postgresql/data \
-e POSTGRES_DB=determined \
-e POSTGRES_PASSWORD="postgres" \
-d \
postgres:10
# Start Determined master node container
docker run \
--name determined-master \
--network host \
-e DET_DB_HOST=localhost \
-e DET_DB_NAME=determined \
-e DET_DB_PORT=5432 \
-e DET_DB_USER=postgres \
-e DET_DB_PASSWORD="postgres" \
-e http_proxy=http://10.56.130.176:3128 \
-e https_proxy=http://10.56.130.176:3128 \
-e no_proxy=localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
-d \
determinedai/determined-master:0.26.1
# Start Determined agent node container
docker run \
--name determined-agent \
--network host \
-v /var/run/docker.sock:/var/run/docker.sock \
-e DET_MASTER_HOST=localhost \
-e DET_MASTER_PORT=8080 \
-e http_proxy=http://10.56.130.176:3128 \
-e https_proxy=http://10.56.130.176:3128 \
-e no_proxy=localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 \
--gpus all \
-d \
determinedai/determined-agent:0.26.1
I also ran det deploy local on this machine once and had the same problem with jupyter and tensorboard not working.
so the ubuntu setup has the proxy configuration; this often causes problems.
you'd need to set up task_container_defaults -> environment_variables in the master config to also pass the proxy variables to the task containers. this configuration cannot be passed through docker run -e; you'd need to create and mount a config file instead.
otherwise, master and agent have this config, but the spawned containers don't.
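for reference, a minimal sketch of what that could look like, reusing the proxy values from your docker run commands above (exact keys may vary slightly by version):

# master.yaml (sketch)
task_container_defaults:
  environment_variables:
    - http_proxy=http://10.56.130.176:3128
    - https_proxy=http://10.56.130.176:3128
    - no_proxy=localhost,127.0.0.1,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16

then mount it into the master container by adding something like -v "$PWD"/master.yaml:/etc/determined/master.yaml to the determined-master docker run command above (assuming the image reads its config from the default /etc/determined/master.yaml path), the same way the agent.yaml mount works for the agent.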
I ran into almost the same problem. If the master is running on a separate server, is it possible to reach the registered address, e.g. 172.18.0.1? That address is only reachable from docker, not LAN-wide. Is it possible to set the IP address used for service registration through the agent.yaml config file?
Sorry, nothing comes to mind. If complex bridge networking is causing issues, you can try switching to host mode networking.
Setting up local k8s clusters is also much easier nowadays, so that's another path to consider if you don't want to maintain a raw docker setup.
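If you want to try host networking for the task containers themselves, one option (a sketch, assuming a mounted master.yaml as described earlier; I can't promise it resolves every case) is to change the default network mode:

# master.yaml (sketch): run task containers with host networking instead of bridge
task_container_defaults:
  network_mode: host

Note that with host networking the notebook/tensorboard ports bind directly on the agent host, so watch out for port conflicts.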
I am hitting the same issue with the same startup.
<info> [2024-07-17 09:54:23] [fe2dc1b5] copying files to container: /
<info> [2024-07-17 09:54:31] [fe2dc1b5] copying files to container: /run/determined
<info> [2024-07-17 09:54:37] [fe2dc1b5] copying files to container: /
<info> [2024-07-17 09:54:43] [fe2dc1b5] copying files to container: /
<info> [2024-07-17 09:54:46] [fe2dc1b5] copying files to container: /
<info> [2024-07-17 09:54:52] [fe2dc1b5] Resources for JupyterLab (especially-legal-warthog) have started
<warning> [2024-07-17 09:55:00] [fe2dc1b5] Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: detected 0 gpus (nvidia-smi not found)
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: rocm-smi not found
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: detected 0 gpus (nvidia-smi not found)
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: rocm-smi not found
<info> [2024-07-17 09:55:04] [fe2dc1b5] [26] determined: Running task container on agent_id=determined-agent-0, hostname=ec3212405320 with visible GPUs []
<info> [2024-07-17 09:55:05] [fe2dc1b5] [26] determined: detected 0 gpu processes (nvidia-smi not found)
<> [2024-07-17 09:55:05] [fe2dc1b5] + test -f /run/determined/dynamic-tcd-startup-hook.sh
<> [2024-07-17 09:55:05] [fe2dc1b5] + test -f startup-hook.sh
<> [2024-07-17 09:55:05] [fe2dc1b5] + set +x
<warning> [2024-07-17 09:55:14] [fe2dc1b5] root:jupyter is still not reachable at ('127.0.0.1', 3181)
<warning> [2024-07-17 09:55:22] [fe2dc1b5] [ServerApp] ServerApp.token config is deprecated in 2.0. Use IdentityProvider.token.
<info> [2024-07-17 09:55:23] [fe2dc1b5] [ServerApp] Extension package jupyter_server_terminals took 0.5429s to import
<warning> [2024-07-17 09:55:24] [fe2dc1b5] root:jupyter is still not reachable at ('127.0.0.1', 3181)
<info> [2024-07-17 09:55:25] [fe2dc1b5] [ServerApp] Extension package jupyter_server_ydoc took 2.1832s to import
<warning> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] A `_jupyter_server_extension_points` function was not found in nbclassic. Instead, a `_jupyter_server_extension_paths` function was found and will be used for now. This function name will be deprecated in future releases of Jupyter Server.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyter_archive | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyter_server_fileid | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyter_server_terminals | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyter_server_ydoc | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] jupyterlab | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] nbclassic | extension was successfully linked.
<info> [2024-07-17 09:55:26] [fe2dc1b5] [ServerApp] Writing Jupyter server cookie secret to /run/determined/jupyter/runtime/jupyter_cookie_secret
<info> [2024-07-17 09:55:34] [fe2dc1b5] [ServerApp] notebook_shim | extension was successfully linked.
<warning> [2024-07-17 09:55:34] [fe2dc1b5] root:jupyter is still not reachable at ('127.0.0.1', 3181)
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] notebook_shim | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyter_archive | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] Configured File ID manager: ArbitraryFileIdManager
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] ArbitraryFileIdManager : Configured root dir: /
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] ArbitraryFileIdManager : Configured database path: /run/determined/jupyter/data/file_id_manager.db
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] ArbitraryFileIdManager : Successfully connected to database file.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] ArbitraryFileIdManager : Creating File ID tables and indices with journal_mode = DELETE
<info> [2024-07-17 09:55:35] [fe2dc1b5] [FileIdExtension] Attached event listeners.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyter_server_fileid | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyter_server_terminals | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyter_server_ydoc | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [LabApp] JupyterLab extension loaded from /opt/conda/lib/python3.10/site-packages/jupyterlab
<info> [2024-07-17 09:55:35] [fe2dc1b5] [LabApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] jupyterlab | extension was successfully loaded.
<> [2024-07-17 09:55:35] [fe2dc1b5]
<> [2024-07-17 09:55:35] [fe2dc1b5] _ _ _ _
<> [2024-07-17 09:55:35] [fe2dc1b5] | | | |_ __ __| |__ _| |_ ___
<> [2024-07-17 09:55:35] [fe2dc1b5] | |_| | '_ \/ _` / _` | _/ -_)
<> [2024-07-17 09:55:35] [fe2dc1b5] \___/| .__/\__,_\__,_|\__\___|
<> [2024-07-17 09:55:35] [fe2dc1b5] |_|
<> [2024-07-17 09:55:35] [fe2dc1b5]
<> [2024-07-17 09:55:35] [fe2dc1b5] Read the migration plan to Notebook 7 to learn about the new features and the actions to take if you are using extensions.
<> [2024-07-17 09:55:35] [fe2dc1b5]
<> [2024-07-17 09:55:35] [fe2dc1b5] https://jupyter-notebook.readthedocs.io/en/latest/migrate_to_notebook7.html
<> [2024-07-17 09:55:35] [fe2dc1b5]
<> [2024-07-17 09:55:35] [fe2dc1b5] Please note that updating to Notebook 7 might break some of your extensions.
<> [2024-07-17 09:55:35] [fe2dc1b5]
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] nbclassic | extension was successfully loaded.
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] Serving notebooks from local directory: /
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] Jupyter Server 2.14.1 is running at:
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] https://localhost:3181/proxy/8e8dd8ad-1633-40aa-b5a1-159e99b991e7/lab?token=...
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] https://127.0.0.1:3181/proxy/8e8dd8ad-1633-40aa-b5a1-159e99b991e7/lab?token=...
<info> [2024-07-17 09:55:35] [fe2dc1b5] [ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
<info> [2024-07-17 09:55:35] || INFO: Service of JupyterLab (especially-legal-warthog) is available
<warning> [2024-07-17 09:56:08] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46408): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:09] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46416): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:11] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46420): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:12] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46434): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:12] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 46436): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 09:56:13] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 34224): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 10:01:37] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 52944): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 10:01:38] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 52950): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 10:01:43] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 40750): [SSL: HTTP_REQUEST] http request (_ssl.c:1007)
<warning> [2024-07-17 10:01:45] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 40752): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] SSL Error on 12 ('172.17.0.1', 40764): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] 404 GET / (@172.17.0.1) 154.75ms referer=None
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] SSL Error on 13 ('172.17.0.1', 40772): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] SSL Error on 13 ('172.17.0.1', 40784): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<warning> [2024-07-17 10:01:49] [fe2dc1b5] [ServerApp] SSL Error on 14 ('172.17.0.1', 40790): [SSL: SSLV3_ALERT_CERTIFICATE_UNKNOWN] sslv3 alert certificate unknown (_ssl.c:1007)
<info> [2024-07-17 10:01:55] [fe2dc1b5] [LabApp] 302 GET /proxy/8e8dd8ad-1633-40aa-b5a1-159e99b991e7/lab (@172.17.0.1) 1.60ms
<info> [2024-07-17 10:01:58] [fe2dc1b5] [LabApp] 302 GET /proxy/8e8dd8ad-1633-40aa-b5a1-159e99b991e7/lab (@172.17.0.1) 1.21ms
I think I figured out why this issue happens. Refer to this log line in the master container:
http: proxy error: dial tcp 172.27.0.1:32807: i/o timeout
I found something when inspecting the docker network determined_default, shown below:
❯ docker network inspect determined_default
[
{
"Name": "determined_default",
"Id": "744f81e72ad1f8955e795dbb07b840bfbf60bc77b82051229adf69eb33bd7dca",
"Created": "2024-07-18T07:23:24.691171616Z",
"Scope": "local",
"Driver": "bridge",
"EnableIPv6": false,
"IPAM": {
"Driver": "default",
"Options": null,
"Config": [
{
"Subnet": "172.27.0.0/16",
"Gateway": "172.27.0.1"
}
]
},
"Internal": false,
"Attachable": true,
"Ingress": false,
"ConfigFrom": {
"Network": ""
},
"ConfigOnly": false,
"Containers": {
"256f0f6b0c8251f9f007843f29dbb27bb616b856d2c1d38988cd8293e9e6bc76": {
"Name": "determined_determined-master_1",
"EndpointID": "724f92566fd74240ccdd44efaef60a7944d0742439983999c0bc0c6c99dac1f6",
"MacAddress": "02:42:ac:1b:00:03",
"IPv4Address": "172.27.0.3/16",
"IPv6Address": ""
},
"912549e04df157dbcd8b1956b49a27c5d28b7920850f97f4d0cc6ea2766c3473": {
"Name": "determined_determined-db_1",
"EndpointID": "3c6f4c2f17749a43526a36fe173260de61b8e0f608670ef7ce087b8710163e49",
"MacAddress": "02:42:ac:1b:00:02",
"IPv4Address": "172.27.0.2/16",
"IPv6Address": ""
}
},
"Options": {},
"Labels": {}
}
]
We can see that 172.27.0.1 is the gateway of the docker network named determined_default, not the IP of the JupyterLab container. The port 32807 that I found in the master container's log is the published port of the JupyterLab container.
# 172.17.0.2 is the IP of the JupyterLab container and 172.21.53.125 is the IP of the host machine
❯ docker run --rm --network determined_default busybox telnet 172.17.0.2 2925
^Ctelnet: can't connect to remote host (172.17.0.2): Connection timed out
❯ docker run --rm busybox telnet 172.17.0.2 2925
Connected to 172.17.0.2
❯ docker run --rm --network determined_default busybox telnet 172.21.53.125 32807
Connected to 172.21.53.125
So, the conclusion is that there is something wrong with the proxy module: it needs to redirect requests to the right IP and port.
Unfortunately, I'm not a golang programmer. Could anyone help to fix this up?
Describe the bug
I am not sure if this is a bug or whether I missed some basic config step, but I checked the docs multiple times and did not find any information about this:
Jupyter Lab and TensorBoard are stuck at "Waiting for ..." even though the docker containers were started successfully, with no errors shown in the logs.
Tried with 0.26.1, 0.26.0, 0.25.1 and 0.21.2 on MacOS, Ubuntu and Windows.
TensorBoard 0.26.1 logs:
Jupyter 0.26.1 logs:
Jupyter 0.21.2 logs:
Reproduction Steps
1. det deploy local cluster-up --no-gpu
2.a. Open the UI: Tasks -> launch Jupyter, OR
2.b.1. Run an experiment, e.g. gan_mnist_pytorch, with det experiment create const.yaml .
2.b.2. Open the UI, open the experiment, open TensorBoard.
Expected Behavior
The UI for Jupyter Lab / TensorBoard should open after some (short) waiting time (or at least a meaningful error message should show up).
Screenshot
Environment
Additional Context
No response