dstackai / dstack

dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

Do not fail if `/root/.profile` is missing in user-specified Docker image #1086

Open jvstme opened 8 months ago

jvstme commented 8 months ago

Steps to reproduce

Run a configuration with `image: fedora:39`:

> cat hello.dstack.yml
type: task

image: fedora:39
commands:
  - echo Hello

resources:
  cpu: 1..
  memory: 0.3GB..

> dstack run . -f hello.dstack.yml

Expected behaviour

The configuration runs successfully.

Actual behaviour

The run fails. CLI:

ancient-impala-1 provisioning completed (terminating)
Run failed with error code JobTerminationReason.INTERRUPTED_BY_NO_CAPACITY. Check CLI and server logs for more details.

Server logs:

ERROR 2024-04-03T14:53:11.772 dstack._internal.server.background.tasks.process_running_jobs The docker container of the job 'ancient-impala-1-0-0' is not working: exit code: 2, error 
DEBUG 2024-04-03T14:53:11.773 dstack._internal.server.background.tasks.process_running_jobs runner healthcheck: {'state': 'pending', 'container_name': 'ancient-impala-1-0-0', 'status': 'exited', 'running': False, 'oom_killed': False, 'dead': False, 'exit_code': 2, 'error': ''}

shim.log on the cloud instance:

2024/04/03 12:52:22 Pulling image
2024/04/03 12:52:22 Creating container
2024/04/03 12:52:22 Unable to stop the container: Error response from daemon: No such container: ancient-impala-1-0-0
2024/04/03 12:52:22 Unable to remove the container: Error response from daemon: No such container: ancient-impala-1-0-0
2024/04/03 12:52:22 Running container, id=baaf276e0a395f9320d599e84ecbbd9518becd0e712bcba51dbd96b3902802a0
2024/04/03 12:53:11 Container finished successfully, id=baaf276e0a395f9320d599e84ecbbd9518becd0e712bcba51dbd96b3902802a0

Container logs on the cloud instance:

# docker logs baaf276e0a395f9320d599e84ecbbd9518becd0e712bcba51dbd96b3902802a0
/bin/sh: line 1: apt-get: command not found
Fedora 39 - x86_64                               23 MB/s |  89 MB     00:03    
Fedora 39 openh264 (From Cisco) - x86_64        3.7 kB/s | 2.6 kB     00:00    
Fedora 39 - x86_64 - Updates                     20 MB/s |  35 MB     00:01    
Dependencies resolved.
================================================================================
 Package                 Arch        Version                 Repository    Size
================================================================================
Installing:
 openssh-server          x86_64      9.3p1-10.fc39           updates      466 k
Upgrading:
 libblkid                x86_64      2.39.3-6.fc39           updates      117 k
 libmount                x86_64      2.39.3-6.fc39           updates      155 k
 libsmartcols            x86_64      2.39.3-6.fc39           updates       67 k
 libuuid                 x86_64      2.39.3-6.fc39           updates       28 k
 systemd-libs            x86_64      254.10-1.fc39           updates      687 k
 util-linux-core         x86_64      2.39.3-6.fc39           updates      508 k
Installing dependencies:
 dbus                    x86_64      1:1.14.10-1.fc39        fedora       8.1 k
 dbus-broker             x86_64      35-2.fc39               updates      176 k
 dbus-common             noarch      1:1.14.10-1.fc39        fedora        15 k
 device-mapper           x86_64      1.02.197-1.fc39         updates      138 k
 device-mapper-libs      x86_64      1.02.197-1.fc39         updates      176 k
 kmod-libs               x86_64      30-6.fc39               fedora        67 k
 libargon2               x86_64      20190702-3.fc39         fedora        28 k
 libfdisk                x86_64      2.39.3-6.fc39           updates      162 k
 libseccomp              x86_64      2.5.3-6.fc39            fedora        71 k
 libutempter             x86_64      1.2.1-10.fc39           fedora        26 k
 openssh                 x86_64      9.3p1-10.fc39           updates      439 k
 systemd                 x86_64      254.10-1.fc39           updates      4.7 M
 systemd-pam             x86_64      254.10-1.fc39           updates      360 k
 util-linux              x86_64      2.39.3-6.fc39           updates      1.2 M
 xkeyboard-config        noarch      2.40-1.fc39             updates      971 k
Installing weak dependencies:
 cryptsetup-libs         x86_64      2.6.1-3.fc39            fedora       491 k
 diffutils               x86_64      3.10-3.fc39             fedora       398 k
 libbpf                  x86_64      2:1.1.0-4.fc39          fedora       165 k
 libxkbcommon            x86_64      1.6.0-1.fc39            updates      142 k
 qrencode-libs           x86_64      4.1.1-5.fc39            fedora        61 k
 systemd-networkd        x86_64      254.10-1.fc39           updates      647 k
 systemd-resolved        x86_64      254.10-1.fc39           updates      293 k

Transaction Summary
================================================================================
Install  23 Packages
Upgrade   6 Packages

... (cut for brevity) ...                                   

Complete!
sed: can't read /root/.profile: No such file or directory
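
The last log line suggests the container setup runs `sed` against `/root/.profile` unconditionally, which fails on images such as `fedora:39` that do not ship that file. One possible way to make this step tolerant of a missing file is sketched below. This is only an illustration: `patch_profile` is a hypothetical helper, and the `sed` expression is a placeholder, since the actual command dstack runs is not shown in the logs.

```shell
# Hypothetical helper, not dstack's actual code: apply a sed expression to a
# profile file only if that file exists, so a missing profile is not fatal.
patch_profile() {
    profile="$1"   # path to the profile file, e.g. /root/.profile
    expr="$2"      # placeholder sed expression supplied by the caller

    if [ -f "$profile" ]; then
        # Write to a temporary file and move it into place, which avoids
        # relying on the non-portable `sed -i` flag.
        sed "$expr" "$profile" > "$profile.tmp" && mv "$profile.tmp" "$profile"
    else
        # Report the skip on stderr and carry on with a zero exit status.
        echo "skipping $profile: file not found" >&2
    fi
}
```

With a guard like this, a call such as `patch_profile /root/.profile 's/old/new/'` would print a notice and return success on an image without the file, instead of aborting the container with a non-zero exit code.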

dstack version

0.17.0

Server logs

No response

Additional information

No response

peterschmidt85 commented 7 months ago

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.

jvstme commented 6 months ago

Still relevant

peterschmidt85 commented 5 months ago

This issue is stale because it has been open for 30 days with no activity.

peterschmidt85 commented 5 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale. Please reopen the issue if it is still relevant.

jvstme commented 1 month ago

Related to #1535, may be fixed there

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 30 days with no activity.