determined-ai / determined

Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, and resource management. Works with PyTorch and TensorFlow.
https://determined.ai
Apache License 2.0

🐛[bug] pulling container image: error parsing image name #9553

Open KyanChen opened 2 months ago

KyanChen commented 2 months ago

### Describe the bug

[89ab6926] crashed: task failed without an associated exit code: pulling container image: error parsing image name /mnt/jfs/singularity_image_root/determinedai/environments:cuda-11.8-pytorch-2.0-gpu-mpi-0.31.1: invalid reference format

[2024-06-23 11:01:33] || ERROR: Trial 10 (Experiment 10) was terminated: allocation failed: task failed without an associated exit code: pulling container image: error parsing image name /mnt/jfs/singularity_image_root/determinedai/environments:cuda-11.8-pytorch-2.0-gpu-mpi-0.31.1: invalid reference format

### Reproduction Steps

Use a local image.

### Expected Behavior

Using a local image succeeds.

### Screenshot

Same two log lines as under "Describe the bug".

### Environment

- Device or hardware: [e.g. iPhone6, Nvidia A100]
- OS: [e.g. iOS]
- Browser [e.g. chrome, safari]
- Version [e.g. 22]

### Additional Context

_No response_
KyanChen commented 2 months ago

[image] [image]

KyanChen commented 2 months ago

[image]
ioga commented 2 months ago

hello,

environment.image is not a file path; it's an image reference. to load the image from local disk with singularity, see image_root: https://docs.determined.ai/latest/reference/deploy/agent-config-reference.html#image-root
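
To make the path-versus-reference distinction concrete, here is a minimal Python sketch. It assumes that `/mnt/jfs/singularity_image_root` (taken from the error message) is the directory meant to serve as `image_root`; that is a guess about the reporter's setup, not something confirmed in the thread. The helper names are hypothetical and are not part of Determined.

```python
def looks_like_file_path(value: str) -> bool:
    """Rough check: image references do not begin with '/', file paths do."""
    return value.startswith("/")


def strip_image_root(value: str, image_root: str) -> str:
    """Remove an assumed image_root prefix, leaving only the image reference.

    Hypothetical helper for illustration; returns the value unchanged
    if it does not start with the given root directory.
    """
    root = image_root.rstrip("/") + "/"
    if value.startswith(root):
        return value[len(root):]
    return value


# Value taken from the error message in the report.
reported = (
    "/mnt/jfs/singularity_image_root/"
    "determinedai/environments:cuda-11.8-pytorch-2.0-gpu-mpi-0.31.1"
)
# Assumed image_root directory (an inference from the path above).
assumed_root = "/mnt/jfs/singularity_image_root"

reference = strip_image_root(reported, assumed_root)
print(looks_like_file_path(reported))   # the reported value is a path
print(looks_like_file_path(reference))  # the stripped value is not
print(reference)
```

Under that assumption, the stripped value is what would go in environment.image, while the directory would be configured separately through image_root as described in the linked agent config reference.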

also fyi, apptainer/singularity support in the agent-based cluster is deprecated and scheduled for removal. if you are an enterprise user looking to use this feature, please let us know.