allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Services queue no longer working #1252

Closed. katrinarobinson2000 closed this issue 2 months ago.

katrinarobinson2000 commented 2 months ago

Describe the bug

Whenever I try to run a task on the Services queue, it fails with the error "/usr/bin/python3.8: No module named virtualenv". I have tried adding different workers to the queue, but I get this error regardless of the worker. When I try those same workers on other queues they work, which indicates the problem is specific to the Services queue. I have tried the default docker image as well as other docker images that work fine on other queues.

To reproduce

1) Add a worker to the Services queue
2) Run a task on the Services queue (see the sketch below)
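
For reference, a rough sketch of these two steps with the standard ClearML CLI (the agent flags, project name, task name, and script below are illustrative, not copied from my actual setup):

# 1) Attach a worker to the Services queue, running tasks inside docker
clearml-agent daemon --queue services --docker nvidia/cuda:11.8.0-base-ubuntu20.04 --detached

# 2) Enqueue any task on the Services queue, e.g. with clearml-task
clearml-task --project examples --name services-repro --script some_task.py --queue services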

Expected behaviour

The task should have run successfully, like it does with other queues.

Environment

jkhenning commented 2 months ago

Hi @katrinarobinson2000, can you include the full task log? What is the docker image you're trying to run the task with?

katrinarobinson2000 commented 2 months ago

I have tried multiple docker images, including the default nvidia/cuda:11.8.0-base-ubuntu20.04 image. These images work on other queues, so I don't think the image is the problem. Full task log:

1714082982808 training-02:cpu:8 INFO task ef68556dfb1447b296f8df7162010ae9 pulled from a5f9687681084ae59b27ffd3f4b77d77 by worker training-02:cpu:8

1714082987913 training-02:cpu:8 DEBUG Running task 'ef68556dfb1447b296f8df7162010ae9'

1714082988799 training-02:cpu:8:service:ef68556dfb1447b296f8df7162010ae9 DEBUG Process failed, exit code 1
1714082988840 training-02:cpu:8:service:ef68556dfb1447b296f8df7162010ae9 DEBUG Current configuration (clearml_agent v1.5.2, location: /tmp/.clearml_agent.4wvyqfny.cfg):
----------------------
api.version = 1.5
api.verify_certificate = true
api.default_version = 1.5
api.http.max_req_size = 15728640
api.http.retries.total = 240
api.http.retries.connect = 240
api.http.retries.read = 240
api.http.retries.redirect = 240
api.http.retries.status = 240
api.http.retries.backoff_factor = 1.0
api.http.retries.backoff_max = 120.0
api.http.wait_on_maintenance_forever = true
api.http.pool_maxsize = 512
api.http.pool_connections = 512
api.api_server = http://192.168.128.212:8008
api.web_server = http://192.168.128.212:8080
api.files_server = http://192.168.128.212:8081
api.credentials.access_key = R1GO2GQ2R95KLTM5OXH3

agent.worker_id = training-02:cpu:8:service:ef68556dfb1447b296f8df7162010ae9
agent.worker_name = training-02
agent.force_git_ssh_protocol = true
agent.python_binary = 
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < '3.10'
agent.package_manager.pip_version.1 = <22.3 ; python_version >\= '3.10'
agent.package_manager.system_site_packages = true
agent.package_manager.force_upgrade = false
agent.package_manager.conda_channels.0 = pytorch
agent.package_manager.conda_channels.1 = conda-forge
agent.package_manager.conda_channels.2 = defaults
agent.package_manager.conda_channels.3 = nvidia
agent.package_manager.priority_optional_packages.0 = pygobject
agent.package_manager.torch_nightly = false
agent.package_manager.poetry_files_from_repo_working_dir = false
agent.package_manager.force_repo_requirements_txt = true
agent.package_manager.priority_packages.0 = opencv-python-headless
agent.venvs_dir = /opt/clearml/venvs-builds.8.9
agent.venvs_cache.max_entries = 10
agent.venvs_cache.free_space_threshold_gb = 2.0
agent.venvs_cache.path = /opt/clearml/venvs-cache
agent.vcs_cache.enabled = true
agent.vcs_cache.path = /opt/clearml/vcs-cache
agent.venv_update.enabled = false
agent.pip_download_cache.enabled = true
agent.pip_download_cache.path = /opt/clearml/pip-download-cache
agent.translate_ssh = true
agent.reload_config = false
agent.docker_pip_cache = /opt/clearml/pip-cache
agent.docker_apt_cache = /opt/clearml/apt-cache.8.9
agent.docker_force_pull = true
agent.default_docker.image = nvidia/cuda:11.8.0-base-ubuntu20.04
agent.enable_task_env = false
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.parse_embedded_urls = true
agent.abort_callback_max_timeout = 1800
agent.docker_internal_mounts.sdk_cache = /opt/clearml/sdk-cache
agent.docker_internal_mounts.apt_cache = /opt/clearml/apt-cache
agent.docker_internal_mounts.ssh_folder = /root/.ssh
agent.docker_internal_mounts.ssh_ro_folder = /root/.ssh
agent.docker_internal_mounts.pip_cache = /opt/clearml/cache/pip-cache
agent.docker_internal_mounts.poetry_cache = /opt/cache/pypoetry
agent.docker_internal_mounts.vcs_cache = /opt/clearml/vcs-cache
agent.docker_internal_mounts.venv_build = /opt/clearml/venvs-builds
agent.docker_internal_mounts.pip_download = /opt/clearml/pip-download-cache
agent.apply_environment = true
agent.apply_files = true
agent.custom_build_script = 
agent.disable_task_docker_override = false
agent.git_user = 
agent.docker_use_activated_venv = true
agent.disable_ssh_mount = false
agent.docker_install_opencv_libs = false
agent.default_python = 3.8
agent.cuda_version = 0
agent.cudnn_version = 0
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes = 10GB
sdk.storage.direct_access.0.url = file://*
sdk.metrics.file_history_size = 100
sdk.metrics.matplotlib_untitled_history_size = 100
sdk.metrics.images.format = JPEG
sdk.metrics.images.quality = 87
sdk.metrics.images.subsampling = 0
sdk.metrics.tensorboard_single_series_per_graph = true
sdk.network.metrics.file_upload_threads = 4
sdk.network.metrics.file_upload_starvation_warning_sec = 120
sdk.network.iteration.max_retries_on_server_error = 5
sdk.network.iteration.retry_backoff_factor_sec = 10
sdk.aws.s3.key = 
sdk.aws.s3.region = 
sdk.aws.boto3.pool_connections = 512
sdk.aws.boto3.max_multipart_concurrency = 16
sdk.log.null_log_propagate = false
sdk.log.task_log_buffer_capacity = 66
sdk.log.disable_urllib3_info = true
sdk.development.task_reuse_time_window_in_hours = 72.0
sdk.development.vcs_repo_detect_async = true
sdk.development.store_uncommitted_code_diff = true
sdk.development.support_stopping = true

sdk.development.force_analyze_entire_repo = false
sdk.development.suppress_update_message = false
sdk.development.detect_with_pip_freeze = false
sdk.development.worker.report_period_sec = 2
sdk.development.worker.ping_period_sec = 30
sdk.development.worker.log_stdout = true
sdk.development.worker.report_global_mem_used = false

Executing task id [ef68556dfb1447b296f8df7162010ae9]:
repository = git@gitlab.********
branch = 
version_num = eb2fa20d506be3e70092455705ec0e8e1350e816
tag = 
docker_cmd = 
entry_point = trigger_export_detection.py
working_dir = utils

[package_manager.force_repo_requirements_txt=true] Skipping requirements, using repository "requirements.txt" 

/usr/bin/python3.8: No module named virtualenv

clearml_agent: ERROR: Command '['python3.8', '-m', 'virtualenv', '/opt/clearml/venvs-builds.8.9/3.8', '--system-site-packages']' returned non-zero exit status 1.
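
For what it's worth, the failing step is the virtualenv invocation shown in the last line. A minimal way to check it (and, assuming the module is simply missing, a possible workaround) on whatever host or container the agent executes the task in:

# Check whether the interpreter the agent picked can load virtualenv
/usr/bin/python3.8 -m virtualenv --version

# If the module is missing, installing it for that interpreter is one possible fix
/usr/bin/python3.8 -m pip install virtualenv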
jkhenning commented 2 months ago

The log says the worker running the task is training-02:cpu:8, not the services worker?

katrinarobinson2000 commented 2 months ago

training-02:cpu:8 is a worker I assigned to the Services queue. When the task starts, the top of the console shows Hostname: training-02:cpu:8:4:service:aea1e678da314bac972abc1f4294de68, but once the task fails it changes back to Hostname: training-02:cpu:8.