Open lolpa1n opened 1 month ago
@lolpa1n As you'll see in this example, the queue specified in `PipelineController.start` is the one through which the controller itself will be executed. The queue through which the pipeline steps will be executed is controlled through `PipelineController.set_default_execution_queue`.
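For reference, a minimal sketch of that split (project, task, and queue names below are only placeholders, not anything from this thread):

```python
from clearml import PipelineController

pipe = PipelineController(
    name="my_pipeline",        # placeholder names
    project="my_project",
    version="1.0.0",
)

# Steps (the actual work) are enqueued here unless a step overrides it
pipe.set_default_execution_queue("test_gpu")

# Example step cloned from an existing task; the task names are assumptions
pipe.add_step(
    name="stage_train",
    base_task_project="my_project",
    base_task_name="train_task",
)

# The controller itself (lightweight orchestration) runs on this queue
pipe.start(queue="services")
```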
I added `pipe.set_default_execution_queue(default_execution_queue='test_gpu')`, but nothing happens. Could this be because I'm running on the same machine?
Hello, I have a similar problem. I run pipelines in remote execution mode: the pipeline gets into the queue and the agent starts the container. Sometimes the first stage works but the second stage never starts (it just shows "Launching step..."), and sometimes even the first stage doesn't launch: the "Launching step [stage name]" message hangs without any errors. When run_locally() is used, the pipeline runs to completion.
@lolpa1n Sounds like the issue is rather that you are using the same agent (you can deploy multiple agents on the same machine): it can't take care of the steps because it's busy handling the pipeline controller, which in turn is waiting for the steps to complete.
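In practice that means one agent per queue, even on a single machine; a sketch of such a layout (the `services` queue name is just an assumption, use whatever queue the controller is started on):

```bash
# Agent 1: serves the queue the pipeline controller is started on (CPU is enough)
clearml-agent daemon --detached --queue services --cpu-only

# Agent 2: serves the queue the steps are sent to via set_default_execution_queue
clearml-agent daemon --detached --queue test_gpu --gpus 0
```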
@suparshukov Not sure the same applies to your use case. I think you'll need to take a look at the logs inside the container that appears not to be doing anything, to better isolate the issue.
Experimenting with steps from functions and the draft=True option, I'm getting the same result: the first step of my pipeline just hangs indefinitely. I have 2 clearml agents, each with its own queue: one is --cpu-only and handles the pipeline controller, and the other one uses the GPU and its queue is set as the default execution queue.
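(For reference, a minimal sketch of function-based steps with a per-step queue override, assuming `PipelineController.add_function_step`; the function, project, and queue names are placeholders, not the actual code from this setup:)

```python
from clearml import PipelineController

def preprocess(sample_count: int):
    # placeholder step body
    return sample_count * 2

pipe = PipelineController(name="etl_pipeline", project="etl", version="1.0.0")

# Queue the GPU agent listens on; used by any step that doesn't set its own queue
pipe.set_default_execution_queue("gpu_queue")

pipe.add_function_step(
    name="preprocess",
    function=preprocess,
    function_kwargs=dict(sample_count=100),
    function_return=["doubled_count"],
    execution_queue="gpu_queue",   # optional per-step override
)

# Controller goes to the CPU-only agent's queue
pipe.start(queue="cpu_queue")
```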
Also, I'm using a freshly built Docker image that I didn't push to our artifactory. I set it in the ClearML UI, and during execution it said:
Error response from daemon: pull access denied for clearml_worker_etl_test, repository does not exist or may require 'docker login': denied: requested access to the resource is denied
If it tried to pull the image from the artifactory, couldn't find it, and then went ahead with the execution anyway, where the heck does it execute the pipeline?
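One way to narrow that down (assuming the agent runs in Docker mode on the same host where the image was built) is to check on the agent machine whether the locally built image is present and whether a container was actually started from it; if the image exists locally, the run can likely still use that local copy even though the pull from the registry failed:

```bash
# On the host where the clearml-agent (docker mode) is running:
docker images clearml_worker_etl_test   # is the locally built image present?
docker ps                               # is a container from that image running the step?
```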
Hi @kiranzo, can you include the complete log?
Hello,
I deployed a ClearML server on my machine and wanted to build pipelines.
my code:
If I execute `pipe.start_locally(run_pipeline_steps_locally=True)`, everything works. But if I change it to `pipe.start(queue='test_gpu')` and then run `clearml-agent daemon --detached --queue test_gpu --gpus 0`, nothing happens and the task just stays in the green QUEUED status. Logs:
Please tell me how to do this correctly, for example if I want to select a specific GPU for the run.
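Regarding selecting a specific GPU: the agent's --gpus flag limits which device(s) the tasks it runs can see, so a common layout is one agent per queue, each pinned to its own GPU (the second queue name below is just a placeholder):

```bash
# Agent serving 'test_gpu', restricted to GPU 0 (as in the command above)
clearml-agent daemon --detached --queue test_gpu --gpus 0

# A second agent on the same machine, serving another queue and pinned to GPU 1
clearml-agent daemon --detached --queue test_gpu_1 --gpus 1
```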