allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Pipeline example does not work #1244

Closed jingli-wtbox closed 2 months ago

jingli-wtbox commented 2 months ago

Describe the bug


If I comment out the line below (line 66) in examples/pipeline/pipeline_from_tasks.py, the pipeline is uploaded from my local PC to the clearml-server, but it stops after running step 1, and nothing happens after that even if I wait a long time.

pipe.start_locally()

Please refer to the image below: Screenshot 2024-04-12 at 1 56 26 PM

If I uncomment the line above, everything is fine: all steps run and the status changes to completed.

Screenshot 2024-04-12 at 1 57 20 PM

To reproduce


Step 1: Set up clearml-server on one AWS EC2 instance and configure the server.

Step 2: Set up clearml-agent on another EC2 instance and start an agent listening to one queue:

clearml-agent daemon --queue default

Step 3: Copy the examples to the local PC and configure the local PC with the configuration file ~/clearml.conf. Open the examples in VSCode and run the example:

python pipelines/pipeline_from_tasks.py

with line 66 left at its default (commented out): #pipe.start_locally()

Step 4: Uncomment line 66 so that it reads

pipe.start_locally()

and run the example again:

python pipelines/pipeline_from_tasks.py

The issue can be reproduced by following the steps above.

Expected behaviour


Environment

containers:

jingli-wtbox commented 2 months ago

CLEARML-AGENT version 1.8.0

jingli-wtbox commented 2 months ago

Agent console logs:

Screenshot 2024-04-12 at 2 36 50 PM

ainoam commented 2 months ago

@jingli-wtbox Note that the pipeline controller and steps do not necessarily share the same execution queue: Whereas the example sets the step queue to "default" (which you have set up with an agent to listen to), it leaves the pipeline queue for remote execution with its default value ("services") - Is your server set up with a serviced "services" queue?

jingli-wtbox commented 2 months ago

Hi @ainoam, thank you so much for the reply.

As pointed out in step 2, I set up an agent to listen to "default". Sorry for forgetting to mention that I changed line 69 in the file "examples/pipeline/pipeline_from_tasks.py" to

pipe.start(queue='default')

If I set up the agent to listen to "default" and change the execution queue name to "default", does that mean both the steps and the pipeline can be executed on the agent machine?

My use case is as follows:

I want to deploy the ClearML server to an AWS EC2 instance (Sydney region) and run the agent on an AWS EC2 instance with an Nvidia A100 (US region), so that I can train or fine-tune LLMs such as Llama 2 13B. Apart from setting up the agent, am I missing anything?

Thank you.

ainoam commented 2 months ago

Seems like you found your issue @jingli-wtbox - If you're running both pipeline and steps on the same queue, you've created an execution lock as the agent is waiting for the controller to finish to service the next task in your queue, and the controller is waiting for the steps to finish.
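The lock described above can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not ClearML code: each agent runs exactly one task at a time, the controller launches its steps one after another, and every step takes one tick. The function `simulate` and the agent names are hypothetical.

```python
from collections import deque


def simulate(controller_queue, step_queue, agents, steps=3, max_ticks=20):
    """Toy model of queue-based execution (not ClearML itself).

    agents maps an agent name to the single queue it listens to.
    Returns True if the pipeline completes, False if it stalls.
    """
    queues = {q: deque() for q in set(agents.values()) | {controller_queue, step_queue}}
    queues[controller_queue].append("controller")
    running = {name: None for name in agents}  # each agent holds at most one task
    launched = done = 0
    step_in_flight = False

    for _ in range(max_ticks):
        # Idle agents pull the next task from the queue they listen to.
        for name, queue_name in agents.items():
            if running[name] is None and queues[queue_name]:
                running[name] = queues[queue_name].popleft()

        # Advance whatever each agent is currently running.
        for name, task in running.items():
            if task == "controller":
                if done == steps:
                    return True  # all steps finished, the controller can exit
                if not step_in_flight and launched < steps:
                    launched += 1
                    queues[step_queue].append(f"step_{launched}")
                    step_in_flight = True
                # Otherwise the controller keeps occupying its agent and waits.
            elif task is not None:
                done += 1  # a step takes one tick in this model
                step_in_flight = False
                running[name] = None

    return False  # the pipeline never completed within max_ticks


# One agent, controller and steps on the same queue: the controller occupies
# the only agent, so the first step stays queued and the pipeline stalls.
print(simulate("default", "default", {"agent-1": "default"}))

# Controller on its own queue with its own agent: the pipeline completes.
print(simulate("services", "default", {"agent-1": "default", "agent-2": "services"}))
```

In this model a pipeline also stalls when no agent listens to the controller queue at all, and it completes when two agents share one queue, which matches the point that the outcome depends on how the agents are deployed.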

Since, as in your case, the controller and step resource requirements are usually considerably different, it makes more sense to run them on different machines. ClearML's default services queue is intended for that purpose: it is serviced by an agent running in services mode, so that multiple pipelines can run at the same time.

So, sounds like what you need to do is make sure your pipeline controller is not blocking step execution by running it on an independent queue (and agent), such as the services queue.
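A minimal sketch of that setup, loosely based on examples/pipeline/pipeline_from_tasks.py, might look like the following. The project, pipeline, and base task names are assumptions taken from the repository example and may differ in your checkout; parameter overrides are omitted for brevity, and `run_pipeline` is a hypothetical wrapper.

```python
# Queue names used in this thread: "services" is the default controller queue,
# "default" is the queue the step agent was started on.
CONTROLLER_QUEUE = "services"
STEP_QUEUE = "default"


def run_pipeline():
    # Imported inside the function so the sketch loads without clearml installed.
    from clearml.automation import PipelineController

    pipe = PipelineController(name="Pipeline demo", project="examples", version="0.0.1")

    # Steps are enqueued on STEP_QUEUE and picked up by the agent listening to it.
    pipe.set_default_execution_queue(STEP_QUEUE)

    # Assumed base task names; they must already exist on the server.
    pipe.add_step(
        name="stage_data",
        base_task_project="examples",
        base_task_name="Pipeline step 1 dataset artifact",
    )
    pipe.add_step(
        name="stage_process",
        parents=["stage_data"],
        base_task_project="examples",
        base_task_name="Pipeline step 2 process dataset",
    )

    # The controller is enqueued on a different queue than the steps,
    # so it does not occupy the agent that has to run them.
    pipe.start(queue=CONTROLLER_QUEUE)
```

Calling `run_pipeline()` only helps if some agent actually listens to the controller queue, for example a second agent started like the one in step 2 but with `--queue services`. The agent CLI also has a `--services-mode` flag for the services mode mentioned above; its exact requirements are best checked in the agent documentation.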

jingli-wtbox commented 2 months ago

Does that mean that, for all examples in the folder examples/pipeline, all components, decorators, and functions must be executed on different machines from the pipeline?

ainoam commented 2 months ago

@jingli-wtbox Pipeline steps should use a different ClearML agent than the pipeline controller (you can deploy the agents in whichever way makes the most sense to you).

jingli-wtbox commented 2 months ago

Hi @ainoam, thank you for pointing out that the pipeline steps and the pipeline controller need different agents. It gives me a clearer understanding of ClearML components.

Is this requirement listed anywhere in the documentation, or did I just not find it? I think it's very important information for my use case. Not knowing that the pipeline controller and pipeline steps need different agents, I spent 3 days finding out why the pipeline step was blocked.

ainoam commented 2 months ago

@jingli-wtbox It's not an explicit requirement as such, but more a result of how you deploy your agents (since in the basic use-case, an agent will execute tasks one at a time). We can definitely add some comments to that effect.

jingli-wtbox commented 2 months ago

Hello @ainoam. I greatly appreciate your response. Your insights have deepened my understanding of ClearML. I plan to delve deeper into its powerful features and integrate ClearML into our project development.

I particularly admire the ClearML Agent feature, which empowers me to leverage any machine (with GPU) for our model training.