Open abelBEDOYA opened 1 year ago
Hi there!
Thanks for filing a bug report :)
First of all, I would never recommend calling model.train in an API route (even if it is never actually executed because of the execute_remotely call). Doing so gives your API heavy ML dependencies such as Torch that it does not need!
What seems to happen is that when executing remotely, ClearML will try to recognize and interpret your local environment, so it can be recreated on the remote machine. But since you're running it from the API, all the API requirements are also detected as dependencies of the task itself :)
That said, I think I see what you're trying to do. I would go for the following flow instead:
1) Have a "template" training task ready to go, that you know will run remotely (let me know how that goes using YOLOv8!). You run this template task once with some random parameters, just to verify that it works. This part has nothing to do with the API
2) Inside the API, you can now clone this task as such: https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk#cloning--executing-tasks
3) The cloned task is now in draft mode, so you can override its original parameters using: https://clear.ml/docs/latest/docs/references/sdk/task/#update_parameters
4) Now enqueue the task using the Task.enqueue function as described in the docs link of 2)
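Steps 2) to 4) could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the helper name launch_from_template, the "(API launch)" name suffix, and the task_api argument (there only so the function can be exercised without a ClearML server) are assumptions for illustration. Task.get_task, Task.clone, update_parameters and Task.enqueue are the SDK calls referred to in the linked docs.

```python
from typing import Any, Dict, Optional


def launch_from_template(
    template_task_id: str,
    overrides: Dict[str, Any],
    queue_name: str,
    task_api: Optional[Any] = None,
) -> str:
    """Clone a template task, override its parameters, enqueue it, return the new task ID."""
    if task_api is None:
        # Imported lazily so the API process only needs clearml, not Torch or YOLO.
        from clearml import Task as task_api

    # Step 2: clone the template task; the clone starts out in draft mode.
    template = task_api.get_task(task_id=template_task_id)
    cloned = task_api.clone(source_task=template, name=f"{template.name} (API launch)")

    # Step 3: override the parameters recorded on the template.
    # The keys (e.g. "General/epochs") are hypothetical; they depend on
    # how the template task logged its hyperparameters.
    cloned.update_parameters(overrides)

    # Step 4: put the clone in the queue that the agent is listening on.
    task_api.enqueue(cloned, queue_name=queue_name)
    return cloned.id
```

An API route would then only call something like launch_from_template(template_id, {"General/epochs": 50}, "cola_yolo8") and return the resulting task ID, without importing any training code itself.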
Now your task should be in the queue, the API can return successfully, and a worker can start working on it. You'll need a second API endpoint that a client can poll (every minute, for example) to get the status. You can of course also just return the task_id from the first endpoint when the task is created, so a client can get updates by querying the ClearML server directly.
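The status lookup behind such a polling endpoint might look like the sketch below. The helper name get_training_status and the task_api argument (again only there to make the function testable without a server) are assumptions; Task.get_task and get_status are existing SDK calls. The exact status strings returned are whatever the ClearML server reports.

```python
from typing import Any, Dict, Optional


def get_training_status(task_id: str, task_api: Optional[Any] = None) -> Dict[str, str]:
    """Return the current ClearML status of a task as a small JSON-friendly dict."""
    if task_api is None:
        # Lazy import keeps this module importable without clearml installed.
        from clearml import Task as task_api

    task = task_api.get_task(task_id=task_id)
    # get_status() asks the server for the task's current state.
    return {"task_id": task_id, "status": str(task.get_status())}
```

A second API route can return this dict directly, and the client decides how often to poll.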
Does this help?
@thepycoder Good idea! But how can I get that "first" training that will be used as a template? I have carried out local YOLO trainings (working correctly without the API) but they don't work as a template :( When I try to clone and run them in my agent queue from app.clearml, the console says some files are not found, the train.yaml for example. (check the screenshot)
Thanks for the quick reply!
Like you were thinking yourself, the agent indeed needs to be able to access your train.yaml file on its local filesystem. You have multiple options to get it there, here are some:
- Add the train.yaml file to the git repo. The agent will pull the repo and it should find the file!
- Upload it with clearml-data so you can use Dataset.get().get_local_copy() to get a local copy

In this way, the agent is no different than e.g. a Colab instance: you'll have to give it a way to access your files :)
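For the clearml-data option, the training script could resolve the file along these lines. This is a sketch under assumptions: the helper name fetch_config_from_dataset, the dataset project and name passed to it, and the dataset_api argument (used only for testing without a server) are hypothetical. It also assumes train.yaml sits at the root of the dataset.

```python
from pathlib import Path
from typing import Any, Optional


def fetch_config_from_dataset(
    dataset_project: str,
    dataset_name: str,
    filename: str = "train.yaml",
    dataset_api: Optional[Any] = None,
) -> Path:
    """Download (or reuse the cached copy of) a ClearML dataset and return the path to a file in it."""
    if dataset_api is None:
        # Lazy import keeps this module importable without clearml installed.
        from clearml import Dataset as dataset_api

    # get_local_copy() returns the folder holding a local copy of the dataset.
    dataset = dataset_api.get(dataset_project=dataset_project, dataset_name=dataset_name)
    local_root = Path(dataset.get_local_copy())

    config_path = local_root / filename
    if not config_path.is_file():
        raise FileNotFoundError(f"{filename} not found in dataset copy at {local_root}")
    return config_path
```

The returned path can then be handed to the training call in place of a hard-coded local path, so the same script works both locally and on the agent.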
Hi, I've been using ClearML to track my YOLOv8 trainings, which were carried out locally. To do so, I built an API with FastAPI that can launch a training with custom hyperparameters.
Now I'm trying to do the same remotely, so I set up an agent and a queue (cola_yolo8) served by that agent. However, if I add execute_remotely() in the API training method:
This error shows up in the ClearML task console:
I run this command to run the training API:
uvicorn main:app --host 0.0.0.0 --port 3000 --reload
I've tried simpler tasks instead of yolov8 training and the error is the same.
Thanks!