allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

No console logging with clearml-agent + slurm #1048

Open ngessert opened 1 year ago

ngessert commented 1 year ago

Describe the bug

The ClearML web UI does not show console logs when using clearml-agent to submit a task to a slurm cluster. When manually submitting the task to the slurm cluster, console logging works fine.

This is probably not a bug, but rather a consequence of how clearml-agent and slurm interact.

To reproduce

In my scripts, I have a slurm scheduler that submits jobs to slurm. To enable proper ClearML tracking, I do the following (all in Python):

  1. Initialize a new ClearML Task
  2. Do the slurm job submission (bash scripts are created and executed from Python)
  3. Run this snippet at the end:
    task_id = clearml_task.id
    clearml_task.close()
    Task.get_task(task_id=task_id).mark_started(force=True)
  4. Main python script ends - from here on, other scripts run through slurm

The slurm job consists of several chained sub-jobs. At the start of each subjob, I call this from python:

if not running_remotely():
    Task.debug_simulate_remote_task(task_id=task_id_from_prev)
clearml_task = Task.init(auto_connect_frameworks=False)
run()  # My code execution

Then, anywhere inside the code, I get the task with Task.current_task().

This works fine when I submit the job manually from a bash terminal: all console logs appear in the web UI.

However, when I do this through a clearml-agent, I do not get any console logs. The process with clearml-agent works like this:

  1. Submit task from ClearML web UI to the clearml-agent that should submit the job to slurm
  2. The clearml-agent runs a custom setup script that installs some stuff
  3. The clearml-agent executes the same script that I ran manually from the bash terminal
  4. The execution works fine, the slurm jobs run correctly.

The only issue is that I do not get any logs after the submission to slurm. I can see that slurm does produce the console logs, because they are written to its output files.

As the code snippets show, the only difference between running with and without clearml-agent is the running_remotely() check: under clearml-agent, running_remotely() is already true, so I must not call Task.debug_simulate_remote_task.

Any suggestions on how I could pipe the console logs into the ClearML web UI?
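
One possible workaround, sketched here under assumptions and not confirmed anywhere in this thread: since slurm already writes the console output to files, a sub-job could periodically read the newly appended lines and hand them to a reporting callback. The helper name `forward_new_lines` and its parameters are hypothetical. The callback could, for example, be ClearML's `Logger.report_text`, but whether text reported that way shows up in the web UI in this setup is untested.

```python
from typing import Callable


def forward_new_lines(path: str, offset: int, emit: Callable[[str], None]) -> int:
    """Emit complete lines appended to `path` since byte `offset`.

    Returns the new offset to pass to the next call. Hypothetical helper,
    not part of ClearML or slurm.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    # Forward only complete lines; a trailing partial line is left
    # for the next call, once its newline has been written.
    end = chunk.rfind(b"\n") + 1
    for line in chunk[:end].splitlines():
        emit(line.decode("utf-8", errors="replace"))
    return offset + end


# Example wiring (assumption, needs a running ClearML task):
#   logger = Task.current_task().get_logger()
#   offset = forward_new_lines("slurm-output.txt", offset, logger.report_text)
```

Calling the helper in a loop with a short sleep, and once more when the job finishes, would approximate a `tail -f` on the slurm output file.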

Expected behaviour

Console logs appear in the ClearML web UI when submitting a job to slurm via clearml-agent.

Environment

Related Discussion

There was another thread on slurm+ClearML, not related to clearml-agent: https://github.com/allegroai/clearml/issues/406

jkhenning commented 1 year ago

Hi @ngessert,

The only issue is that I do not get any logs after the submission to slurm. I can see that slurm does produce the console logs, because they are written to its output files.

Maybe the clearml-agent subprocess piping does not work for some reason? What happens if you run a simple test:

import subprocess

with open("outtest.txt", "w") as f:
    subprocess.call(["bash", "-s",  "ls"], stdout=f)

Do you get the output of that command inside outtest.txt?

ngessert commented 1 year ago

Hi @jkhenning,

Adding that snippet to my code results in nothing being written to outtest.txt. However, using

subprocess.call(["ls", "-l"], stdout=f)

correctly wrote the directory listing to the file. I am not sure how to interpret this, though.
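
One possible reading of this result, offered as a sketch and not as a confirmed diagnosis of the clearml-agent setup: with `-s`, bash reads its commands from standard input and treats `ls` only as a positional parameter, so `bash -s ls` never runs `ls`. With `-c`, bash executes the string as a command. The example below compares the two. The file names are illustrative, and `stdin=subprocess.DEVNULL` is an assumption standing in for a non-interactive environment with no input available.

```python
import subprocess

# "bash -s ls": commands are read from stdin; "ls" only becomes $1.
# With an empty stdin, bash has nothing to execute and writes nothing.
with open("outtest_s.txt", "w") as f:
    subprocess.call(["bash", "-s", "ls"], stdin=subprocess.DEVNULL, stdout=f)

# "bash -c ls": the string "ls" is executed as a command.
with open("outtest_c.txt", "w") as f:
    subprocess.call(["bash", "-c", "ls"], stdout=f)

with open("outtest_s.txt") as f:
    s_output = f.read()
with open("outtest_c.txt") as f:
    c_output = f.read()

print("bash -s wrote", len(s_output), "characters")
print("bash -c wrote", len(c_output), "characters")
```

If this explanation holds, the empty outtest.txt says nothing about stdout redirection under clearml-agent, and the successful `["ls", "-l"]` call suggests that plain redirection to a file works there.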