allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

"Monitor metrics" strange behavior: it logs only highest metrics, not all of them #844

Open korotaS opened 1 year ago

korotaS commented 1 year ago

Hello! I've been using ClearML pipelines for training some nets and I've stumbled upon a strange behavior. Let's say I have a simple pipeline from one function step:

from clearml import PipelineController

def function_step():
    from time import sleep
    from clearml import Task

    logger = Task.current_task().get_logger()
    for i in range(100, 0, -1):
        # iteration goes like [0, 1, 2, 3, 4, 5]
        # value goes like [100, 99, 98, 97, 96]
        logger.report_scalar('metrics', 'metric1', i, 100-i)
        sleep(5)

pipe = PipelineController(
    project='project',
    name='name',
    version='0.0.1',
)
pipe.add_function_step(
    name="function",
    function=function_step,
    monitor_metrics=[("metrics", "metric1")]
)
pipe.start_locally(run_pipeline_steps_locally=False)

I also want to monitor one metric, which I pass via the monitor_metrics parameter. After the pipeline executes, all I see in the pipeline's scalars is a single point with iteration=0 and value=100 instead of a line of 100 points (which are present in the step task's scalars).
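For completeness, this is how I check that all 100 points do reach the step task's scalars (a quick sketch; the task ID is just a placeholder copied from the web UI):

from clearml import Task

step_task = Task.get_task(task_id='<step_task_id>')  # placeholder: ID of the step's task
scalars = step_task.get_reported_scalars(x_axis='iter')
series = scalars['metrics']['metric1']
print(len(series['x']), len(series['y']))  # all 100 points are there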

I've found the piece of code that monitors metrics from the task and logs them to the pipeline, and here is how it looks: https://github.com/allegroai/clearml/blob/52a592c4bfa0f39367c0fa576f2f0817555fb96c/clearml/automation/controller.py#L2494-L2510
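In case the link is inconvenient, this is roughly what that block does (condensed and paraphrased, not an exact copy):

# inside the controller's node-monitoring loop (paraphrased)
if node.monitor_metrics:
    metrics_state = self._monitored_nodes[node.name].get('metrics', {})
    logger = self._task.get_logger()
    scalars = task.get_reported_scalars(x_axis='iter')
    for (s_title, s_series), (t_title, t_series) in node.monitor_metrics:
        values = scalars.get(s_title, {}).get(s_series)
        if values and values.get('x') is not None and values.get('y') is not None:
            # only the last reported point is considered ...
            x = values['x'][-1]
            y = values['y'][-1]
            last_y = metrics_state.get(s_title, {}).get(s_series)
            # ... and it is forwarded only if its value is higher than the saved one
            if last_y is None or y > last_y:
                logger.report_scalar(title=t_title, series=t_series, value=y, iteration=int(x))
                last_y = y
            if not metrics_state.get(s_title):
                metrics_state[s_title] = {}
            metrics_state[s_title][s_series] = last_y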

The behavior is kind of strange. Suppose we have _monitor_node_interval=30. Every 30 seconds this code takes only the last point from the scalars (what about the other data points?), checks whether its value is greater than the last saved value, and only then logs it via the pipeline logger. So I have 2 questions:

  1. Is it OK that we compare y and last_y (current value vs. highest value), but not x and last_x (current iteration vs. last iteration)?
  2. Is it OK that we skip a lot of data points and log only the last one? If _monitor_node_interval is left at its default (5*60), then, say, 100 points are logged to the task's scalars while only one reaches the pipeline's scalars.

I believe both of these issues can be solved quickly. In fact, I've already patched that part with my own code: it tracks the last reported iteration (last_x) per series instead of the best value, so every newer point gets forwarded. It looks like this:

...
# update the metrics
if node.monitor_metrics:
    metrics_state = self._monitored_nodes[node.name].get('metrics', {})
    logger = self._task.get_logger()
    scalars = task.get_reported_scalars(x_axis='iter')
    for (s_title, s_series), (t_title, t_series) in node.monitor_metrics:
        values = scalars.get(s_title, {}).get(s_series)
        if values and values.get('x') is not None and values.get('y') is not None:
            xs = values['x']  # take all Xs
            ys = values['y']  # take all Ys
            last_x = metrics_state.get(s_title, {}).get(s_series)
            for x, y in zip(xs, ys):
                if last_x is None or x > last_x:
                    # log all data points which have iteration higher than last_x
                    logger.report_scalar(title=t_title, series=t_series, value=y, iteration=int(x))
            last_x = xs[-1]  # save last logged iteration
            if not metrics_state.get(s_title):
                metrics_state[s_title] = {}
            metrics_state[s_title][s_series] = last_x
...

I would be glad to make a PR if this is a valid issue, otherwise let's discuss it.

erezalg commented 1 year ago

Hi @korotaS,

Thanks for raising this and suggesting a fix, and my apologies for the late reply.

Before we dive into your solution, I'd like to understand what the desired behavior is from your point of view:

  1. Would you like to get a graph of the metric over time, or is only the "best" data point enough?
  2. Would you need metrics that support saving the "highest" (like accuracy), the "lowest" (like loss) AND the last value? Meaning, when defining the metric to monitor, you would also specify the desired trend?
  3. In your use case, do you have a few pipeline steps that would report this metric? Or only a single one?

Thanks!

korotaS commented 1 year ago

Hi @erezalg, thanks for the reply!

  1. In my opinion, the default behavior should be that all data points of every metric listed in monitor_metrics show up as a graph in the pipeline task's scalars tab, exactly as they appear in each step task's scalars tab. On the other hand, I'm slightly concerned that dumping all data points into the pipeline's scalars could cause performance issues when there are thousands of points, so this behavior should be tested (I didn't face this problem in my use case, though).
  2. I think it is a good idea to let the user choose the desired trend - for example, add another (optional) element to each tuple in the monitor_metrics list with a string like 'highest', 'lowest', etc. (or, as in the PyTorch Lightning checkpoint callback, a mode parameter with possible values max or min); see the sketch at the end of this comment. But again, we could avoid all of that simply by dumping every data point rather than only a monotonically increasing/decreasing subsequence.
  3. I have a few pipeline steps that report the same metric. Concretely, in my use case I use monitor_metrics to compare the same metric between two steps, which is very easy to do in the pipeline's scalars tab (I check whether the loss and fscore stay good in the training first step and the QAT second step, to verify that my quantization algorithm works correctly).
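
To make item 2 concrete, here is roughly the API I had in mind; the third tuple element ('max'/'min'/'all') is hypothetical and does not exist in the current API:

# hypothetical extension: an optional third element selects which trend to keep
pipe.add_function_step(
    name="function",
    function=function_step,
    monitor_metrics=[
        ("metrics", "fscore", "max"),   # keep only the highest value seen so far
        ("metrics", "loss", "min"),     # keep only the lowest value seen so far
        ("metrics", "metric1", "all"),  # forward every reported data point
    ],
)
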
Ruhrozz commented 2 hours ago

Same problem here, any updates?