allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.69k stars 655 forks

Darts Integration #974

Open MightyGoldenJA opened 1 year ago

MightyGoldenJA commented 1 year ago

Proposal Summary

Add specific Darts Time-Series Forecasting Library integration.

Motivation

Even though Darts is built on top of libraries supported by ClearML, the monitor fails to correctly capture the scalars (at least with Temporal Fusion Transformers).

Related Discussion

https://clearml.slack.com/archives/CTK20V944/p1680610946630759

ainoam commented 1 year ago

Thanks for suggesting @MightyGoldenJA,

As you note, ClearML should pick up on the underlying logging, e.g. any snapshots your training saves or metrics logged to TensorBoard.

How did you log your scalars when training the Darts TFT model?

MightyGoldenJA commented 1 year ago

@ainoam I did not manage to capture scalars; I just adjusted the log frequency to be able to follow progress on the console.

ainoam commented 1 year ago

Not sure I follow @MightyGoldenJA - Are we speaking only about the console log here? Do you mean that unless you modify the logging frequency, console outputs appear but are not captured by ClearML?

MightyGoldenJA commented 1 year ago

@ainoam I meant that I managed to get the console log, but I did not manage to capture scalar metrics like loss, val_loss, etc.

ainoam commented 1 year ago

@MightyGoldenJA How did you log your scalars? Report to Tensorboard?

ainoam commented 1 year ago

@MightyGoldenJA where did you report your metrics to? TensorBoard? A local file?

MightyGoldenJA commented 1 year ago

@ainoam As described in the linked Slack thread, I used a PyTorch Lightning trainer with the default PL logger. Since my trainings using PyTorch Lightning on other projects always had their scalars and metrics properly captured by ClearML, I found it surprising that this wasn't the case for Darts, even though the TFT model is PyTorch-based and is trained with a PL Trainer. [screenshot]

AlexandruBurlacu commented 1 year ago

Hey @MightyGoldenJA Can you please let us know which pytorch-lightning and PyTorch versions you have installed for the failing example?

MightyGoldenJA commented 1 year ago

Hey @AlexandruBurlacu the tested versions are torch==2.0.1 and pytorch-lightning==2.0.2

AlexandruBurlacu commented 1 year ago

Hey @MightyGoldenJA, we had some issues with pytorch-lightning>=2.0.0, but we fixed them in clearml==1.11.1rc2. Can you please install it and see whether it fixes your problem?
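For reference, a requirements pin combining the versions mentioned in this thread (the RC carrying the fix for pytorch-lightning >= 2.0, plus the torch / pytorch-lightning versions reported above):

```
clearml==1.11.1rc2
torch==2.0.1
pytorch-lightning==2.0.2
```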

MightyGoldenJA commented 1 year ago

@AlexandruBurlacu With clearml==1.11.1rc2, not only are the scalars still not captured, but the PL trainer logs are no longer captured either (we had to roll back to restore the working log capture).

I can't pass the ClearML logger in the logger param of my PL trainer without triggering a concurrency exception. I guess I will have to open my own PR on the Darts or ClearML side if I want this to work before the end of the year...

MightyGoldenJA commented 1 year ago

OK, by manually defining a custom PL logger and passing it to the trainer I managed to log scalars, but this is not normal behavior: ClearML is supposed to auto-connect to PyTorch. So I'll let you (@AlexandruBurlacu) close this issue if you do not think this is a problem.

import clearml
from pytorch_lightning.loggers import Logger
from pytorch_lightning.utilities import rank_zero_only


class ClearMLLogger(Logger):
    """Minimal PyTorch Lightning logger that forwards to the current ClearML task."""

    @property
    def name(self):
        return 'ClearMLLogger'

    @property
    def version(self):
        return '0.0.1'

    @rank_zero_only
    def log_hyperparams(self, params):
        # Attach the trainer's hyperparameters to the running ClearML task
        task = clearml.Task.current_task()
        task.connect(params, name='Hyperparameters')

    @rank_zero_only
    def log_metrics(self, metrics, step):
        # Report each PL metric (loss, val_loss, ...) as a ClearML scalar
        task = clearml.Task.current_task()
        for name, metric in metrics.items():
            task.get_logger().report_scalar(title=name, series=name, value=metric, iteration=step)
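As a quick offline sanity check, the metric-forwarding loop in log_metrics can be exercised against a stub task without a live ClearML server. The StubTask/StubLogger names below are illustrative stand-ins, not part of the ClearML API:

```python
# Offline sanity check for the forwarding logic in ClearMLLogger.log_metrics.
# StubTask / StubLogger are hypothetical stand-ins, not part of the ClearML API.
class StubLogger:
    def __init__(self):
        self.reported = []

    def report_scalar(self, title, series, value, iteration):
        # Record the call exactly as ClearML's logger would receive it
        self.reported.append((title, series, value, iteration))


class StubTask:
    def __init__(self):
        self._logger = StubLogger()

    def get_logger(self):
        return self._logger


def forward_metrics(task, metrics, step):
    # Same loop as ClearMLLogger.log_metrics, with the task injected for testing
    for name, metric in metrics.items():
        task.get_logger().report_scalar(title=name, series=name, value=metric, iteration=step)


task = StubTask()
forward_metrics(task, {"loss": 0.42, "val_loss": 0.51}, step=7)
```

Each PL metric ends up as one report_scalar call with the metric name as both title and series, which is what produces one ClearML scalar plot per metric.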