allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

ClearML does not log after while #317

Open · daandres opened this issue 3 years ago

daandres commented 3 years ago

Hello,

I use ClearML with pytorch 1.7.1, pytorch-lightning 1.2.2 and Tensorboard auto logging. All ClearML components are on the latest stable releases (clearml 0.17.4, clearml-agent 0.17.2).

Since a few days ago my runs are no longer logged correctly. I want to train for 1500 epochs, but after around 500-700 epochs no more logs show up in ClearML. There is also no further console log in the WebUI for this task. If I open the raw Tensorboard logs, everything seems to be fine.

In the console log on my runner's machine there is no ClearML message that indicates anything wrong (actually there is only the setup log at the beginning of my script). The WebUI says the task was updated X hours ago, but the last visible change (the last console output in the WebUI) is X+Y hours old.

Under /opt/clearml/logs/apiserver.log there is also no message indicating an issue.

How can I find out what the error is and fix it?

Thanks

bmartinn commented 3 years ago

Hi @daandres

A few questions on the setup itself: what's the OS / Python version? Are you running inside a docker? Is this a manual execution, or is it executed by an agent?

Since a few days ago my runs are no longer logged correctly.

Anything specific that you can think of that happened? Maybe a package update?

The WebUI says the task was updated X hours ago, but the last visible change (the last console output in the WebUI) is X+Y hours old.

Would you expect it to say "last few seconds", i.e. is it still running?

In v0.17.5 (still in RC) we redesigned the reporting mechanism; could you check whether the issue persists with the latest RC?

pip install clearml==0.17.5rc5
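
To double-check which SDK version is actually active in the training environment (a trivial check, nothing specific to this issue):

import clearml
print(clearml.__version__)  # should print 0.17.5rc5 after the upgrade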

daandres commented 3 years ago

I'll give 0.17.5rc5 a try and will report my findings.

There were some pip updates recently (mainly pytorch-lightning to 1.2.2), and OS updates (nvidia) as well. A week ago I updated the clearml-server as mentioned in issue #314. As I remember, it worked well before that.

On the Web UI the task state is updated properly (running state with last updated time "a few seconds ago"), but the content is not updated (which is the issue).

Edit: I tested 0.17.5rc5 and it made things even worse. Logs are updated only for 30 minutes, but the state in the experiment overview keeps showing "a few seconds ago". Previously it worked for about 5 hours from the start; now only 30 minutes.

bmartinn commented 3 years ago

Thanks @daandres

On the Web UI the task state is updated properly (running state with last updated time "a few seconds ago")

This means the keep-alive thread is running correctly, and the process is alive.

Logs are updated only for 30 minutes,

Is this all metrics, or just the console?

There were some pip updates recently (mainly pytorch-lightning to 1.2.2), and OS updates (nvidia) as well.

I guess "pytorch-lightning" update might have an effect, but then I would expect it to either log or not work at all... Could you think of a toy code that would reproduce this issue ?

Could you send the full execution console log (maybe the short one, from the 0.17.5rc5 test)? Hopefully we can get some idea of what's going on...

daandres commented 3 years ago

Logs are updated only for 30 minutes,

Is this all metrics, or just the console?

Nothing is updated: no console log, no scalars, no plots, no debug samples. The thread is still active and keeps the Task in running mode, and the overview table says "last updated a few seconds ago".

There were some pip updates recently (mainly pytorch-lightning to 1.2.2), and OS updates (nvidia) as well.

I guess "pytorch-lightning" update might have an effect, but then I would expect it to either log or not work at all... Could you think of a toy code that would reproduce this issue ?

I will test a toy sample to reproduce it. I will also test with the previous pytorch-lightning version.

Could you send the full execution console log (maybe the short one, from the 0.17.5rc5 test)? Hopefully we can get some idea of what's going on...

I will post a log, but I haven't seen any output that shouldn't be there, just my training progress.

bmartinn commented 3 years ago

Nothing is updated: no console log, no scalars, no plots, no debug samples. The thread is still active and keeps the Task in running mode, and the overview table says "last updated a few seconds ago".

This sounds like the reporting threads died for some reason; I have no idea how that could happen :(

I will test a toy sample to reproduce it. I will also test with the previous pytorch-lightning version. I will post a log, but I haven't seen any output that shouldn't be there, just my training progress.

Many thanks @daandres !
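
As a quick sanity check (a minimal sketch, assuming the standard clearml SDK calls), you could force a manual report and flush from inside the training process; if this never shows up in the WebUI, the background reporting has indeed stopped:

from clearml import Task, Logger

task = Task.current_task()      # the task created by Task.init() in the training script
logger = Logger.current_logger()

# report a marker scalar and force a flush of the buffered reports
logger.report_scalar(title="debug", series="heartbeat", value=1.0, iteration=0)
logger.flush()
task.flush(wait_for_uploads=False)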

daandres commented 3 years ago

Maybe I found an issue that could be the problem. I'm not sure if it is the case for my current project, but in a test model it clearly is. In both (test and project) I use many debug samples. PyTorch Lightning also produces potentially many checkpoints of my model, which are all registered. Maybe there is a bottleneck in processing the incoming data in the server component. In the test project I see the logs updating slowly when the tab is active, but the shown epochs are far behind the current processing.

Attached is the code (a modified version of the BoringModel from PyTorch Lightning plus a wrapper of mine).

import torch
from torch.utils.data import Dataset
import numpy as np

from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger
from pytorch_lightning import LightningModule, metrics, Trainer

from clearml import Task

class RandomDataset(Dataset):

    def __init__(self, length, width, height):
        self.len = length
        self.imgs = list()
        self.labels = list()
        self.width = width
        self.height = height
        for i in range(length):
            img = torch.from_numpy(np.random.rand(1, self.width, self.height)).float()
            label = torch.from_numpy(np.ones((self.width, self.height), np.uint8)).long()
            self.imgs.append(img)
            self.labels.append(label)

    def __getitem__(self, index):
        return self.imgs[index], self.labels[index]

    def __len__(self):
        return self.len

class BoringModel(LightningModule):

    def __init__(self):
        super().__init__()
        self.num_classes = 2
        self.layer = torch.nn.Conv2d(1, self.num_classes, kernel_size=3, stride=1, padding=1, bias=True)
        self.val_iou = metrics.classification.IoU(self.num_classes)
        self.criterion = torch.nn.CrossEntropyLoss()

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        logits = self.layer(batch[0])
        loss = self.criterion(logits, batch[1])
        self.log("train/loss", loss, on_step=False, on_epoch=True)
        for i in range(batch[0].shape[0]):
            self.logger.experiment.add_image(f'train/{batch_idx}/{i}', batch[0][i, ...], self.current_epoch)
        return {"loss": loss}

    def validation_step(self, batch, batch_idx):
        logits = self.layer(batch[0])
        loss = self.criterion(logits, batch[1])
        _, preds = torch.max(logits, 1)
        iou = self.val_iou(preds, batch[1])

        self.log('val/iou', iou, on_step=False, on_epoch=True)
        for i in range(batch[0].shape[0]):
            self.logger.experiment.add_image(f'val/{batch_idx}/{i}', batch[0][i, ...], self.current_epoch)
        return {"x": loss, "val/iou": iou}

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        return [optimizer]

if __name__ == '__main__':
    name = "TestBoringModel2"
    task = Task.init(project_name="Test", task_name=name)
    task.mark_started(True)

    train_data = torch.utils.data.DataLoader(RandomDataset(64, 256, 256), num_workers=8, batch_size=8)
    val_data = torch.utils.data.DataLoader(RandomDataset(16, 256, 256), num_workers=8, batch_size=8)
    test_data = torch.utils.data.DataLoader(RandomDataset(16, 256, 256), num_workers=8, batch_size=8)
    model = BoringModel()

    checkpoint_callback = ModelCheckpoint(
        verbose=False,
        monitor='val/iou',
        save_last=True,
        save_top_k=1,
        mode='max'
    )
    logger = TensorBoardLogger("/tmp/tensorboard_logs/", name=name, version=0)

    trainer = Trainer(max_epochs=150001,
                         gpus=1,
                         check_val_every_n_epoch=5,
                         callbacks=[checkpoint_callback],
                         logger=logger
                         )
    results = None
    trainer.fit(model, train_data, val_data)

    task.close()

Local console output: Pastebin link. Locally I'm at epoch 2399 while the web UI shows 259 (the local run is roughly 4 hours ahead of the web UI).

Maybe the test model logs very heavily, but in a real-world model I do not log that many images.
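
A possible mitigation for the toy example (just a sketch on my side, not verified) would be to throttle the debug images so the reporting queue is not flooded, e.g. only report them every 50th epoch in training_step:

    def training_step(self, batch, batch_idx):
        logits = self.layer(batch[0])
        loss = self.criterion(logits, batch[1])
        self.log("train/loss", loss, on_step=False, on_epoch=True)
        # assumption: every 50th epoch is frequent enough for visual inspection
        if self.current_epoch % 50 == 0:
            for i in range(batch[0].shape[0]):
                self.logger.experiment.add_image(f'train/{batch_idx}/{i}', batch[0][i, ...], self.current_epoch)
        return {"loss": loss}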

bmartinn commented 3 years ago

Hmm @daandres, that makes sense; it is not really "stuck" so much as it is uploading a lot of images/models. Could it be related to #310? Should we somehow drop them? Or do we want to create a priority for the console logs vs the images?

daandres commented 3 years ago

Yes, I think it is related to #310 for the test case. For my original project I have to investigate whether this is really the problem. I will disable the model upload via the callback mentioned in the other issue.

I would prefer to have log output and scalar logging as high priority, and throttle data uploading (samples and artifacts) down.
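
For the model upload part, one way to turn it off entirely (a minimal sketch, assuming the auto_connect_frameworks argument of Task.init, as an alternative to the callback from the other issue):

from clearml import Task

# disable automatic logging/upload of PyTorch checkpoints only,
# while keeping Tensorboard scalars and console logging intact
task = Task.init(
    project_name="Test",
    task_name="TestBoringModel2",
    auto_connect_frameworks={'pytorch': False},
)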

bmartinn commented 3 years ago

I would prefer to have log output and scalar logging as high priority, and throttle data uploading (samples and artifacts) down.

These are actually two different throttles / priorities we might need to add; both are important, but nonetheless different. Which one do you think is the cause of the slowdown in your specific case?
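
In the meantime, one client-side knob that already exists (a sketch, assuming the Logger API): the flush period of the reporter. It does not prioritize console over images; it only changes how often buffered reports are sent:

from clearml import Logger

# flush buffered scalars/console reports every 5 seconds instead of the default period
Logger.current_logger().set_flush_period(5.0)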

daandres commented 3 years ago

I think model artifact uploading is the main issue. I haven't tested it much with the new rc6, but the first try went well.

bmartinn commented 3 years ago

Thanks @daandres! If the conclusion is that the culprit is the model upload, let's continue the discussion over in #310.