allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Deadlock when uploading debug samples after dataset sync #1034

Open materight opened 1 year ago

materight commented 1 year ago

Describe the bug

After a dataset is uploaded through a Dataset instance, initializing a new task and calling report_image on its logger does not upload the image, and the subsequent call to .close() hangs indefinitely.

From debugging, it seems that close() gets stuck in this loop: https://github.com/allegroai/clearml/blob/1ccdff5e77d2140844b3cc2b313d386e3c9bcc6b/clearml/backend_interface/metrics/reporter.py#L128-L130
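For anyone trying to confirm where a hang like this occurs, one option is to have Python dump all thread stacks when a call takes too long. The sketch below uses only the standard-library faulthandler module; the helper name dump_stacks_if_stuck and the timeouts are illustrative assumptions, not part of clearml, and time.sleep stands in for the blocking call.

```python
import contextlib
import faulthandler
import sys
import tempfile
import time


@contextlib.contextmanager
def dump_stacks_if_stuck(timeout_sec, file=sys.stderr):
    """Dump every thread's stack to `file` if the block runs longer than timeout_sec.

    Hypothetical helper for illustration; `file` must expose a real file descriptor.
    """
    # repeat=True keeps dumping every timeout_sec seconds while the block is stuck
    faulthandler.dump_traceback_later(timeout_sec, repeat=True, file=file)
    try:
        yield
    finally:
        # Stop the watchdog once the block returns (or raises)
        faulthandler.cancel_dump_traceback_later()


# Demonstration: time.sleep stands in for a call that blocks, such as a hanging close()
with tempfile.TemporaryFile("w+") as log:
    with dump_stacks_if_stuck(0.2, file=log):
        time.sleep(0.6)
    log.seek(0)
    stack_dump = log.read()

print(stack_dump)
```

In the reproduction script, the same context manager could wrap the final call, for example `with dump_stacks_if_stuck(30): task.close()`, so the stack of the stuck thread is written to stderr every 30 seconds.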

To reproduce

import os
import cv2
import numpy as np
from clearml import Dataset, Task

# Write a random 100x100 RGB image to a local folder
img = np.random.randint(0, 255, size=(100, 100, 3), dtype=np.uint8)
dataset_dir = '/tmp/test_dataset'
os.makedirs(dataset_dir, exist_ok=True)
cv2.imwrite(f'{dataset_dir}/img.png', img)

# Create a dataset from that folder, then upload and finalize it
dataset = Dataset.create(dataset_project='TEST', dataset_name='test_dataset')
dataset.sync_folder(dataset_dir)
dataset.upload()
dataset.finalize()

# Start a new task, report the image as a debug sample, and close the task
task = Task.init(project_name='TEST', task_name='test_task', reuse_last_task_id=False)
task.get_logger().report_image('test', 'val', image=img)
task.close()  # this call hangs

Expected behaviour

The debug sample should be uploaded and the task should be properly closed.

Environment

jkhenning commented 1 year ago

Hi @materight,

Thanks, we'll take a look and update.

jkhenning commented 1 year ago

Hi @materight, I tried this code with Python 3.7 and Python 3.10, using both the latest clearml RC and clearml 1.11.0, and it works for me in all cases. Perhaps I'm missing something?

materight commented 1 year ago

Hi @jkhenning, thanks for the update. I'm not sure. I'm uploading to GCP, if that makes any difference.

By the way, in the meantime I found that removing the call to _report_dataset_preview in upload fixes the issue. So for now I added dataset._report_dataset_preview = lambda: None as a quick fix.
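The quick fix above is an instance-level monkeypatch: the method is replaced on one object only, so other instances keep their normal behaviour. A minimal sketch of the pattern follows. FakeDataset is a hypothetical stand-in used so the example runs without a ClearML server, and disable_method is an illustrative helper, not a clearml API. Only the name _report_dataset_preview comes from this thread, and since it is a private method it may change between releases.

```python
def disable_method(obj, name):
    """Replace a method on a single instance with a no-op and return the original."""
    original = getattr(obj, name)  # raises AttributeError if the name is misspelled
    setattr(obj, name, lambda *args, **kwargs: None)
    return original


class FakeDataset:
    """Hypothetical stand-in for a dataset object; not the real clearml.Dataset."""

    def __init__(self):
        self.previews = 0

    def _report_dataset_preview(self):
        self.previews += 1

    def upload(self):
        # upload() calls the preview step internally, as described in the thread
        self._report_dataset_preview()
        return "uploaded"


dataset = FakeDataset()
original = disable_method(dataset, "_report_dataset_preview")
result_patched = dataset.upload()       # preview step is skipped
previews_while_patched = dataset.previews

dataset._report_dataset_preview = original  # restore the original behaviour
dataset.upload()
previews_after_restore = dataset.previews
```

The patch has to be applied before upload() is called, and it also disables the dataset preview itself, so it is a workaround rather than a fix.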

jkhenning commented 1 year ago

Thanks, that's a good insight. I think it might also be OS-related, so I'll run some more tests.
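Since the behaviour may depend on the OS and package versions, a short script like the following could be used to gather the details for the Environment section. It is a sketch that relies only on the standard library; the function name collect_environment and the choice of packages to report are assumptions made for illustration.

```python
import platform
from importlib import metadata


def collect_environment(packages=("clearml",)):
    """Return OS, Python, and installed package versions as a dict of strings."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Report missing packages instead of failing
            info[name] = "not installed"
    return info


env = collect_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

Passing additional names, for example `collect_environment(("clearml", "numpy", "opencv-python"))`, would include the other packages used in the reproduction script.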