allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs

Elasticsearch Error Trying to create too many buckets. #89

Closed czotti closed 1 year ago

czotti commented 2 years ago

This issue has been discussed here: https://clearml.slack.com/archives/CTK20V944/p1633012994173400.

Versions

When I try to delete some experiments, this error message appears in the clearml-server interface:

General data error (TransportError(503, 'search_phase_execution_exception', 'Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.'))

I found that the issue comes from Elasticsearch. We could increase search.max_buckets, but I have read that this is not recommended.
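For reference, the bucket limit mentioned in the error can be raised through Elasticsearch's cluster update settings API. The sketch below is a minimal illustration of that workaround (not a recommendation, per the caveat above); it assumes the clearml-server Elasticsearch instance is reachable at http://localhost:9200, and the URL and limit are placeholders to adjust for your deployment.

```python
import json
from urllib import request


def max_buckets_settings(limit: int) -> dict:
    """Build the cluster-settings payload raising search.max_buckets."""
    return {"persistent": {"search.max_buckets": limit}}


def apply_max_buckets(es_url: str = "http://localhost:9200", limit: int = 20000) -> None:
    """PUT the new limit to the Elasticsearch cluster settings endpoint."""
    body = json.dumps(max_buckets_settings(limit)).encode()
    req = request.Request(
        f"{es_url}/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    request.urlopen(req)  # raises urllib.error.HTTPError on failure
```

The persistent setting survives cluster restarts; a transient setting would be lost on restart, which is why persistent is used here.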

Here is the code to reproduce the issue:

#!/usr/bin/env python3
import time

import numpy as np
from clearml import Task
from skimage import data
from skimage.transform import resize


def main():
    task = Task.init("Report_images", "reporting_images", Task.TaskTypes.testing)
    logger = task.get_logger()

    # Left third holds a static image; the rest is refilled with noise each step.
    buffer = np.zeros((256, 256 * 3, 3))
    buffer[:, :256] = resize(data.astronaut(), (256, 256, 3))
    max_epoch = 150
    epoch_length = 111
    iteration = 0
    rng = np.random.RandomState(42)
    for epoch in range(1, max_epoch + 1):
        losses = []
        metrics = []
        # Training: report per-iteration scalars and an image every mini-batch.
        for mini_batch in range(epoch_length):
            losses.append(rng.randn())
            metrics.append(rng.randn(2))
            logger.report_scalar("Iteration metric", "background", metrics[-1][0], iteration)
            logger.report_scalar("Iteration metric", "foreground", metrics[-1][1], iteration)
            logger.report_scalar("Iteration loss", "loss", losses[-1], iteration)
            buffer[:, 256:] = rng.rand(256, 512, 3)
            logger.report_image("train", f"train_{epoch:03d}", iteration=mini_batch, image=buffer)
            iteration += 1
            time.sleep(0.05)
        task.flush(wait_for_uploads=True)

        # Per-epoch averages reported under separate scalar titles.
        mean_metrics = np.mean(metrics, axis=0)
        mean_loss = np.mean(losses)
        logger.report_scalar("Average metric", "background", mean_metrics[0], epoch)
        logger.report_scalar("Average metric", "foreground", mean_metrics[1], epoch)
        logger.report_scalar("Average loss", "loss", mean_loss, epoch)


if __name__ == "__main__":
    main()
czotti commented 1 year ago

We updated our server installation to clearml-server 1.8.0 and everything is fine now.
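For anyone hitting the same error, the standard docker-compose upgrade procedure from the ClearML docs looks roughly like the following. This is a sketch, assuming the default install location /opt/clearml/ used by the ClearML deployment guide; back up /opt/clearml/data before upgrading.

```shell
# Stop the running clearml-server stack
docker-compose -f /opt/clearml/docker-compose.yml down

# Back up the data directory (Elasticsearch, MongoDB, Redis, fileserver)
sudo tar czf ~/clearml_data_backup.tar.gz -C /opt/clearml data

# Pull the latest server images and restart
docker-compose -f /opt/clearml/docker-compose.yml pull
docker-compose -f /opt/clearml/docker-compose.yml up -d
```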