allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0
5.71k stars 657 forks

Slow handling of cached files with large `cache_file_limit` #1352

Open materight opened 4 days ago

materight commented 4 days ago

Describe the bug

When `cache_file_limit` is set to a large value, e.g. 10k, calls to `StorageManager.get_local_copy` get extremely slow, even if all the files are already available in the cache.

Profiling suggests that this call to `iterdir()` is the main bottleneck. If the cache holds a lot of small files and `get_local_copy` is called for each of them, iterating over all the files on every call is too slow.
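To illustrate why this scales badly, here is a minimal sketch that uses only the standard library. It does not use ClearML's actual code: `make_cache`, `lookup_with_scan` and `total_entries_scanned` are hypothetical names, and the lookup only mimics the pattern described above (list the whole folder, then return the hit). It counts directory entries instead of measuring time, so the result is deterministic.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def make_cache(folder: Path, n_files: int) -> list[str]:
    """Create n_files small files that stand in for cached downloads."""
    names = [f"file_{i}.bin" for i in range(n_files)]
    for name in names:
        (folder / name).write_bytes(b"x")
    return names


def lookup_with_scan(folder: Path, name: str) -> tuple[Path, int]:
    """Hypothetical lookup that lists the whole folder before returning a hit.

    Returns the cached path and the number of directory entries visited.
    """
    entries = list(folder.iterdir())  # O(N) work on every single call
    return folder / name, len(entries)


def total_entries_scanned(n_files: int) -> int:
    """Look up every cached file once and sum the entries visited overall."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        names = make_cache(folder, n_files)
        return sum(lookup_with_scan(folder, name)[1] for name in names)
```

With N files in the cache and one lookup per file, the folder is listed N times, so N * N entries are visited in total. For a cache of 10k files that would be on the order of 100 million entries visited, even though every lookup is a cache hit.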

To reproduce

Expected behaviour

If all the files are already available in cache, the second run should be almost immediate. Instead it can take minutes.

Since iterating over the files seems to be needed only for deleting old files if the cache is full, maybe there could be a parameter to disable this logic and another method to trigger it manually.
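A rough sketch of what this proposal could look like, assuming a simplified cache wrapper. `FileCache`, its `auto_cleanup` parameter and its `cleanup()` method are hypothetical names for illustration, not existing ClearML API, and the download step is left out. Eviction here goes by modification time, which is also an assumption.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


class FileCache:
    """Hypothetical cache folder wrapper; not ClearML's implementation."""

    def __init__(self, folder: Path, file_limit: int, auto_cleanup: bool = True):
        self.folder = Path(folder)
        self.file_limit = file_limit
        self.auto_cleanup = auto_cleanup  # assumed flag to skip the per-call scan

    def get_local_copy(self, name: str) -> Path | None:
        """Return the cached path if present, else None (download omitted)."""
        if self.auto_cleanup:
            self.cleanup()  # the expensive folder scan happens only in this mode
        path = self.folder / name
        return path if path.exists() else None

    def cleanup(self) -> int:
        """Delete the oldest files beyond file_limit; return how many were removed."""
        files = sorted(
            (p for p in self.folder.iterdir() if p.is_file()),
            key=lambda p: p.stat().st_mtime,
            reverse=True,  # newest first, so the tail holds the oldest files
        )
        stale = files[self.file_limit:]
        for p in stale:
            p.unlink()
        return len(stale)


def demo() -> tuple[int, int, int]:
    """Fill a folder past the limit, read without cleanup, then clean up once."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        for i in range(5):
            p = folder / f"file_{i}.bin"
            p.write_bytes(b"x")
            os.utime(p, (1000 + i, 1000 + i))  # file_0 is the oldest
        cache = FileCache(folder, file_limit=3, auto_cleanup=False)
        hits = sum(cache.get_local_copy(f"file_{i}.bin") is not None for i in range(5))
        removed = cache.cleanup()  # manual trigger, one scan for the whole run
        remaining = len(list(folder.iterdir()))
        return hits, removed, remaining
```

With `auto_cleanup=False`, each lookup costs a single existence check, and the folder is scanned only when `cleanup()` is called explicitly, for example once at the end of a run.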

Environment