allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

External files: Improve performance #962

Closed: john-zielke-snkeos closed this 1 year ago

john-zielke-snkeos commented 1 year ago

- Add option for faster .add_external_files()
- Add option for parallel external file download

add_external_files: Add the option of specifying one dataset_path per entry in source_url. This way, multiple files with different dataset_paths can be added in one batch; otherwise, updating the dataset state introduces a huge overhead for a large number of files with different dataset paths.
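
For illustration, a batched call under this proposal might look like the following sketch (hypothetical URLs and names, inferred from the description above rather than taken from the PR):

from clearml import Dataset

dataset = Dataset.create(dataset_name="external_batch_demo", dataset_project="test")

# One dataset_path entry per source_url entry, so files destined for
# different folders are registered in a single batch, i.e. one dataset
# state update instead of one per file.
dataset.add_external_files(
    source_url=[
        "s3://my-bucket/raw/train/a.csv",
        "s3://my-bucket/raw/val/b.csv",
    ],
    dataset_path=[
        "data/train",
        "data/val",
    ],
)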

Parameter max_workers_external_files: Allows parallel download of external files (especially relevant for smaller files). For backwards compatibility this is introduced as a new parameter; it could simply reuse the value of the existing parameter if desired. I am not sure how the internal thread pool of the boto library interacts with this in the case of larger files.
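
Conceptually, the parallel download boils down to a thread pool over the external URLs, along the lines of this sketch (my illustration of the idea, not the PR's implementation; StorageManager.get_local_copy is ClearML's existing download helper):

from concurrent.futures import ThreadPoolExecutor

from clearml.storage import StorageManager

def download_external_files(urls, max_workers_external_files=4):
    # Downloads are I/O-bound, so several threads can overlap network
    # waits; each external file is fetched in its own worker thread.
    with ThreadPoolExecutor(max_workers=max_workers_external_files) as pool:
        return list(pool.map(StorageManager.get_local_copy, urls))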

Related Issue / Discussion

If this patch originated from a GitHub issue or Slack conversation, please paste a link to the original discussion.

Patch Description

This PR adds a few options to allow for faster download of external files as well as for faster adding of external files to the dataset.

Testing Instructions

Instructions on how to test the patch or reproduce the issue the patch aims to solve.

Other Information

jkhenning commented 1 year ago

Hi @john-zielke-snkeos , sorry for the delay, we'll try this out soon 👍

AlexandruBurlacu commented 1 year ago

Hey @john-zielke-snkeos, this is a really nice improvement you have here. While testing it before approving the PR, we identified that this code breaks a few of our tests. Here's a test snippet you can use to check it yourself:

import os

from clearml import Dataset

if __name__ == "__main__":
    dataset = Dataset.create(dataset_name="test_external_files_http", dataset_project="test")
    dataset.add_external_files(
        "https://github.com/allegroai/clearml/releases/download/v1.1.6/clearml-1.1.6-py2.py3-none-any.whl")
    dataset.upload()
    dataset.finalize()

    dataset = Dataset.get(dataset.id)
    copy_path = dataset.get_local_copy()
    copied_files = os.listdir(copy_path)
    for copied_file in copied_files:
        assert os.path.getsize(copy_path + "/" + copied_file) != 0

And it fails with the traceback:

File "external_files_http.py", line 8, in <module>
    "https://github.com/allegroai/clearml/releases/download/v1.1.6/clearml-1.1.6-py2.py3-none-any.whl")
File "/opt/atlassian/pipelines/agent/build/clearml/datasets/dataset.py", line 465, in add_external_files
    if len(dataset_path) != len(source_url):
TypeError: object of type 'NoneType' has no len()
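
The failure mode is visible in the traceback: dataset_path defaults to None, so the new length check calls len(None). A minimal guard of roughly this shape would avoid it (a sketch, not the fix that was eventually committed):

# Inside add_external_files: only compare lengths when dataset_path is
# actually a list; a single URL with dataset_path=None remains valid.
if isinstance(dataset_path, list):
    if not isinstance(source_url, list) or len(dataset_path) != len(source_url):
        raise ValueError("dataset_path must have one entry per source_url entry")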

AlexandruBurlacu commented 1 year ago

One more thing. There's one part of this PR I would have done differently - it's max_workers_external_files. I strongly believe this is a detail the end user shouldn't care much about. I suggest instead selecting the number of threads allocated for external files vs internal files in a proportional manner and letting the user specify only the total number of threads they want for this job.

john-zielke-snkeos commented 1 year ago

> One more thing. There's one part of this PR I would have done differently - it's max_workers_external_files. I strongly believe this is a detail the end user shouldn't care much about. I suggest instead selecting the number of threads allocated for external files vs internal files in a proportional manner and letting the user specify only the total number of threads they want for this job.

I see your point, and I was considering solutions for this as well. A proportional setting in the same location would still require the same manual tuning, right? And I am not sure I understand your point with regard to the total number of threads: AFAIK the two types of downloads are executed sequentially, so IMO a single number determining the shared total number of threads would have only limited meaning. And given that there are different functions and APIs for adding external files, having a corresponding setting here should be relatively intuitive for most users. (And to make sure I fully understand your point: in what way do the max_workers and max_workers_external_files settings differ from each other? Both seem like relatively low-level settings that many users might not change by default.)

This also raises the question of defaults. Right now the default number of external-file workers stays at 1, so the user needs to increase it to benefit from the new feature. This keeps backwards compatibility but might cause users not to use the feature at all, even though I think it is most beneficial when you have many small files. So maybe it would make sense to set a higher number by default. Maybe the previously mentioned max_workers could instead be used as the default, although downloading external files should be a mainly I/O-bound process, so I am not sure how much sense it makes to use the number of CPU cores as the default.
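
For reference, Python's own ThreadPoolExecutor has used an I/O-oriented default rather than the raw core count since 3.8, and something similar could serve as a starting point here (the helper name below is hypothetical):

import os

# Since Python 3.8, ThreadPoolExecutor defaults to min(32, cpu_count + 4):
# oversubscribe the cores a little for I/O-bound work, but cap the pool so
# a large machine does not open dozens of connections by default.
def default_external_download_workers():
    return min(32, (os.cpu_count() or 1) + 4)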

john-zielke-snkeos commented 1 year ago

> Hey @john-zielke-snkeos, this is a really nice improvement you have here. While testing it before approving the PR, we identified that this code breaks a few of our tests. Here's a test snippet you can use to check it yourself:
>
> import os
>
> from clearml import Dataset
>
> if __name__ == "__main__":
>     dataset = Dataset.create(dataset_name="test_external_files_http", dataset_project="test")
>     dataset.add_external_files(
>         "https://github.com/allegroai/clearml/releases/download/v1.1.6/clearml-1.1.6-py2.py3-none-any.whl")
>     dataset.upload()
>     dataset.finalize()
>
>     dataset = Dataset.get(dataset.id)
>     copy_path = dataset.get_local_copy()
>     copied_files = os.listdir(copy_path)
>     for copied_file in copied_files:
>         assert os.path.getsize(copy_path + "/" + copied_file) != 0
>
> And it fails with the traceback:
>
> File "external_files_http.py", line 8, in <module>
>     "https://github.com/allegroai/clearml/releases/download/v1.1.6/clearml-1.1.6-py2.py3-none-any.whl")
> File "/opt/atlassian/pipelines/agent/build/clearml/datasets/dataset.py", line 465, in add_external_files
>     if len(dataset_path) != len(source_url):
> TypeError: object of type 'NoneType' has no len()

Thank you for finding this, I will fix it.

AlexandruBurlacu commented 1 year ago

> AFAIK the two types of downloads are executed sequentially, so IMO a single number determining the shared total number of threads would have only limited meaning.

You are correct, my bad. Indeed they run sequentially, so in principle we can use the same max_workers for both external and internal file downloads.

> And to make sure I fully understand your point: in what way do the max_workers and max_workers_external_files settings differ from each other? Both seem like relatively low-level settings that many users might not change by default.

They are both low-level, indeed. But most users are already familiar with parameters like n_jobs, n_workers, max_workers and so on, so those are easy to understand. Having two pools of workers/threads/jobs is less familiar and can lead to confusion. Besides, it leaks implementation details to the user, like the fact that internal and external file downloads are handled by different thread pools. So if we ever decide to change the implementation in the future, we will basically have to keep both parameters (max_workers and max_workers_external_files) around for compatibility reasons.
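
For illustration, retiring one of the two parameters later would require the usual deprecation shim, roughly like this (function and parameter names are stand-ins, not ClearML code):

import warnings

def get_local_copy(max_workers=None, max_workers_external_files=None):
    # Accept the old knob for compatibility, warn, and fold it into the
    # single surviving parameter.
    if max_workers_external_files is not None:
        warnings.warn(
            "max_workers_external_files is deprecated; use max_workers",
            DeprecationWarning,
        )
        max_workers = max_workers or max_workers_external_files
    ...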

> [...] although downloading external files should be a mainly I/O-bound process, so I am not sure how much sense it makes to use the number of CPU cores as the default.

The optimal number of threads depends on a variety of factors, and for optimal performance it requires at least some benchmarking and tuning. Using N threads == N logical CPU cores is just a (very) safe default, so that users can benefit from parallel downloads without any intervention.
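
If a user does want to tune it, a quick benchmark over a few pool sizes is usually enough (a sketch; download_external_files is the hypothetical helper from the earlier sketch):

import time

# Time the same batch of URLs at several pool sizes and keep the fastest.
# Results depend on bandwidth, latency, and file sizes, which is why no
# single default wins everywhere.
def best_worker_count(urls, candidates=(1, 4, 8, 16)):
    timings = {}
    for n in candidates:
        start = time.perf_counter()
        download_external_files(urls, max_workers_external_files=n)
        timings[n] = time.perf_counter() - start
    return min(timings, key=timings.get)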

john-zielke-snkeos commented 1 year ago

So do you think just using the same max_workers for both would be acceptable? I also think that would be a reasonable solution.

jkhenning commented 1 year ago

> So do you think just using the same max_workers for both would be acceptable? I also think that would be a reasonable solution.

@john-zielke-snkeos I agree, let's do that and I'll merge it

jkhenning commented 1 year ago

Hi @john-zielke-snkeos, apologies, I found one little thing to fix (added a comment) 🙏