allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

How to upload datasets to remote S3 without compressing them? #1263

Open surya9teja opened 6 months ago

surya9teja commented 6 months ago

Hi, I have set up the open-source version of ClearML in a Kubernetes cluster and am doing some testing. I found that when I upload my local dataset to ClearML, it gets compressed into ZIP format. Is there any way I can upload files without compressing them? Most of my dataset consists of images and PDFs.

from clearml import Dataset
dataset = Dataset.create(
    dataset_name="sample",
    dataset_project="test",
    output_uri="s3://sssss/clearml",
    description="sample testing dataset",
)

dataset.add_files(
    path="sample_dataset",
    wildcard="*.jpg",
    recursive=True,
)

dataset.upload(
    show_progress=True,
    verbose=True,
    compression=None,
    retries=3,
)

Also, can anyone point me to documentation for configuring ClearML on Kubernetes to use external MongoDB and Redis instances instead of creating them in the cluster? And does file uploading have an API endpoint that I can use from my current frontend setup?

jkhenning commented 4 months ago

Hi @surya9teja, currently bypassing compression is not supported, but it's a good idea, and we will add it in the next version 🙂

As for your other questions, see here for where to provide connection strings for external databases (instead of the ones automatically deployed by the clearml chart).
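For illustration, a sketch of what such an override might look like in the chart's values file. The key names under externalServices are assumptions based on the clearml-helm-charts layout, so verify them against the chart's own values.yaml before using:

# Hypothetical Helm values override -- check the actual key names in the chart's values.yaml.
mongodb:
  enabled: false    # do not deploy the in-cluster MongoDB
redis:
  enabled: false    # do not deploy the in-cluster Redis

externalServices:
  # Connection details for externally managed services (assumed keys)
  mongodbConnectionStringAuth: "mongodb://mongo.example.com:27017/auth"
  mongodbConnectionStringBackend: "mongodb://mongo.example.com:27017/backend"
  redisHost: "redis.example.com"
  redisPort: 6379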

Regarding the file upload, where in your frontend would you like to use it? The ClearML fileserver uses a simple HTTP form upload with multipart encoding.
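As a rough sketch, such an upload could look like the following with Python's requests library. The fileserver address, the assumption that the target storage path is taken from the request URL, and the form field name are all illustrative; the exact endpoint and any auth headers depend on your deployment:

import requests

# Hypothetical fileserver address and target path -- adjust for your setup.
FILESERVER = "http://clearml-fileserver:8081"
TARGET = "test/sample/image_001.jpg"

with open("image_001.jpg", "rb") as f:
    # Multipart form upload, as described above.
    resp = requests.post(
        f"{FILESERVER}/{TARGET}",
        files={"file": ("image_001.jpg", f, "image/jpeg")},
    )
resp.raise_for_status()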

energydrink9 commented 3 months ago

Additionally, compressing all the files in large datasets made up of thousands of small files is extremely slow. Disabling compression would solve this issue.
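A possible partial workaround, assuming Dataset.upload passes its compression argument through to Python's zipfile module (ZIP_DEFLATED is the documented default): ZIP_STORED still bundles files into ZIP archives but skips the deflate step, avoiding most of the per-file CPU cost:

from zipfile import ZIP_STORED

# Sketch only: assumes `compression` accepts zipfile constants.
# Files are still packed into ZIP archives, just not deflated,
# so the per-file compression overhead is avoided.
dataset.upload(
    show_progress=True,
    verbose=True,
    compression=ZIP_STORED,
    retries=3,
)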