allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Forced stop when uploading large (100 GB) dataset #1064

Open clementruhm opened 1 year ago

clementruhm commented 1 year ago

Describe the bug

I am trying to upload a dataset to a self-hosted ClearML server:

id=$(clearml-data create --project balacoon --name libritts --version 0.0 --tags raw 2>&1 | grep "dataset created id" | awk -F'=' '{print $2}')
clearml-data add --files /home/clement/workspace/data/raw/libritts/LibriTTS --id $id
clearml-data close

On the client, the commands hang. In the web interface, the dataset creation is marked as "Aborted". In the console, the last messages I see are:

2023-06-30 23:57:03
Generating SHA2 hash for 1136604 files
2023-07-01 00:01:37
Hash generation completed
2023-07-01 00:03:14
Uploading dataset files: {'show_progress': True, 'verbose': False, 'output_url': None, 'compression': None}

And under "Info" tab, I see:

STATUS MESSAGE:
Forced stop (non-responsive)
STATUS REASON:
Forced stop (non-responsive)

The LibriTTS dataset is around 100 GB. The machine dedicated as the ClearML server is not the strongest, but I expected it to be good enough: Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz, 8 cores, 15 GB RAM.

Is there a setting to increase the timeout or give the service more resources? Why is copying data such a heavy job that it gets killed?
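For reference, the same workflow can also be driven from the Python SDK, where the upload step exposes the parameters shown in the log line above (show_progress, verbose, output_url, compression). This is a minimal sketch, assuming the clearml Python package is configured against the same self-hosted server; the project/name/version values simply mirror the CLI calls above:

from clearml import Dataset

# Create the dataset entry (mirrors `clearml-data create`)
ds = Dataset.create(
    dataset_project="balacoon",
    dataset_name="libritts",
    dataset_version="0.0",
    dataset_tags=["raw"],
)

# Register the local files (mirrors `clearml-data add --files ...`)
ds.add_files(path="/home/clement/workspace/data/raw/libritts/LibriTTS")

# Upload and finalize (mirrors `clearml-data close`); this is the step
# that hashes, archives and uploads the ~100 GB of files
ds.upload(show_progress=True, verbose=True)
ds.finalize()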

Environment

energydrink9 commented 2 months ago

It's probably caused by the compression step: the program is not really hanging, it's just slowly processing the large number of files. I have a similar problem. Could we solve it by disabling compression?
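If compression really is the bottleneck, one possible workaround (continuing from the `ds` object in the SDK sketch above) would be to store the archives uncompressed and split the upload into smaller chunks, so each archive finishes and reports progress sooner. `compression` and `chunk_size` are regular arguments of `Dataset.upload()`, but whether this avoids the non-responsive abort on a 100 GB dataset is untested here, and the chunk size below is just an example value:

import zipfile

# Pack the files without deflating them (skips the CPU-heavy compression step)
# and split the upload into smaller archives so progress is reported more often.
ds.upload(
    show_progress=True,
    compression=zipfile.ZIP_STORED,  # store without compressing
    chunk_size=512,                  # archive chunk size in MB (example value)
)
ds.finalize()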