allegroai / clearml

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution
https://clear.ml/docs
Apache License 2.0

Error: Exception encountered while uploading dataset #401

Open ErenBalatkan opened 3 years ago

ErenBalatkan commented 3 years ago

On a custom ClearML Server, I'm getting the following error while uploading a big dataset:

Uploading compressed dataset changes (529531 files, total 34.03 GB) to http://161.116.84.196:8002
2021-07-14 14:44:28,445 - clearml.storage - ERROR - Exception encountered while uploading

Error:

(There is no explanation after "Error:" )

The dataset is from the ICCV challenge: https://www.kaggle.com/c/largefinefoodai-iccv-recognition

The system works perfectly fine for small datasets.

Since the error message does not indicate exactly what the problem is, would it be possible to find a clue as to why this is happening? Have you encountered any similar errors in the past?

JDennisJ commented 3 years ago

Hi @ErenBalatkan ,

Where are you uploading the files? To the ClearML files server, or to some cloud provider? If you are using the ClearML files server, is it an on-prem setup or app.community.clear.ml?

As for the upload itself, did you use the CLI, or code with the SDK?

ErenBalatkan commented 3 years ago

Hello,

I'm using a self-hosted ClearML files server running on our own machine. The files server, API server, and web server are all on the same machine. I'm using the docker-compose provided here for the server setup. The ports are modified to 8000, 8001, and 8002, and the storage location is modified to use a mounted hard drive.

https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_linux_mac.html

I'm using the CLI for dataset creation and upload, specifically the following commands:

sudo clearml-data create --project ICCV --name iccv_dataset_v0
sudo clearml-data add --files iccv
sudo clearml-data close --verbose
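
For reference, the rough equivalent from code with the Python SDK would be the following sketch; the project and dataset names are simply the ones used above.

from clearml import Dataset

# Rough SDK equivalent of the clearml-data CLI calls above (sketch only)
ds = Dataset.create(dataset_project="ICCV", dataset_name="iccv_dataset_v0")
ds.add_files(path="iccv")   # same as `clearml-data add --files iccv`
ds.upload()                 # compress and push the files to the files server
ds.finalize()               # same as `clearml-data close`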

The error happens after compression is done; here are the last few lines before it:

Compressing /media/HDD4TB/eren/Data/iccv/test/24b334bdc1264482dec881e0fdead5b0.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/979/9690476.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/587/4820172.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/559/4500392.jpg
Compressing /media/HDD4TB/eren/Data/iccv/test/dd6c2917a0572e1e16dd367d1b939da5.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/299/19270210.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/390/2190469.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/985/9750250.jpg
Compressing /media/HDD4TB/eren/Data/iccv/val/351/19770026.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/202/1830348.jpg
Compressing /media/HDD4TB/eren/Data/iccv/val/528/4120098.jpg
Uploading compressed dataset changes (529531 files, total 34.03 GB) to http://161.116.84.196:8002
2021-07-14 14:44:28,445 - clearml.storage - ERROR - Exception encountered while uploading

Error:

To be on the safe side, I have already configured the non-responsive task watchdog to 24 hours. There are no new entries in the fileserver log. In the dataset storage path on the server, it generates a new folder with an /artifacts/state/state.json object.

At first I assumed it was just a faulty setup on my end, but if that were the case I would expect it to crash for small datasets as well. I will be performing some additional tests in hopes of pinpointing the exact setting that causes this crash.

Update: No problem with dataset sizes up to 5 GB.

JDennisJ commented 3 years ago

Hi @ErenBalatkan ,

I will try to reproduce it on my side (I have a 47 GB dataset for this). Any other tips for reproducing, apart from the dataset size?

ErenBalatkan commented 3 years ago

It could also be related to the total number of images in the dataset, which is around 530K (529,531 files in the upload above).

ErenBalatkan commented 3 years ago

I tested this again with a 20 GB dataset that has around 315K images, and it works fine.

Could this be related to low storage on the hosted ClearML server's main drive? I will try to test again with the full dataset after we clean up some storage on the server's main drive.

(All the cache paths in the config file are already modified to use the 4 TB hard disk, but my guess is that ClearML also interacts with the tmp folder on the server?)

bmartinn commented 3 years ago

ClearML interacts with the tmp folder on the server?

Maybe while uploading? Just making sure, the target output_uri is the ClearML files server, correct? BTW, it seems there is no "magic" in the file server; it just stores the entire file as is: https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/fileserver/fileserver.py#L41
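
Roughly speaking, the upload endpoint just writes whatever it receives to the storage folder. A minimal sketch of the idea (illustrative only, not the actual server code; the storage root and route are assumptions):

from pathlib import Path
from flask import Flask, request

app = Flask(__name__)
UPLOAD_ROOT = Path("/opt/clearml/data/fileserver")  # assumed storage root

@app.route("/<path:target>", methods=["POST"])
def upload(target):
    for _, uploaded in request.files.items():
        # Werkzeug spools large multipart uploads to the OS temp dir before
        # save() copies them to the target path, so free /tmp space on the
        # server can also matter for very large archives.
        dest = UPLOAD_ROOT / target / uploaded.filename
        dest.parent.mkdir(parents=True, exist_ok=True)
        uploaded.save(str(dest))
    return "OK"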

JDennisJ commented 3 years ago

Hi @ErenBalatkan ,

I tried to reproduce this issue with a big dataset (~50 GB), but everything works. From reading the last comments, I understand everything works for you now too.

Any other hints on how I can reproduce it? Or should I close this issue?

ErenBalatkan commented 3 years ago

Hello,

I managed to find a workaround to this problem by uploading each subfolder in the dataset as a separate dataset and then merging them in the training script. I'm not entirely sure if this issue happens due to a faulty setup on my side or if it is an actual bug, but since it is difficult to reproduce, I guess you can close this issue.
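
On the training side, the workaround looks roughly like the sketch below; the per-split dataset names are illustrative.

from clearml import Dataset

# Sketch of the workaround: each subfolder (Train/Val/Test) is uploaded as its
# own dataset, and the training script fetches and merges the local copies.
# The dataset names below are illustrative.
splits = ["iccv_train_v0", "iccv_val_v0", "iccv_test_v0"]
local_paths = [
    Dataset.get(dataset_project="ICCV", dataset_name=name).get_local_copy()
    for name in splits
]
# local_paths now holds one cached folder per split for the data loader to merge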

If I get a better clue as to exactly what is causing this issue, I will let you know.

bmartinn commented 3 years ago

I'm not entirely sure if this issue happens due to a faulty setup on my side or if it is an actual bug, but since it is difficult to reproduce, I guess you can close this issue.

I think this might be caused by a flaky internet connection dropping in the middle of the upload. Maybe it makes sense to add a parameter for chunk size, so the 50 GB will be split into multiple zip files of size X. wdyt?
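
For illustration, such an option could look roughly like the sketch below on the SDK side; the chunk_size argument is an assumption (it appears in later clearml releases) and is not available in the version discussed here.

from clearml import Dataset

# Sketch of a chunked upload: split the compressed dataset into ~500 MB zip
# parts instead of a single 34 GB archive. The chunk_size argument is an
# assumption taken from later clearml releases.
ds = Dataset.create(dataset_project="ICCV", dataset_name="iccv_dataset_v1")
ds.add_files(path="iccv")
ds.upload(chunk_size=500)  # maximum size of each compressed chunk, in MB
ds.finalize()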

ErenBalatkan commented 3 years ago

Sorry for the late reply,

Split uploading seems like it could be beneficial for debugging, and especially beneficial for ultra-large datasets. I guess it would also help with understanding what is happening in our case. But both the clearml-server and the machine from which I'm uploading the dataset are on the same local network, so I highly doubt the issue is the network connection.

bmartinn commented 3 years ago

Hi @ErenBalatkan, are you still getting the error? Is it reproducible?

But both the clearml-server and the machine from which I'm uploading the dataset are on the same local network, so I highly doubt the issue is the network connection.

What are your thoughts on the potential root cause then?

ErenBalatkan commented 3 years ago

I can reproduce the issue every time I try to upload the entire dataset; however, I'm not entirely sure about the root cause.

Specifically, this is the dataset I'm having issues with: https://www.kaggle.com/c/largefinefoodai-iccv-recognition/overview

I put the dataset in a structure like this:

ICCV
    Train
    Val
    Test

If I upload Train, Val, and Test separately, it works with no problems. If I try to upload ICCV, it doesn't.

bmartinn commented 3 years ago

And the upload target is the ClearML files server, is that correct? Could it be a memory issue on the files server (i.e., running out of memory before flushing the entire file)? Could you attach the server logs from while you are uploading the large file?