Open ErenBalatkan opened 3 years ago
Hi @ErenBalatkan ,
Where do you upload the files? The ClearML files server? Or some cloud provider? If you are using the ClearML files server, is it an on-prem setup or app.community.clear.ml?
As for the upload itself, did you use the CLI, or code with the SDK?
Hello,
I'm using a self-hosted ClearML files server running on our own machine; the files server, API server, and web server are all on the same computer. I'm using the docker-compose provided here for the server setup. The ports are modified to 8000, 8001, and 8002, and the storage location is modified to use a mounted hard drive.
https://allegro.ai/clearml/docs/docs/deploying_clearml/clearml_server_linux_mac.html
I'm using the CLI for dataset creation and upload, specifically the following commands:
sudo clearml-data create --project ICCV --name iccv_dataset_v0
sudo clearml-data add --files iccv
sudo clearml-data close --verbose
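For reference, the same three-step workflow can also be driven from Python with the SDK instead of the CLI. A minimal sketch, assuming the `clearml` package is installed and `clearml.conf` points at the same self-hosted server (the function name and arguments are illustrative):

```python
def upload_dataset(project, name, folder):
    """Create, populate, upload, and finalize a ClearML dataset.

    Mirrors `clearml-data create` / `add` / `close`; not executed here
    because it needs a configured connection to a ClearML server.
    """
    from clearml import Dataset  # lazy import: requires the clearml package

    ds = Dataset.create(dataset_project=project, dataset_name=name)
    ds.add_files(path=folder)   # same as `clearml-data add --files`
    ds.upload(verbose=True)     # compresses and uploads to the files server
    ds.finalize()               # same as `clearml-data close`
    return ds.id
```

Driving it from the SDK also makes it easier to wrap the upload in a try/except and log the full traceback, which could help here given the empty "Error:" line.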
The error happens after compression is done; here are the last few lines before the error:
Compressing /media/HDD4TB/eren/Data/iccv/test/24b334bdc1264482dec881e0fdead5b0.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/979/9690476.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/587/4820172.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/559/4500392.jpg
Compressing /media/HDD4TB/eren/Data/iccv/test/dd6c2917a0572e1e16dd367d1b939da5.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/299/19270210.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/390/2190469.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/985/9750250.jpg
Compressing /media/HDD4TB/eren/Data/iccv/val/351/19770026.jpg
Compressing /media/HDD4TB/eren/Data/iccv/train/202/1830348.jpg
Compressing /media/HDD4TB/eren/Data/iccv/val/528/4120098.jpg
Uploading compressed dataset changes (529531 files, total 34.03 GB) to http://161.116.84.196:8002
2021-07-14 14:44:28,445 - clearml.storage - ERROR - Exception encountered while uploading
Error:
To be on the safe side, I have already configured the non-responsive task watchdog to 24 hours. There are no new entries in the fileserver log. In the dataset storage path on the server, it generates a new folder with an /artifacts/state/state.json object.
At first I assumed that it was just a faulty setup on my end, but if that were the case, I would also expect it to crash for small datasets. I will be performing some additional tests in hopes of pinpointing the exact setting that causes this crash.
Update: no problem with dataset sizes up to 5 GB.
Hi @ErenBalatkan ,
I will try to reproduce it on my side (I have a 47 GB dataset for this). Any other tips for reproducing, apart from the dataset size?
It could also be related to the total number of images in the dataset, which is around 50K.
I tested this again with a 20 GB dataset that has around 315K images, and it works fine.
Could this be related to low main-drive storage on the hosted ClearML server? I will test again with the full dataset after we clean up some storage on the server's main drive.
(All the cache paths in the config file are already modified to use the 4 TB hard disk, but my guess is that ClearML interacts with the tmp folder on the server?)
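One thing worth checking in that direction: Python's `tempfile` module (which compression code on either side may go through) resolves the temp directory from the `TMPDIR`/`TEMP`/`TMP` environment variables before falling back to `/tmp`, so a small root drive can be worked around by redirecting it. A quick sketch, with nothing ClearML-specific assumed (the target path is just an example matching this thread's setup):

```python
import os
import tempfile

# Where temporary files currently go (usually /tmp unless overridden)
print(tempfile.gettempdir())

# Redirect temp storage to the big disk. Note that tempfile caches the
# resolved directory, so the variable must be set *before* the process
# doing the compression starts, e.g. in the shell:
#   TMPDIR=/media/HDD4TB/tmp clearml-data close --verbose
os.environ["TMPDIR"] = "/media/HDD4TB/tmp"  # example path, not a ClearML default
```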
ClearML interacts with tmp folder on server?
Maybe while uploading? Just making sure: the target output_uri is the ClearML files server, correct?
BTW, it seems there is no "magic" in the file server; it just stores the entire file as-is:
https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/fileserver/fileserver.py#L41
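In other words, the endpoint essentially streams the request body to disk. A simplified, framework-free sketch of that kind of behavior (the function name and chunk size are illustrative, not the server's actual code) — the point being that chunked writes mean the whole archive never has to sit in memory at once:

```python
def save_upload(stream, dst_path, chunk_size=1024 * 1024):
    """Write an incoming file-like stream to disk in fixed-size chunks."""
    written = 0
    with open(dst_path, "wb") as dst:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
    return written
```

If the real endpoint instead buffered the whole body before writing, a 34 GB upload could plausibly exhaust memory, which is why the server logs captured during the failure would be informative.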
Hi @ErenBalatkan ,
I tried to reproduce this issue with a big dataset (~50 GB), but everything works. From reading the last comments, I understand everything works for you too now.
Any other hints on how I can reproduce it? Or should I close this issue?
Hello,
I managed to find a workaround to this problem by uploading each subfolder in the dataset as a separate dataset and then merging them in the training script. I'm not entirely sure whether this issue is due to a faulty setup on my side or an actual bug, but since it is difficult to reproduce, I guess you can close this issue.
If I get a better clue as to exactly what is causing this, I will let you guys know.
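For anyone hitting the same wall, the merge side of that workaround can be done by creating one dataset whose parents are the per-folder datasets. A hedged sketch with the SDK (the function and dataset names are placeholders; not executed here because it needs a live server connection):

```python
def merge_parts(project, part_names, merged_name):
    """Create a single dataset whose parents are the per-folder datasets."""
    from clearml import Dataset  # lazy import: requires the clearml package

    parents = [
        Dataset.get(dataset_project=project, dataset_name=name).id
        for name in part_names
    ]
    merged = Dataset.create(
        dataset_project=project,
        dataset_name=merged_name,
        parent_datasets=parents,  # inherits all files from the parts
    )
    merged.finalize()
    return merged.id
```

The training script then only needs one `Dataset.get(...).get_local_copy()` call on the merged dataset instead of fetching each part.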
I'm not entirely sure if this issue happens due to a faulty setup on my side or is it an actual bug but since it is difficult to reproduce, I guess you can close this issue.
I think this might be caused by a flaky internet connection dropping in the middle of the upload. Maybe it makes sense to add a parameter for chunk size, so the 50 GB is split into multiple zip files of size X. wdyt?
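Independently of how the SDK would implement it, the splitting logic itself is straightforward. A minimal sketch that groups (path, size) pairs into chunks of at most X bytes, each chunk then becoming its own zip (pure Python, no ClearML APIs; a single file larger than the limit gets its own chunk):

```python
def split_into_chunks(files, max_bytes):
    """Group (path, size_in_bytes) pairs into chunks of at most max_bytes.

    A file larger than max_bytes becomes its own chunk rather than
    being rejected.
    """
    chunks, current, current_size = [], [], 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Besides resilience to dropped connections, smaller archives would also narrow down whether the failure correlates with total upload size or with some specific file.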
Sorry for the late reply.
Split uploading seems like it could be beneficial for debugging, and especially for ultra-large datasets; I guess it would also help with understanding what is happening in our case. But both the clearml-server and the machine from which I'm uploading the dataset are on the same local network, so I highly doubt the issue is the network connection.
Hi @ErenBalatkan, are you still getting the error? Is it reproducible?
But both the clearml-server and the machine from which I'm uploading the dataset are on same local network so I highly doubt if the issue was network connection.
What are your thoughts on the potential root cause then?
I can reproduce the issue every time I try to upload the entire dataset; however, I'm not entirely sure about the root cause.
Specifically, this is the dataset I'm having issues with. https://www.kaggle.com/c/largefinefoodai-iccv-recognition/overview
I put the dataset in a structure like this:

ICCV
├── Train
├── Val
└── Test

If I upload Train, Test, and Val separately, it works with no problems. If I try to upload ICCV, it doesn't.
And the target upload is the ClearML files-server, is that correct?
Could it be a memory issue on the files-server (i.e. running out of memory before flushing the entire file)?
Could you attach the server logs from while you are uploading the large file?
On a custom ClearML Server, I'm getting the following error while uploading a big dataset (there is no explanation after "Error:").
The dataset is from the ICCV challenge: https://www.kaggle.com/c/largefinefoodai-iccv-recognition
The system works perfectly fine for small datasets.
Since the error message does not indicate exactly what the problem is, would it be possible to find a clue as to why this is happening? Have you encountered any similar errors in the past?