Nota-NetsPresso / BK-SDM

A Compressed Stable Diffusion for Efficient Text-to-Image Generation [ECCV'24]

data loading problem with 89M pairs #29

Closed youngwanLEE closed 1 year ago

youngwanLEE commented 1 year ago

Hi, thanks to your excellent work, I have conducted many experiments.

When I trained on a subset of LAION-Aesthetics 5+ (about 89M pairs), my training process was killed without a specific error message :(

It seems to have occurred at the load_dataset call.

I guess the training set is too big, but I'm not sure.

I think this problem may be caused by Hugging Face's datasets library.

Have you ever faced this problem? And have you tried to train your model on a much bigger training set?

Thanks in advance :)

[screenshot attached]
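
For reference, a minimal sketch of one way to avoid materializing such a huge index in RAM is to stream the metadata with the datasets library; the file name metadata.csv and its column layout are assumptions here, not the repo's actual training setup.

```python
# Hedged sketch (not the repo's kd_train_text_to_image.py): stream the caption
# metadata instead of building the full in-memory index for ~89M pairs.
# "metadata.csv" and its columns are assumptions for illustration.
from datasets import load_dataset

stream = load_dataset(
    "csv",
    data_files={"train": "metadata.csv"},
    streaming=True,  # rows are yielded lazily, keeping peak RAM flat
)["train"]

for i, example in enumerate(stream):
    print(example)   # peek at a few rows to confirm the schema
    if i == 2:
        break
```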

ThibaultCastells commented 1 year ago

Hello, thanks for utilizing our work 😊

~I have a few questions to better understand your issue:~

edit: sorry for misunderstanding the situation; I've checked your discussions [1] [2]. We will get back to you soon.

bokyeong1015 commented 1 year ago

@youngwanLEE Thanks for sharing your update. Happy to know you are working with large-scale data :)

We haven't worked with a dataset as large as the one you're considering (for clarity, we used 0.22M or 2.3M pairs from LAION-Aesthetics V2).

We haven't encountered the error you mentioned (a process suddenly killed during data loading).


Sorry we cannot offer a definitive opinion, because we haven't experimented with such large data using multi-GPU training. However, your point ("may be caused by Hugging Face's datasets library") seems reasonable, and the issue may be due to multi-GPU loading of a huge dataset [1] [2, in Korean].

One suggestion would be to report this issue at https://github.com/huggingface/datasets. It would be much appreciated if you could share any updates and/or your solution to this issue.
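
One commonly suggested guard for the multi-GPU case, shown as a hedged sketch below (we have not verified it at 89M-pair scale), is to let only the main process prepare and cache the dataset while the other ranks wait:

```python
# Hedged sketch of a common multi-GPU workaround (unverified on 89M pairs):
# only the main process builds the dataset cache; the other ranks block on
# the context manager and then read the already-prepared cache.
from accelerate import Accelerator
from datasets import load_dataset

accelerator = Accelerator()

with accelerator.main_process_first():
    # rank 0 prepares the Arrow cache first; remaining ranks reuse it afterwards
    dataset = load_dataset("csv", data_files={"train": "metadata.csv"})["train"]

print(f"rank {accelerator.process_index}: {len(dataset)} rows")
```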

youngwanLEE commented 1 year ago

@bokyeong1015 Thanks for the reply :)

BTW, I was wondering when the 2M dataset from issue #15 will become available.

bokyeong1015 commented 1 year ago

@youngwanLEE Thank you for your inquiry :)

The 2.3M dataset is now downloadable; please check this link if you are interested!

youngwanLEE commented 1 year ago

@bokyeong1015 Thanks!

When I tried to download the data, an error occurred:

--2023-09-03 08:36:39--  https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/data/improved_aesthetics_6.5plus/preprocessed_2256k.tar.gz
Resolving netspresso-research-code-release.s3.us-east-2.amazonaws.com (netspresso-research-code-release.s3.us-east-2.amazonaws.com)... 52.219.94.146, 52.219.108.82, 52.219.100.216, ...
Connecting to netspresso-research-code-release.s3.us-east-2.amazonaws.com (netspresso-research-code-release.s3.us-east-2.amazonaws.com)|52.219.94.146|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-09-03 08:36:40 ERROR 403: Forbidden.

It may be caused by the link using the same address as that of the 11K or 212K datasets.

bokyeong1015 commented 1 year ago

@youngwanLEE thanks for reaching out.

Based on the log message, and as you correctly analyzed ("the same address as that of the 11K or 212K datasets"), the URL should be:

S3_URL="https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/data/improved_aesthetics_6.25plus/preprocessed_2256k.tar.gz"

Could you kindly try out the above link?
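
If it is more convenient, the same download can be scripted in Python; this is just a convenience sketch, and the local file name is arbitrary:

```python
# Minimal sketch: fetch the tarball from the corrected path
# (improved_aesthetics_6.25plus, not 6.5plus as in the failing request above).
import urllib.request

S3_URL = (
    "https://netspresso-research-code-release.s3.us-east-2.amazonaws.com"
    "/data/improved_aesthetics_6.25plus/preprocessed_2256k.tar.gz"
)
urllib.request.urlretrieve(S3_URL, "preprocessed_2256k.tar.gz")
```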


FYI: the dataset details can be found in MODEL_CARD.md

youngwanLEE commented 1 year ago

@bokyeong1015 Thanks!! It worked :)

I have another question.

I split the LAION-Aesthetics V2 5+ dataset into several subsets (e.g., 5M, 10M, and 89M pairs) and made a metadata.csv for each subset.

Then, when I tried to train with multiple GPUs on one of these subsets, I faced the error below.

I guess that the problem was caused by the data itself.

FYI, I didn't pre-process the data except for resizing to 512x512 resolution when I downloaded it.

Did you also face this problem?

Or did you conduct any pre-processing of the LAION data?

Steps: 0%| | 283/400000 [35:52<813:24:06, 7.33s/it, kd_feat_loss=58.6, kd_output_loss=0.0447, lr=5e-5, sd_loss=0.185, step_loss=58.9]
Traceback (most recent call last):
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 1171, in <module>
    main()
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 961, in main
    for step, batch in enumerate(train_dataloader):
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/accelerate/data_loader.py", line 388, in __iter__
    next_batch = next(dataloader_iter)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in __getitems__
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <listcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <dictcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
IndexError: index 63 is out of bounds for dimension 0 with size 63
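
For future readers hitting a similar IndexError, a quick check like the following (purely illustrative; the actual resolution is in #32 below) can reveal mismatches between a subset's metadata.csv and the files actually on disk, which is one common cause of out-of-range indexing during loading; the directory path and column name are assumptions:

```python
# Illustrative sanity check only (see #32 for the actual resolution): verify
# that every file referenced in a subset's metadata.csv exists on disk, since a
# row/file mismatch can lead to out-of-range indices at data-loading time.
import csv
import os

data_dir = "./laion_subset_5M"  # hypothetical subset directory
rows, missing = 0, 0
with open(os.path.join(data_dir, "metadata.csv"), newline="") as f:
    for row in csv.DictReader(f):
        rows += 1
        if not os.path.exists(os.path.join(data_dir, row["file_name"])):  # assumed column name
            missing += 1
print(f"{rows} metadata rows, {missing} missing image files")
```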

bokyeong1015 commented 1 year ago

@youngwanLEE We would like to handle this as a separate discussion, since it is a different topic, and to make it easier for other people to find in the future. Could you kindly continue the discussion at that link?

youngwanLEE commented 1 year ago

I resolved this issue (refer to #32).