Nota-NetsPresso / BK-SDM

A Compressed Stable Diffusion for Efficient Text-to-Image Generation [ECCV'24]

data loading problem with 89M pairs #29

Closed youngwanLEE closed 1 year ago

youngwanLEE commented 1 year ago

Hi, thanks to your excellent work, I have conducted many experiments.

When I trained on a subset of LAION-Aesthetics 5+ (about 89M pairs), my training process was killed without a specific error message :(

It seems to have occurred at the load_dataset call.

I guess the training set is too big, but I'm not sure.

I think this problem may be caused by Hugging Face's datasets library.

Have you ever faced this problem? And have you tried to train your model on a much bigger training set?

Thanks in advance :)

[screenshot attached]
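
For reference, a minimal sketch of one way to avoid materializing such a huge index in RAM is to stream the metadata with the datasets library; the file name metadata.csv and its column layout are assumptions here, not the repo's actual training setup.

```python
# Hedged sketch (not the repo's kd_train_text_to_image.py): stream the caption
# metadata instead of building the full in-memory index for ~89M pairs.
# "metadata.csv" and its columns are assumptions for illustration.
from datasets import load_dataset

stream = load_dataset(
    "csv",
    data_files={"train": "metadata.csv"},
    streaming=True,  # rows are yielded lazily, keeping peak RAM flat
)["train"]

for i, example in enumerate(stream):
    print(example)   # peek at a few rows to confirm the schema
    if i == 2:
        break
```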

ThibaultCastells commented 1 year ago

Hello, thanks for utilizing our work 😊

~I have a few questions to better understand your issue:~

edit: sorry for misunderstanding the situation; I've checked your discussions [1] [2]. We will get back to you soon.

bokyeong1015 commented 1 year ago

@youngwanLEE Thanks for sharing your update. Happy to know you are working with large-scale data :)

We haven't worked with a dataset as large as the one you're considering (for clarity, we used 0.22M or 2.3M pairs from LAION-Aesthetics V2).

We haven't encountered the error you mentioned (a process suddenly killed during data loading).


Sorry we cannot offer a definitive opinion, because we haven't experimented with such large data using multi-GPU training. However, your point ("may be caused by Hugging Face's datasets library") seems reasonable, and the issue may be due to multi-GPU loading of a huge dataset [1] [2, in Korean].

One suggestion would be to report this issue at https://github.com/huggingface/datasets. It would be much appreciated if you could share any updates and/or your solution to this issue.
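
One commonly suggested guard for the multi-GPU case, shown as a hedged sketch below (we have not verified it at 89M-pair scale), is to let only the main process prepare and cache the dataset while the other ranks wait:

```python
# Hedged sketch of a common multi-GPU workaround (unverified on 89M pairs):
# only the main process builds the dataset cache; the other ranks block on
# the context manager and then read the already-prepared cache.
from accelerate import Accelerator
from datasets import load_dataset

accelerator = Accelerator()

with accelerator.main_process_first():
    # rank 0 prepares the Arrow cache first; remaining ranks reuse it afterwards
    dataset = load_dataset("csv", data_files={"train": "metadata.csv"})["train"]

print(f"rank {accelerator.process_index}: {len(dataset)} rows")
```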

youngwanLEE commented 1 year ago

@bokyeong1015 Thanks for the reply :)

BTW, I was wondering when the 2M dataset from issue #15 will become available.

bokyeong1015 commented 1 year ago

@youngwanLEE Thank you for your inquiry :)

The 2.3M dataset is now downloadable; please check this link if you are interested!

youngwanLEE commented 1 year ago

@bokyeong1015 Thanks!

When I tried to download the data, an error occurred:

--2023-09-03 08:36:39--  https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/data/improved_aesthetics_6.5plus/preprocessed_2256k.tar.gz
Resolving netspresso-research-code-release.s3.us-east-2.amazonaws.com (netspresso-research-code-release.s3.us-east-2.amazonaws.com)... 52.219.94.146, 52.219.108.82, 52.219.100.216, ...
Connecting to netspresso-research-code-release.s3.us-east-2.amazonaws.com (netspresso-research-code-release.s3.us-east-2.amazonaws.com)|52.219.94.146|:443... connected.
HTTP request sent, awaiting response... 403 Forbidden
2023-09-03 08:36:40 ERROR 403: Forbidden.

It may be caused by the link using the same address as that of the 11K or 212K datasets.

bokyeong1015 commented 1 year ago

@youngwanLEE thanks for reaching out.

Based on the log message, and as you correctly analyzed ("the same address as that of the 11K or 212K datasets"), the URL should be:

S3_URL="https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/data/improved_aesthetics_6.25plus/preprocessed_2256k.tar.gz"

Could you kindly try out the above link?
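
If it is more convenient, the same download can be scripted in Python; this is just a convenience sketch, and the local file name is arbitrary:

```python
# Minimal sketch: fetch the tarball from the corrected path
# (improved_aesthetics_6.25plus, not 6.5plus as in the failing request above).
import urllib.request

S3_URL = (
    "https://netspresso-research-code-release.s3.us-east-2.amazonaws.com"
    "/data/improved_aesthetics_6.25plus/preprocessed_2256k.tar.gz"
)
urllib.request.urlretrieve(S3_URL, "preprocessed_2256k.tar.gz")
```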


FYI: the dataset details can be found in MODEL_CARD.md

youngwanLEE commented 1 year ago

@bokyeong1015 Thanks!! It worked :)

I have another question.

I split the LAION-Aesthetics V2 5+ dataset into several subsets (e.g., 5M, 10M, and 89M pairs) and made a metadata.csv for each subset.

Then, when I tried to train with multiple GPUs on one of these subsets, I faced the error below.

I guess that the problem was caused by the data itself.

FYI, I didn't pre-process the data except for resizing to 512x512 resolution when I downloaded it.

Did you also face this problem?

Or did you conduct any pre-processing of the LAION data?

Steps: 0%| | 283/400000 [35:52<813:24:06, 7.33s/it, kd_feat_loss=58.6, kd_output_loss=0.0447, lr=5e-5, sd_loss=0.185, step_loss=58.9]
Traceback (most recent call last):
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 1171, in <module>
    main()
  File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 961, in main
    for step, batch in enumerate(train_dataloader):
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/accelerate/data_loader.py", line 388, in __iter__
    next_batch = next(dataloader_iter)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
    data = self.dataset.__getitems__(possibly_batched_index)
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in __getitems__
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <listcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
  File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <dictcomp>
    return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
IndexError: index 63 is out of bounds for dimension 0 with size 63
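
For future readers hitting a similar IndexError, a quick check like the following (purely illustrative; the actual resolution is in #32 below) can reveal mismatches between a subset's metadata.csv and the files actually on disk, which is one common cause of out-of-range indexing during loading; the directory path and column name are assumptions:

```python
# Illustrative sanity check only (see #32 for the actual resolution): verify
# that every file referenced in a subset's metadata.csv exists on disk, since a
# row/file mismatch can lead to out-of-range indices at data-loading time.
import csv
import os

data_dir = "./laion_subset_5M"  # hypothetical subset directory
rows, missing = 0, 0
with open(os.path.join(data_dir, "metadata.csv"), newline="") as f:
    for row in csv.DictReader(f):
        rows += 1
        if not os.path.exists(os.path.join(data_dir, row["file_name"])):  # assumed column name
            missing += 1
print(f"{rows} metadata rows, {missing} missing image files")
```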

bokyeong1015 commented 1 year ago

@youngwanLEE We would like to handle this as a separate discussion, since it is a different topic, and to make it easier for other people to find in the future. Could you kindly continue the discussion at that link?

youngwanLEE commented 1 year ago

I resolved this issue (refer to #32).