bghira / SimpleTuner

A general fine-tuning kit geared toward diffusion models.
GNU Affero General Public License v3.0

Why is num_workers=0 the default? #947

Closed LianShuaiLong closed 2 months ago

LianShuaiLong commented 2 months ago

https://github.com/bghira/SimpleTuner/blob/e4a0adf9326960489c50abdbadb1f45fb1ef92e3/helpers/data_backend/factory.py#L883

    init_backend["train_dataloader"] = torch.utils.data.DataLoader(
        init_backend["train_dataset"],
        batch_size=1,    # The sampler handles batching
        shuffle=False,   # The sampler handles shuffling
        sampler=init_backend["sampler"],
        collate_fn=lambda examples: collate_fn(examples),
        num_workers=0,
        persistent_workers=False,
    )

I want to understand why the num_workers parameter is set to 0 by default and why batch_size is set to 1. Additionally, during multi-GPU training, I couldn't find any code related to data sampling with DistributedSampler. Could you please help me understand this?
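For readers unfamiliar with this pattern, here is a minimal, self-contained sketch of what "the sampler handles batching" can look like in plain PyTorch. This is not SimpleTuner's actual sampler or dataset; the class names and collate logic below are made up purely to show why batch_size=1 and shuffle=False are consistent when the sampler already emits whole, shuffled batches of indices:

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler


# Hypothetical illustration, not SimpleTuner's classes: the sampler emits a
# whole batch of indices per step, the dataset resolves that list into a
# batch of examples, and the DataLoader's own batch_size stays at 1.
class ListDataset(Dataset):
    def __init__(self, tensors):
        self.tensors = tensors

    def __getitem__(self, indices):
        # Receives the list of indices produced by the sampler below.
        return [self.tensors[i] for i in indices]

    def __len__(self):
        return len(self.tensors)


class BatchYieldingSampler(Sampler):
    def __init__(self, length, batch_size, shuffle=True):
        self.length = length
        self.batch_size = batch_size
        self.shuffle = shuffle

    def __iter__(self):
        order = torch.randperm(self.length) if self.shuffle else torch.arange(self.length)
        for start in range(0, self.length, self.batch_size):
            # Each "sample" handed to the DataLoader is already a batch of indices.
            yield order[start:start + self.batch_size].tolist()

    def __len__(self):
        return (self.length + self.batch_size - 1) // self.batch_size


dataset = ListDataset([torch.randn(3) for _ in range(10)])
sampler = BatchYieldingSampler(len(dataset), batch_size=4)
loader = DataLoader(
    dataset,
    batch_size=1,    # the sampler already batches
    shuffle=False,   # the sampler already shuffles
    sampler=sampler,
    collate_fn=lambda examples: torch.stack(examples[0]),  # unwrap the single pre-formed batch
    num_workers=0,
)

for batch in loader:
    print(batch.shape)  # e.g. torch.Size([4, 3]); the last batch may be smaller
```

In this arrangement, raising the DataLoader's batch_size or shuffle would double-batch or re-shuffle what the sampler has already decided, which is why those arguments are pinned.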

LianShuaiLong commented 2 months ago

Additionally, during multi-GPU training, is it necessary for me to modify the num_workers parameter to speed up data loading? Where can I make this modification? Currently, I am encountering timeout errors when training with a large dataset (approximately 1 million data points).

bghira commented 2 months ago

Nope, everything is fixed in the way that it needs to be. You didn't provide enough info to answer the question. We don't use DistributedSampler; everything is custom-made for aspect ratio bucketing.
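To make the aspect-ratio-bucketing point concrete, here is a rough conceptual sketch, not SimpleTuner's code, of why a stock DistributedSampler doesn't fit: batches have to be formed within an aspect-ratio bucket first (so every image in a batch shares a resolution), and only then sharded across ranks. The function, metadata schema, and "bucket" key below are all hypothetical:

```python
import random
from collections import defaultdict


# Conceptual sketch only (not SimpleTuner's implementation): group images by
# aspect-ratio bucket so every batch has a uniform shape, then shard the
# resulting batches across ranks instead of using DistributedSampler.
def build_bucketed_batches(metadata, batch_size, rank, world_size, seed=0):
    """metadata: list of dicts like {"path": ..., "bucket": "1.0"}, where
    "bucket" identifies the image's aspect-ratio bucket (hypothetical schema)."""
    buckets = defaultdict(list)
    for entry in metadata:
        buckets[entry["bucket"]].append(entry["path"])

    rng = random.Random(seed)
    batches = []
    for bucket_paths in buckets.values():
        rng.shuffle(bucket_paths)
        # Keep only full batches so each batch contains a single aspect ratio.
        for i in range(0, len(bucket_paths) - batch_size + 1, batch_size):
            batches.append(bucket_paths[i:i + batch_size])

    rng.shuffle(batches)
    # Manual sharding: each rank takes every world_size-th batch.
    return batches[rank::world_size]
```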

bghira commented 2 months ago

Maybe you are using the discovery backend instead of parquet for metadata.

With >1 million samples you really should be using parquet tables rather than discovery, which has to read every single image in the set. Please see the DATALOADER doc for more info on the parquet backend.
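For illustration only (the column names, paths, and helper functions below are made up; see the DATALOADER documentation for the real parquet schema), this sketch contrasts discovery-style metadata gathering, which opens every image file, with reading a precomputed parquet table in a single call:

```python
import pandas as pd
from PIL import Image


# Hypothetical comparison; column names and paths are invented for this sketch.

def discover_metadata(image_paths):
    """Discovery-style: every image must be opened just to learn its dimensions."""
    rows = []
    for path in image_paths:
        with Image.open(path) as img:  # one file open per sample
            rows.append({"filename": path, "width": img.width, "height": img.height})
    return pd.DataFrame(rows)


def load_metadata_from_parquet(parquet_path):
    """Parquet-style: the same metadata is precomputed once and read in one call."""
    return pd.read_parquet(parquet_path)  # columns e.g. filename, width, height, caption
```

With roughly a million samples, the discovery approach means a million file opens before training can even bucket the data, which is where long startup stalls and timeouts tend to come from.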