Closed by LianShuaiLong 2 months ago
Additionally, during multi-GPU training, is it necessary for me to modify the num_workers parameter to speed up data loading? Where can I make this modification? Currently, I am encountering timeout errors when training with a large dataset (approximately 1 million data points).
Nope, everything is fixed the way it needs to be. You didn't provide enough info to answer the question, and we don't use DistributedSampler; everything is custom-made for aspect ratio bucketing.
Maybe you are using the discovery backend instead of parquet for metadata.
With >1 million samples you really should be using parquet tables rather than discovery; the discovery backend has to read every single image in the set. Please see the DATALOADER doc for more info on the parquet backend.
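As a rough illustration of what that buys you: with a parquet table, metadata such as image dimensions and captions comes from a single file, instead of every image being opened during discovery at startup. Below is a minimal sketch of building such a table offline with pandas; the column names and the sidecar `.txt` caption convention are illustrative assumptions, so match them to your dataloader configuration and the DATALOADER doc.

```python
# Sketch: precompute a parquet metadata table for a large image set.
# Requires pandas + pyarrow and Pillow. Column names are illustrative.
import os
import pandas as pd
from PIL import Image

data_dir = "/path/to/images"
rows = []
for name in os.listdir(data_dir):
    if not name.lower().endswith((".png", ".jpg", ".jpeg", ".webp")):
        continue
    path = os.path.join(data_dir, name)
    with Image.open(path) as img:
        width, height = img.size
    # Assume an optional sidecar caption file next to each image.
    caption_path = os.path.splitext(path)[0] + ".txt"
    caption = ""
    if os.path.exists(caption_path):
        with open(caption_path, "r", encoding="utf-8") as f:
            caption = f.read().strip()
    rows.append({"filename": name, "caption": caption, "width": width, "height": height})

pd.DataFrame(rows).to_parquet("metadata.parquet", index=False)
```

This still opens every image once, but only once and ahead of time, rather than at every training launch.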
https://github.com/bghira/SimpleTuner/blob/e4a0adf9326960489c50abdbadb1f45fb1ef92e3/helpers/data_backend/factory.py#L883
```python
init_backend["train_dataloader"] = torch.utils.data.DataLoader(
    init_backend["train_dataset"],
    batch_size=1,  # The sampler handles batching
    shuffle=False,  # The sampler handles shuffling
    sampler=init_backend["sampler"],
    collate_fn=lambda examples: collate_fn(examples),
    num_workers=0,
    persistent_workers=False,
)
```

I want to understand why the num_workers parameter is set to 0 by default and why batch_size is set to 1 by default. Additionally, during multi-GPU training, I couldn't find any code related to data sampling with DistributedSampler. Could you please help clear up my confusion?
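For context on why those defaults make sense with a batch-yielding sampler, here is a minimal, self-contained sketch of the general pattern. This is not SimpleTuner's actual sampler; the class names are hypothetical. The idea: the sampler emits whole batches grouped by aspect-ratio bucket and shards them by rank, so the DataLoader's own batching and shuffling, and a DistributedSampler, are unnecessary, and batch_size=1 plus an unwrapping collate_fn follows naturally.

```python
import random
import torch
from torch.utils.data import Dataset, DataLoader, Sampler


class BucketBatchSampler(Sampler):
    """Yields whole batches of indices, grouped by aspect-ratio bucket and
    sharded across ranks manually (in place of DistributedSampler)."""

    def __init__(self, buckets, batch_size, rank=0, world_size=1, seed=0):
        self.buckets = buckets          # bucket key -> list of sample indices
        self.batch_size = batch_size
        self.rank = rank
        self.world_size = world_size
        self.seed = seed

    def __iter__(self):
        rng = random.Random(self.seed)
        batches = []
        for indices in self.buckets.values():
            indices = list(indices)
            rng.shuffle(indices)
            # Batches never mix buckets, so every image in a batch shares a shape.
            # The incomplete tail of each bucket is dropped in this sketch.
            for i in range(0, len(indices) - self.batch_size + 1, self.batch_size):
                batches.append(indices[i:i + self.batch_size])
        rng.shuffle(batches)
        # Each rank takes every world_size-th batch: manual multi-GPU sharding.
        yield from batches[self.rank::self.world_size]


class BatchAwareDataset(Dataset):
    """__getitem__ receives a whole batch of indices from the sampler."""

    def __init__(self, samples):
        self.samples = samples

    def __getitem__(self, batch_indices):
        return [self.samples[i] for i in batch_indices]

    def __len__(self):
        return len(self.samples)


samples = [torch.randn(3, 64, 64) for _ in range(16)]
buckets = {"1.0": list(range(16))}
loader = DataLoader(
    BatchAwareDataset(samples),
    batch_size=1,                      # the sampler already emits full batches
    sampler=BucketBatchSampler(buckets, batch_size=4),
    collate_fn=lambda examples: torch.stack(examples[0]),  # unwrap the single pre-built batch
    num_workers=0,                     # sampler state stays in the main process
)
for batch in loader:
    print(batch.shape)                 # torch.Size([4, 3, 64, 64])
```

As for num_workers=0: keeping the loader single-process means the sampler and its bucket state live in the main process rather than being duplicated across workers. That is a common choice when the sampler is stateful; whether it is the exact rationale in SimpleTuner is for the maintainer to confirm, and the reply above indicates these values are intentional.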