dcharatan / pixelsplat

[CVPR 2024 Oral, Best Paper Runner-Up] Code for "pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction" by David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann
http://davidcharatan.com/pixelsplat/
MIT License

All dataloader workers use the same data #41

Closed yijicheng closed 6 months ago

yijicheng commented 6 months ago

I found that all dataloader workers use the same data for the training images.

During training, why is it not necessary to alternate the chunks across workers? https://github.com/dcharatan/pixelsplat/blob/e70a337b2627c003dc0088bb47f095b8da73d65d/src/dataset/dataset_re10k.py#L78
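
(For context, "alternating the chunks" presumably refers to the common IterableDataset pattern of striding the chunk list by worker id so that each worker reads a disjoint subset. The snippet below is a generic illustration of that pattern, not code copied from the repository.)

```python
# Generic illustration of "alternating chunks" across dataloader workers
# (not the repository's code): each worker keeps every num_workers-th chunk,
# so no two workers read the same chunk file.
from torch.utils.data import get_worker_info


def select_chunks_for_worker(chunks):
    worker_info = get_worker_info()
    if worker_info is None:
        return chunks  # single-process loading: keep everything
    return chunks[worker_info.id :: worker_info.num_workers]
```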

dcharatan commented 6 months ago

During training, the chunks and the examples within each chunk are supposed to be randomly shuffled instead:

https://github.com/dcharatan/pixelsplat/blob/e70a337b2627c003dc0088bb47f095b8da73d65d/src/dataset/dataset_re10k.py#L75-L76

https://github.com/dcharatan/pixelsplat/blob/e70a337b2627c003dc0088bb47f095b8da73d65d/src/dataset/dataset_re10k.py#L96-L97
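
In rough terms, the intended training-time behavior is the sketch below (illustrative names only, not the actual dataset class): shuffle the order of the chunk files, then shuffle the examples within each chunk.

```python
# Rough sketch of the intended training-time shuffling (illustrative only).
import torch
from torch.utils.data import IterableDataset


class ShuffledChunkDataset(IterableDataset):
    def __init__(self, chunk_paths, seed=0):
        self.chunk_paths = list(chunk_paths)
        self.generator = torch.Generator().manual_seed(seed)

    def __iter__(self):
        # Visit the chunk files in a random order.
        chunk_order = torch.randperm(len(self.chunk_paths), generator=self.generator)
        for chunk_index in chunk_order.tolist():
            chunk = torch.load(self.chunk_paths[chunk_index])  # assumed: a list of examples
            # Yield the examples within the chunk in a random order.
            example_order = torch.randperm(len(chunk), generator=self.generator)
            for example_index in example_order.tolist():
                yield chunk[example_index]
```

Note that with a fixed seed and no rank- or worker-dependent offset, every dataloader worker and every GPU rank produces the same order, which is the behavior reported below.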

When I set data_loader.train.num_workers=8 and data_loader.train.batch_size=8, it seems like the chunks are being randomized for me when I inspect batch["scene"] at the beginning of training_step:

batch["scene"] # ['0f93fdb52c6933cf', 'a3a5e373d876db0e', 'c1f32a7a7ec37d39', '524e550992aa6e60', '7300b62195688898', '4f202bd24cdf8dee', 'c495f01f294333ee', '2eacf9db1843546b']

Can you provide more information about how and where you're seeing that the workers aren't being shuffled properly?

yijicheng commented 6 months ago

> that the workers aren't being shuffled properly

I added print(torch.distributed.get_rank(), batch["scene"]) at the beginning of training_step and set batch_size=1 (4 GPUs):

0 ['0f93fdb52c6933cf']
1 ['0f93fdb52c6933cf']
2 ['0f93fdb52c6933cf']
3 ['0f93fdb52c6933cf']

This means every GPU gets the same scene. Is that correct for training?
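
A rank-aware version of that check could look like the sketch below (log_scenes is a hypothetical helper, not part of the codebase):

```python
# Hypothetical helper for the check described above: print which scenes each
# rank sees at the start of a training step.
import torch.distributed as dist


def log_scenes(batch):
    rank = dist.get_rank() if dist.is_initialized() else 0
    # Identical output across ranks on the same step means the ranks are
    # training on duplicated data.
    print(rank, batch["scene"])
```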

dcharatan commented 6 months ago

It looks like this is a bug during multi-GPU training! Thank you for pointing this out—I hadn't noticed this because we settled on single-GPU training with a batch size of 7. I've pushed a change which ensures that the data loaders are seeded correctly based on the global rank, which should fix the issue.
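
For readers hitting the same issue, the general idea is to fold the global rank (and worker id) into the seed that drives the shuffling. The sketch below illustrates one way to do that; it is not the actual commit, and the helper name and base-seed handling are assumptions.

```python
# Sketch of per-rank/per-worker seeding (illustrative, not the actual fix):
# give every (rank, worker) pair a distinct RNG so that no two of them
# traverse the chunks in the same order.
import torch
import torch.distributed as dist
from torch.utils.data import get_worker_info


def make_shuffle_generator(base_seed: int = 0) -> torch.Generator:
    rank = dist.get_rank() if dist.is_initialized() else 0
    worker_info = get_worker_info()
    worker_id = worker_info.id if worker_info is not None else 0
    num_workers = worker_info.num_workers if worker_info is not None else 1
    # Each (rank, worker) pair gets a distinct seed.
    seed = base_seed + rank * num_workers + worker_id
    return torch.Generator().manual_seed(seed)
```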