allenai / rslearn

A tool for developing remote sensing datasets and models.
Apache License 2.0

Slow training with large batch size #72

Open favyen2 opened 2 weeks ago

favyen2 commented 2 weeks ago

PyTorch's default DataLoader has each worker load a whole batch, so with a large batch size like batch_size=64, every worker loads 64 windows sequentially, and training cannot start until the first worker has finished its entire batch.

Maybe it would be better to form each batch by combining items across workers, given that loading from GCS can be slow?
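
A rough sketch of what combining items across workers could look like, not a tested rslearn change. The idea is to disable automatic batching in the DataLoader (batch_size=None) so that each worker returns one window at a time, and then group the incoming windows into batches in the main process. The helper name `rebatch` and its parameters are hypothetical; the function itself is plain Python and works on any iterable of samples.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")
B = TypeVar("B")


def rebatch(
    samples: Iterable[T],
    batch_size: int,
    collate: Callable[[List[T]], B] = list,
    drop_last: bool = False,
) -> Iterator[B]:
    """Group a stream of single samples into batches of `batch_size`.

    `samples` is assumed to be an iterable that yields one item at a time,
    for example a DataLoader created with batch_size=None so that every
    worker contributes individual windows instead of whole batches.
    `collate` turns a list of samples into a batch (hypothetical default:
    keep the plain list).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(samples)
    while True:
        # Pull up to batch_size samples from the shared stream.
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        if drop_last and len(chunk) < batch_size:
            return
        yield collate(chunk)
```

With PyTorch this would presumably be used as rebatch(DataLoader(dataset, batch_size=None, num_workers=N), 64, collate=torch.utils.data.default_collate). With N workers, the first batch then needs roughly 64/N sequential window loads per worker instead of 64. One caveat: prefetch_factor then counts single samples rather than batches, so it may need to be raised to keep the workers busy.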

Or maybe we should switch to Weka and find a good way to sync between GCS and Weka; that could solve the problem.