etienne87 / pytorch-stream-dataloader

MIT License
48 stars 3 forks

How to support num workers > batch_size ? #10

Open kaixinbear opened 1 year ago

kaixinbear commented 1 year ago

Hi, etienne87: I noticed that this repo caps num_workers at batch_size (https://github.com/etienne87/pytorch-stream-dataloader/blob/master/pytorch_stream_dataloader/stream_dataloader.py#L45). However, this makes the dataloader much slower than the vanilla dataloader, which increases training time. Do you have any suggestions on how to support num_workers the way the original PyTorch dataloader does?

etienne87 commented 1 year ago

Hi @kaixinbear. The original intent of the repository was to stream from uninterrupted stream readers (e.g. hard-to-seek video, online data, etc.), so by definition 1 process = 1 or more streams. The use case where 1 stream is handled by several processes in parallel (so num_workers can be >> batch_size) is possible if you can seek inside your stream. In that case you can use the original PyTorch dataloader (you just need to map the batch index to a file position and use it in the `__getitem__` method). However, I am not totally convinced this is so much slower; can you provide an example?
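
For illustration, here is a minimal sketch (not part of this repository) of the seekable-stream approach described above: a standard map-style Dataset whose `__getitem__` maps the sample index to a byte offset in a single large file, so the default PyTorch DataLoader can use any num_workers independently of batch_size. The fixed-size binary record layout and the file name `stream.bin` are assumptions made for the example; for video you would seek with a decoder that supports random access instead.

```python
# Hypothetical sketch: a seekable "stream" read by a plain map-style Dataset,
# so the standard PyTorch DataLoader (any num_workers) can be used directly.
import os

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class SeekableStreamDataset(Dataset):
    """Reads fixed-size float32 records from one large binary file by seeking."""

    def __init__(self, path, record_size=1024):
        self.path = path
        self.record_size = record_size            # floats per record (assumed layout)
        self.record_bytes = record_size * 4       # float32 -> 4 bytes each
        self.num_records = os.path.getsize(path) // self.record_bytes

    def __len__(self):
        return self.num_records

    def __getitem__(self, index):
        # Map the sample index to a file position, then seek and read one record.
        with open(self.path, "rb") as f:
            f.seek(index * self.record_bytes)
            buf = f.read(self.record_bytes)
        return torch.from_numpy(np.frombuffer(buf, dtype=np.float32).copy())


if __name__ == "__main__":
    # Example usage: num_workers is independent of batch_size here.
    dataset = SeekableStreamDataset("stream.bin")
    loader = DataLoader(dataset, batch_size=4, num_workers=8, shuffle=True)
    for batch in loader:
        print(batch.shape)
        break
```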