etienne87 / pytorch-stream-dataloader

MIT License
48 stars 3 forks

How to support num workers > batch_size ? #10

Open kaixinbear opened 1 year ago

kaixinbear commented 1 year ago

Hi, etienne87: I noticed that this repo caps num_workers at batch_size (https://github.com/etienne87/pytorch-stream-dataloader/blob/master/pytorch_stream_dataloader/stream_dataloader.py#L45). However, this makes the dataloader much slower than the vanilla dataloader, which increases training time. Do you have any suggestions on how to support num_workers the way the original PyTorch dataloader does?

etienne87 commented 1 year ago

Hi @kaixinbear. The original intent of the repository was to stream from uninterrupted stream readers (e.g. hard-to-seek video, online data, etc.), so by definition 1 process = 1 or more streams. The use case where 1 stream is handled by several processes in parallel (so num_workers can be >> batch_size) is possible if you can seek inside your stream. In that case you can use the original PyTorch dataloader (you just need to map the batch index to a file position and use it in the `__getitem__` method). However, I am not totally convinced this is so much slower; can you provide an example?
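
For illustration, here is a minimal sketch (not part of this repository) of the seekable-stream approach described above: a standard map-style Dataset whose `__getitem__` maps the sample index to a byte offset in a single large file, so the default PyTorch DataLoader can use any num_workers independently of batch_size. The fixed-size binary record layout and the file name `stream.bin` are assumptions made for the example; for video you would seek with a decoder that supports random access instead.

```python
# Hypothetical sketch: a seekable "stream" read by a plain map-style Dataset,
# so the standard PyTorch DataLoader (any num_workers) can be used directly.
import os

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class SeekableStreamDataset(Dataset):
    """Reads fixed-size float32 records from one large binary file by seeking."""

    def __init__(self, path, record_size=1024):
        self.path = path
        self.record_size = record_size            # floats per record (assumed layout)
        self.record_bytes = record_size * 4       # float32 -> 4 bytes each
        self.num_records = os.path.getsize(path) // self.record_bytes

    def __len__(self):
        return self.num_records

    def __getitem__(self, index):
        # Map the sample index to a file position, then seek and read one record.
        with open(self.path, "rb") as f:
            f.seek(index * self.record_bytes)
            buf = f.read(self.record_bytes)
        return torch.from_numpy(np.frombuffer(buf, dtype=np.float32).copy())


if __name__ == "__main__":
    # Example usage: num_workers is independent of batch_size here.
    dataset = SeekableStreamDataset("stream.bin")
    loader = DataLoader(dataset, batch_size=4, num_workers=8, shuffle=True)
    for batch in loader:
        print(batch.shape)
        break
```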