DIAGNijmegen / pathology-whole-slide-data

A package for working with whole-slide data including a fast batch iterator that can be used to train deep learning models.
https://diagnijmegen.github.io/pathology-whole-slide-data/
Apache License 2.0

Advantage of ConcurrentBuffer against standard pytorch data loader #38

Closed CharlieCheckpt closed 1 year ago

CharlieCheckpt commented 1 year ago

Dear author,

Thank you for this nice repo.

I am a bit confused regarding the use of ConcurrentBuffer, which, as the README states, "uses shared memory and allows for loading patches quickly via multiple workers".

What is the advantage of using such a method over a PyTorch DataLoader that reads tiles in parallel with multiple workers?

martvanrijthoven commented 1 year ago

Dear CharlieCheckpt,

Thank you for your interest in this repository and for asking a very insightful question!

The ConcurrentBuffer system that we've implemented is different from the PyTorch Dataloader due to its sampling strategy.

A PyTorch DataLoader can utilize multiple workers to load data in parallel, but the overall sampling strategy is somewhat disjointed: each worker samples independently, without a global view of what the other workers have drawn (as far as I know).
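To make the independent-sampling point concrete, here is a minimal, purely illustrative sketch (not the library's or PyTorch's actual API): two simulated "workers", each with its own RNG, draw annotations without any shared coordination, so one annotation may be drawn repeatedly before another is seen at all.

```python
import random

# Hypothetical annotation identifiers, for illustration only.
annotations = ["tumor_1", "tumor_2", "stroma_1", "stroma_2"]

def independent_worker(seed, n_samples):
    """Mimics a DataLoader worker: samples with its own private RNG,
    with no knowledge of what other workers have already drawn."""
    rng = random.Random(seed)
    return [rng.choice(annotations) for _ in range(n_samples)]

worker_a = independent_worker(seed=0, n_samples=4)
worker_b = independent_worker(seed=1, n_samples=4)
# Neither worker can see the other's draws, so the combined stream of
# samples follows no coordinated, global strategy.
```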

The ConcurrentBuffer, on the other hand, is based on a commander-producer architecture, providing a more controlled and globally aware sampling approach. The "commander" is aware of the overall status and coordinates the "producers" (workers) to load data according to a specific strategy. This controlled approach to sampling can be advantageous in certain situations: for instance, when you want to ensure that every annotation of a specific label has been used to sample patches before any of those annotations is resampled, or when you need to implement more complex sampling strategies that require knowledge of global sampling statistics.
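As a toy sketch of the exhaustion-before-resampling idea (the class and names below are hypothetical, not the package's implementation): a single globally aware sampler hands out every annotation of a label exactly once before reshuffling and reusing any of them.

```python
import random

class ExhaustiveLabelSampler:
    """Toy "commander": tracks global state so that all annotations of a
    label are sampled once before any of them is resampled."""

    def __init__(self, annotations_by_label, seed=0):
        self._pools = {lbl: list(anns) for lbl, anns in annotations_by_label.items()}
        self._queues = {lbl: [] for lbl in self._pools}
        self._rng = random.Random(seed)

    def sample(self, label):
        if not self._queues[label]:  # pool exhausted: reshuffle and refill
            self._queues[label] = self._pools[label][:]
            self._rng.shuffle(self._queues[label])
        return self._queues[label].pop()

sampler = ExhaustiveLabelSampler({"tumor": ["t1", "t2", "t3"]})
first_pass = {sampler.sample("tumor") for _ in range(3)}
# All three tumor annotations appear exactly once before any repeat.
```

A per-worker DataLoader sampler cannot easily enforce this guarantee, because no single worker sees the full history of draws; centralizing the decision in one coordinating component is what makes the strategy possible.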

That said, ConcurrentBuffer might not always be necessary or beneficial. For simple applications where independent sampling is sufficient, a PyTorch DataLoader could be more suitable. The choice between the two depends on the specifics of your use case.

I hope this clears things up. Please feel free to ask if you have any further questions.

Best wishes, Mart

CharlieCheckpt commented 1 year ago

Very clear, thank you for this detailed answer @martvanrijthoven !