Open YoniSchirris opened 5 months ago
I thought about this a bit more:
- `AbstractFileWriterCallback._dataset_sizes` is only used internally, to track the size of each dataset and whether we are in the last batch or not.
- The `AbstractFileWriterCallback._dataset_sizes` keys are currently being set using the `slide_identifier`, which requires opening the slide, which is slow.
- The `slide_identifier` in a `SlideImage` is the filepath, if it is not explicitly set.
- In `AbstractFileWriterCallback`, this is even entirely expected, e.g. on line 302 we see `if self._tile_counter[curr_filename] == self._dataset_sizes[curr_filename]:` ...

So we can use `_path` from the dataset class to set the `_dataset_sizes` keys, which will be faster and not lose any functionality. If, in the future, we want to support an identifier WITHIN this class, that can be considered a feature request that requires some more refactoring.
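A minimal sketch of the idea, not the ahcore implementation: `FakeSlideDataset`, `build_dataset_sizes` and `is_last_tile` are hypothetical names made up for illustration, and the only thing taken from the discussion is that each dataset exposes a `_path` and that the writer compares a per-file tile counter against `_dataset_sizes`. It shows that keying the size dict by path never needs to open a slide, while the last-tile check keeps working.

```python
from collections import Counter
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FakeSlideDataset:
    """Hypothetical stand-in for one per-slide tile dataset (not an ahcore or dlup class)."""

    _path: Path
    num_tiles: int
    times_opened: int = 0

    def __len__(self) -> int:
        return self.num_tiles

    @property
    def slide_identifier(self) -> str:
        # Simulates the slow route: resolving the identifier opens the slide.
        self.times_opened += 1
        return str(self._path)


def build_dataset_sizes(datasets: list[FakeSlideDataset]) -> dict[str, int]:
    """Key the size dict by file path, which never touches the slide itself."""
    return {str(ds._path): len(ds) for ds in datasets}


def is_last_tile(tile_counter: Counter, dataset_sizes: dict[str, int], curr_filename: str) -> bool:
    """Count one processed tile and report whether the slide is now complete."""
    tile_counter[curr_filename] += 1
    return tile_counter[curr_filename] == dataset_sizes[curr_filename]


datasets = [FakeSlideDataset(Path("a.svs"), 2), FakeSlideDataset(Path("b.svs"), 3)]
dataset_sizes = build_dataset_sizes(datasets)
tile_counter: Counter = Counter()
flags = [is_last_tile(tile_counter, dataset_sizes, "a.svs") for _ in range(2)]
```

In this sketch `times_opened` stays at zero for every dataset, because `slide_identifier` is never evaluated while building the dict. Whether the real dataset class exposes `_path` in exactly this form would need to be checked against the code.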
[2024-06-12 12:00:19,387][ahcore.data.dataset.DlupDataModule][INFO] - Dataset for stage predict has 773079 samples and the following statistics:
- Mean: 485.30
- Std: 145.56
- Min: 48.00
- Max: 1056.00
[2024-06-12 12:00:19,393][ahcore.callbacks.abstract_writer_callback][DEBUG] - Prediction epoch start
[2024-06-12 12:00:19,416][ahcore.callbacks.converters.common][INFO] - Starting worker for TiffConverterCallback
[2024-06-12 12:00:19,432][ahcore.callbacks.converters.common][INFO] - Starting worker for TiffConverterCallback
[2024-06-12 12:00:19,442][ahcore.callbacks.converters.common][DEBUG] - Workers started.
[2024-06-12 12:00:19,447][ahcore.callbacks.converters.common][INFO] - Starting worker for TiffConverterCallback
This fixes the slowness seen above. As soon as the dataset is loaded, the TIFF writer is immediately ready to go and inference starts.
Describe the bug
When running inference, `AbstractWriterCallback` loops over all datasets to construct the `_dataset_size` dict. This opens a slide from cache several times, which can take 1-3 seconds. For a dataset of 1500 WSIs this often takes 20 minutes.

To Reproduce
Run inference on-the-fly (#87) with your `data_dir` and `glob_pattern` set up to find many whole-slide images.

Expected behavior
You'll find that after printing the dataset statistics, it takes a long time to start setting up callback workers.
In my case
Environment
- dlup version: 0.3.38
- How installed: unsure
- Python version: 3.11.9
- Operating System: linux
Quick solution to reduce time by half; in https://github.com/NKI-AI/ahcore/blob/93274e5ed0859813011b81979367189a0b80a932/ahcore/callbacks/abstract_writer_callback.py#L181 change
to
which will likely reduce the time by half