NKI-AI / ahcore

Ahcore is the AI for Oncology core computational pathology toolkit
Apache License 2.0

feat/on-the-fly inference #87

Open YoniSchirris opened 3 months ago

YoniSchirris commented 3 months ago

Fixes #73.

The commit contains some minor comments that need quick fixing.

This PR implements generating an in-memory database on-the-fly.

This is a useful feature if you want to, e.g., run inference using a segmentation model on a set of slides that you do not wish to generate a complete database for.

That is exactly the use-case that it is designed for; running inference of a segmentation model on a glob of slides from a directory, taking only the slide as input (no masks, annotations, labels, patient information).

To achieve this, I have

Possible limitations

YoniSchirris commented 3 months ago

in practice i've noticed that populating the db can take 10 minutes for 1500 wsis. this happens when initializing the datamodule, which initializes the datamanager, which immediately populates the db.

the biggest problem here was opening each slide with dlup to extract the mpp, width, and height, which is completely irrelevant for our task here. the image.mpp was only used in the overwrite_mpp when constructing a dataset, which is also not interesting because if the image has no mpp, there's nothing to overwrite it with.

for now i've made the minimal image even more minimal; it only contains the fp to the slide.

we may also want to think about when to populate the db, which may be a bad design choice to do during datamodule initialization.

Honestly, we can even forgo the entire DB generation and, within datasets_from_on_the_fly_data_description, just loop over the slides with for image in data_description.image_dir.glob(data_description.glob_pattern): and do the rest.

no database models, no engine, no session required.
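A minimal sketch of that database-free variant, assuming the data description exposes only image_dir and glob_pattern as used above. The OnTheFlyDescription dataclass and the iter_slide_paths helper are hypothetical names for illustration, not ahcore's actual API, and constructing the dataset for each slide is left out.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class OnTheFlyDescription:
    """Hypothetical stand-in for the data description; only the two fields used here."""

    image_dir: Path
    glob_pattern: str


def iter_slide_paths(data_description: OnTheFlyDescription) -> Iterator[Path]:
    """Yield slide paths straight from the file system, without any database."""
    # Sorting makes the slide order reproducible across runs and file systems.
    yield from sorted(data_description.image_dir.glob(data_description.glob_pattern))
```

No slide is opened here, so the cost is a single directory listing rather than one file open per slide.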

If we do want to keep the database, because it might add some more functionality later (e.g. when doing feature extraction w/ a mask?) we may want to open it, populate it, and close it, all during the call of datasets_from_on_the_fly_data_description.
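The open, populate, close lifecycle could look roughly like the sketch below. It uses the standard-library sqlite3 module as a stand-in for the project's own database models, engine, and session; the single-column image table and the function name slide_paths_via_temporary_db are assumptions made for illustration only.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import List


def slide_paths_via_temporary_db(image_dir: Path, glob_pattern: str) -> List[Path]:
    """Open an in-memory DB, populate it, query it, and close it within one call."""
    # closing() guarantees the connection is closed when the block exits,
    # so the database never outlives this function call.
    with closing(sqlite3.connect(":memory:")) as connection:
        connection.execute(
            "CREATE TABLE image (id INTEGER PRIMARY KEY, filename TEXT NOT NULL)"
        )
        # Only the file path is stored; no slide is opened to read mpp or size.
        connection.executemany(
            "INSERT INTO image (filename) VALUES (?)",
            [(str(path),) for path in sorted(image_dir.glob(glob_pattern))],
        )
        rows = connection.execute("SELECT filename FROM image ORDER BY id").fetchall()
    return [Path(filename) for (filename,) in rows]
```

With this shape, initializing the datamodule does no database work at all; the cost is paid only when the datasets are actually requested.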