bioimage-io / spec-bioimage-io

Specification for the bioimage.io model description file.
https://bioimage-io.github.io/spec-bioimage-io/
MIT License

Require sensible minimum shapes for models #506

Open k-dominik opened 1 year ago

k-dominik commented 1 year ago

This is a bit of a philosophical topic, and it is a little related to the use in ilastik - I'm not aware of other consumers processing tile-wise (maybe QuPath?!).

Currently, having a minimum shape in the RDF does not mean that the model will produce anything sensible at that shape - see e.g. https://github.com/ilastik/ilastik/issues/2701. The "minimum" shape we enforced in ilastik, (16, 64, 64), is already 2 steps up from the minimum shape described in the model (kind-seashell), (8, 32, 32). However, the output is far from optimal even with this shape. I know one of the arguments for this in the past was that producing something on a small GPU might be better than nothing. I think that's not very user-friendly, especially when targeting users with little to no experience. Since we have the test/example image, can't we enforce that the minimum shape will produce something usable on that image? The issue is probably caused by block-wise normalization; implementing per-dataset statistics could be another workaround.
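To make the arithmetic concrete: a parametrized shape is described by a per-axis minimum plus a step, and every valid tile shape is `min + n * step`. A minimal Python sketch (the step values are only an assumption chosen to match the numbers above; the helper is not part of any library):

```python
def nth_valid_shape(min_shape, step, n):
    """n-th valid shape for a parametrized input: min_shape + n * step (per axis)."""
    return tuple(m + n * s for m, s in zip(min_shape, step))

min_shape = (8, 32, 32)   # minimum zyx shape from the model RDF
step = (4, 16, 16)        # assumed step, picked so that 2 steps give (16, 64, 64)

print(nth_valid_shape(min_shape, step, 0))  # (8, 32, 32)  -> the spec minimum
print(nth_valid_shape(min_shape, step, 2))  # (16, 64, 64) -> the minimum enforced in ilastik
```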

cc: @akreshuk @constantinpape @FynnBe @oeway @esgomezm

FynnBe commented 1 year ago
constantinpape commented 1 year ago

Currently, having a minimum shape in the RDF does not mean that the model will produce anything sensible at that shape - see e.g. ilastik/ilastik#2701.

I agree that this is not optimal, and our initial goal of using the parametrized shapes for a dry-run never panned out because none of the consumer software has implemented it...

I think that's not very user-friendly, especially when targeting users with little to no experience.

I agree. And the initial idea was also that the users with little experience would not need to care because the software automatically chooses a good tile shape via dry-run.

Since we have the test/example image, can't we enforce that the minimum shape will produce something usable on that image?

This is difficult. The numerical result with tiling will be different from the test result (which is produced without tiling), so it's not clear how one could write a generic test for this.

I'm not aware of other consumers processing tile-wise (maybe QuPath?!).

I think deepImageJ also always processes tile-wise, but to my knowledge it uses the shape of the test image to initialize the tile shape. This could also be a workaround you could use in ilastik for now.
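(A rough sketch of that workaround, assuming the test input ships as a numpy file next to the RDF; the file name here is illustrative:)

```python
import numpy as np

# Use the shape of the model's test/example image as the initial tile shape,
# instead of the (possibly too small) minimum shape from the RDF.
test_input = np.load("test_input.npy")   # illustrative file name
tile_shape = test_input.shape
print("initial tile shape:", tile_shape)
```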

Overall, I think the clean solution would be the dry-run functionality. Maybe it would be better to implement it in the core libraries (python and eventually also java), so that every consumer software could use it.

k-dominik commented 1 year ago

Overall, I think the clean solution would be the dry-run functionality. Maybe it would be better to implement it in the core libraries (python and eventually also java), so that every consumer software could use it.

Okay, still - say you have clogged your GPU memory for some reason and the dry-run determines some small-ish shape that gives rubbish results - is that good? Or would it be better to have an actual minimum shape that gives "usable" results, such that if it doesn't fit on the GPU one can switch to CPU?! I just think that, other than the model producer, there's basically no one we could get that information from.
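(For context, a dry-run in this sense could look roughly like the following PyTorch sketch; the function and the candidate-shape list are made up and this is not an existing core-library API. The concern above is exactly that the first shape that happens to fit may still give rubbish results:)

```python
import torch

def dry_run(model, candidate_shapes, device="cuda"):
    """Try candidate tile shapes (largest first) until a forward pass fits into memory."""
    model = model.to(device).eval()
    for shape in candidate_shapes:               # e.g. generated from min + n * step, large n first
        try:
            with torch.no_grad():
                model(torch.zeros((1, 1, *shape), device=device))
            return shape                         # first shape that fits on the device
        except RuntimeError:                     # typically "CUDA out of memory"
            torch.cuda.empty_cache()
    return None                                  # nothing fits -> caller could fall back to CPU
```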

k-dominik commented 1 year ago

cool, is this already reflected in the spec somehow - are there models that use this?

FynnBe commented 1 year ago

After some discussion with @k-dominik I think we have these issues:

  1. We currently cannot validate if the halo is sufficient.
  2. Wrong per-sample normalization. Because per-dataset normalization was not implemented, we decided to (wrongly) specify per-sample normalization for many models. I think we are better off specifying per-dataset normalization and warning users in software that doesn't implement it and falls back to "per-sample" normalization (or however else the software wants to treat this case).
  3. Validating 2.

and these proposed solutions:

  1. Test tiling (hear me out 🙏): a) for a fixed shape, the test input is one tile -> create a tiling test input by padding (mirror padding as default? add a field describing how to pad?); b) for a parametrized shape, enforce that the test input is at least 2 minimal tiles + halo big. In both cases, do tiling with the minimal/fixed shape and the specified halo, and add a field describing the error tolerance (absolute/relative error and maybe the % of pixels that may be ignored, if needed?) -> with a sufficiently large halo a reasonable tiling error should be achievable. And if it is unreasonably high, then it is at least documented how big the error is when using tiling. See the sketch after this list.
  2. We update models that should rather use "per-dataset" normalization. Software can just warn and fall back to "per-sample", but then there is at least a warning (see the normalization sketch further below).
  3. Throw a validation error if the minimal shape + halo is small (maybe compared to the test input) and per-sample normalization is chosen.
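A rough sketch of what the tiling test from 1. could look like (all names and the tolerance fields are illustrative, and it assumes a single-channel model that maps a tile to an output of the same shape; this is not current spec or core-library behaviour):

```python
import itertools
import numpy as np

def predict_tiled(predict, image, tile_shape, halo):
    """Tile-wise prediction: run `predict` on tiles padded by `halo`, keep only the inner part."""
    out = np.zeros(image.shape, dtype="float32")
    padded = np.pad(image, [(h, h) for h in halo], mode="reflect")  # mirror padding as default
    for start in itertools.product(*[range(0, s, t) for s, t in zip(image.shape, tile_shape)]):
        inner = tuple(slice(b, min(b + t, s)) for b, t, s in zip(start, tile_shape, image.shape))
        # the same tile in padded coordinates, including the halo on both sides
        outer = tuple(slice(b, b + (sl.stop - sl.start) + 2 * h)
                      for b, sl, h in zip(start, inner, halo))
        pred = predict(padded[outer])
        crop = tuple(slice(h, h + (sl.stop - sl.start)) for h, sl in zip(halo, inner))
        out[inner] = pred[crop]
    return out

def tiling_test(predict, test_input, test_output, tile_shape, halo,
                rtol=1e-3, atol=1e-4, max_bad_fraction=0.0):
    """Compare the tiled prediction against the reference test output within a tolerance."""
    tiled = predict_tiled(predict, test_input, tile_shape, halo)
    bad_fraction = float((~np.isclose(tiled, test_output, rtol=rtol, atol=atol)).mean())
    return bad_fraction <= max_bad_fraction, bad_fraction
```

With the tolerance (rtol/atol/max_bad_fraction) exposed as fields in the RDF, an insufficient halo would simply make this test fail.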

We would still benefit from some form of dry-run to get better estimates for tiling, but that would mostly be a performance issue, not primarily a quality one. Having it in core makes sense. We might be able to reuse some lines from tiktorch.
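(To make the distinction in 2. explicit - a simplified sketch of zero-mean/unit-variance normalization; the actual preprocessing spec has more parameters, e.g. axes and eps:)

```python
import numpy as np

def normalize_per_sample(tile, eps=1e-6):
    # statistics are computed from the current sample/tile only -> small tiles give noisy statistics
    return (tile - tile.mean()) / (tile.std() + eps)

def normalize_per_dataset(tile, dataset_mean, dataset_std, eps=1e-6):
    # statistics are computed once over the whole dataset and reused for every sample/tile
    return (tile - dataset_mean) / (dataset_std + eps)
```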

constantinpape commented 1 year ago

What you outline sounds good @FynnBe.

(One minor comment: it would be nice to make model upload easier / more automatic, if this requires changing things in the model. But that shouldn't keep us from improving things ofc.)

FynnBe commented 1 year ago

...if this requires changing things in the model

The "per-sample"->"per-dataset" change could be overridden in the collection only; no reupload needed -- it's just metadata.

... make model upload easier / more automatic

We (led by @oeway) plan to improve model upload anyway, with dedicated cloud space to upload a draft that can be validated directly and iterated on. Only when full validation passes would we publish to Zenodo. That's the rough plan.