bioimage-io / core-bioimage-io-python

Python libraries for loading, running and packaging bioimage.io models
https://bioimage-io.github.io/core-bioimage-io-python/
MIT License

Validate model output with precision relative to the values of the data #416

Closed: melisande-c closed this issue 1 month ago

melisande-c commented 1 month ago

Problem

Some datasets have large values, so when validating that an output matches the given expected output, it doesn't always make sense to use decimal precision as the metric. Say, for example, a dataset has values on the order of 1e5: the output could then match the expected output to 5 significant figures and still fail the validation test, even if the decimals are checked only up to the first position past the decimal point.
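For illustration, a minimal made-up example of the pattern (the numbers are not from a real dataset):

```python
import numpy as np

# output on the order of 1e5 that matches the expected output to
# ~7 significant figures (relative error ~4e-7)
expected = np.array([123456.7])
actual = np.array([123456.75])

np.testing.assert_allclose(actual, expected, rtol=1e-5)      # passes
np.testing.assert_almost_equal(actual, expected, decimal=4)  # fails: |diff| = 0.05 >= 1.5e-4
```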

Solution

It would be desirable to be able to specify something like the number of significant figures or some kind of relative tolerance to validate the outputs with.

FynnBe commented 1 month ago

This is implemented with numpy.testing.assert_almost_equal, called in https://github.com/bioimage-io/core-bioimage-io-python/blob/9e1e1fe6c7a916411c644590b91ec57d1ed06c18/bioimageio/core/_resource_tests.py#L152

So the actual test evaluates abs(desired - actual) < float64(1.5 * 10 ** (-decimal))
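For example (just illustrating the semantics of decimal, not the exact code in core):

```python
import numpy as np

decimal = 4
actual = np.array([0.12340, 0.56780])
desired = np.array([0.12345, 0.56785])

# passes: the element-wise differences are 5e-5 < 1.5 * 10 ** -4
np.testing.assert_almost_equal(actual, desired, decimal=decimal)
```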

The argument name decimal (taken from the numpy function used) can be confusing though, thanks for raising this issue! The testing behavior should be documented!

TODO document decimal argument in:

melisande-c commented 1 month ago

Hi @FynnBe,

Thanks for the quick response; however, this still has the problem that the tolerance on the difference is not relative to the size of the values in the data. Say the output is 12345.6 and the expected output is 12345.7: the difference is 0.1 and so would fail the test for any decimal >= 2, even though by a lot of metrics the values are pretty close.

These differences can arise for a few different reasons, but one situation in which I have noticed small differences in outputs is when using different batch sizes to process the predictions in PyTorch. I think core tests the outputs with both batch size 1 and batch size 2?

This problem has been encountered a few times when trying to export CAREamics models. There is a convenience function export_to_bmz that allows users to export their models with limited arguments, and I am considering computing the decimal precision from the size of the input data values (rough sketch below), but if this could be handled on the core side that would be preferable.
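Roughly what I have in mind (compute_decimal is a hypothetical helper, not existing CAREamics code):

```python
import numpy as np


def compute_decimal(data: np.ndarray, significant_figures: int = 4) -> int:
    """Hypothetical helper: choose `decimal` so that the absolute tolerance
    1.5 * 10 ** (-decimal) roughly corresponds to `significant_figures`
    significant figures of the data's magnitude."""
    magnitude = float(np.max(np.abs(data)))
    if magnitude == 0.0:
        return significant_figures
    order = int(np.floor(np.log10(magnitude)))  # e.g. 5 for values around 1e5
    # e.g. order 5 and 4 significant figures -> decimal = -2, tolerance 1.5e2
    return significant_figures - 1 - order
```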

Thanks, Melisande

FynnBe commented 1 month ago

Oh, of course you're right! Yes, I suppose the cleanest would be to allow specifying an absolute and a relative tolerance, like xarray implements it: https://docs.xarray.dev/en/stable/generated/xarray.testing.assert_allclose.html
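For reference, such a check evaluates, element-wise, abs(actual - desired) <= atol + rtol * abs(desired), so the relative part scales with the magnitude of the expected values while atol still guards values close to zero. A quick sketch with xarray (made-up numbers):

```python
import numpy as np
import xarray as xr

actual = xr.DataArray(np.array([12345.7, 1e-9]))
desired = xr.DataArray(np.array([12345.6, 0.0]))

# passes: 0.1 <= atol + rtol * 12345.6 and 1e-9 <= atol
xr.testing.assert_allclose(actual, desired, rtol=1e-4, atol=1e-8)
```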

I propose to

This is maybe a good time to consider including these (atol and rtol) in the official spec (instead of having them in the config field).

melisande-c commented 1 month ago

yes, I suppose the cleanest would be to allow to specify an absolute and a relative tolerance like xarray implements it: https://docs.xarray.dev/en/stable/generated/xarray.testing.assert_allclose.html

Just had a quick look and numpy also has testing.assert_allclose with atol and rtol arguments: https://numpy.org/doc/stable/reference/generated/numpy.testing.assert_allclose.html
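With that, the example from before would pass, e.g.:

```python
import numpy as np

# absolute difference is 0.1, but the relative difference is only ~8e-6
np.testing.assert_allclose(np.array([12345.6]), np.array([12345.7]), rtol=1e-4)  # passes
```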

FynnBe commented 1 month ago

PR welcome ;-)

melisande-c commented 1 month ago

I would make a PR, but do you need a deprecation warning for decimal?

FynnBe commented 1 month ago

Yes, we cannot have exceptions raised for a suddenly missing decimal argument. A simple change of the default value to decimal: Optional[int] = None and logging a warning à la if decimal is not None: logger.warning("deprecated in favor of atol") will do.
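Roughly along these lines (just a sketch; the function name and tolerance defaults are placeholders, the actual change would go into _resource_tests.py):

```python
import logging
from typing import Optional

import numpy as np

logger = logging.getLogger(__name__)


def _check_output(  # hypothetical stand-in for the check in _resource_tests.py
    actual: np.ndarray,
    expected: np.ndarray,
    decimal: Optional[int] = None,  # deprecated
    atol: float = 1e-8,
    rtol: float = 1e-4,
) -> None:
    if decimal is not None:
        logger.warning("`decimal` is deprecated in favor of `atol`/`rtol`")
        # reproduce the old absolute-only check for callers still passing `decimal`
        atol, rtol = 1.5 * 10 ** (-decimal), 0.0
    np.testing.assert_allclose(actual, expected, rtol=rtol, atol=atol)
```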