CellProfiler / python-bioformats

Read and write life sciences file formats

Improve image reader selection function #157

Open DavidStirling opened 2 years ago

DavidStirling commented 2 years ago

Fixes #129, Fixes CellProfiler/CellProfiler#3411

This PR addresses a long-standing issue with python-bioformats reading file metadata incorrectly, particularly when inspecting OME-TIFF files. Within CellProfiler this manifested as the Metadata module "seeing" all frames within a multidimensional image as timepoints instead of as C, Z and T series. Example files can be found here for testing.

The core issue was that python-bioformats used a custom strategy in get_image_reader to find the correct reader for a supplied image file. This involved testing filenames against the list of available reader classes over a series of passes aimed at finding the best match. The key objective was to avoid having bioformats open each file and inspect its header to determine whether a given reader was the correct choice, and to decide based on the file extension where possible.

However, the reader selection implementation in bioformats has evolved substantially over the years. Today the OME-TIFF reader (for example) will never be selected when selection runs in extension-only mode. Extension-only matching is also now available as an option within the reader, so the JavaScript implementation from python-bioformats is somewhat redundant. Furthermore, allowing bioformats to open files for inspection no longer carries the performance cost that it once did. In my testing, allowing file inspection resulted in CellProfiler getting the correct reader and metadata without any significant slowdown.

With this in mind, I've revised the reader selection function to use the native bioformats selector, with the option to work in extension-only mode parameterised as the new allow_open_image argument in get_image_reader. I've had this default to True to ensure that the correct reader is selected by default.
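To illustrate why the flag matters, here is a toy model of reader selection in pure Python. It is a sketch under stated assumptions, not the code in this PR or in bioformats: `ToyReader`, `select_reader`, `read_header` and the header check are hypothetical, the reader names are borrowed only as labels, and only the `allow_open_image` argument name comes from this PR.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class ToyReader:
    """Hypothetical stand-in for a reader class (not the real bioformats API)."""
    name: str
    suffixes: Tuple[str, ...]
    # True if a matching suffix alone is enough for this reader to claim a file.
    suffix_sufficient: bool
    # Header test, only consulted when the selector is allowed to open the file.
    header_check: Callable[[bytes], bool]

    def is_this_type(self, filename: str, header: Optional[bytes]) -> bool:
        if not filename.lower().endswith(self.suffixes):
            return False
        if self.suffix_sufficient:
            return True
        # Ambiguous suffix: without header bytes this reader cannot claim the file.
        return header is not None and self.header_check(header)


def select_reader(
    readers: Sequence[ToyReader],
    filename: str,
    read_header: Callable[[str], bytes],
    allow_open_image: bool = True,
) -> Optional[str]:
    """Return the name of the first reader that claims the file, or None."""
    # In extension-only mode the file is never opened, so no header is available.
    header = read_header(filename) if allow_open_image else None
    for reader in readers:
        if reader.is_this_type(filename, header):
            return reader.name
    return None


# Assumed ordering: the more specific reader is tried before the generic one.
READERS = [
    ToyReader("OMETiffReader", (".ome.tif", ".ome.tiff"), False,
              lambda h: b"<OME" in h),
    ToyReader("TiffReader", (".tif", ".tiff"), True, lambda h: True),
]


def fake_header(filename: str) -> bytes:
    """Pretend to read the first bytes of an OME-TIFF file (made-up content)."""
    return b"II*\x00 <OME xmlns='example'>"


with_inspection = select_reader(READERS, "stack.ome.tif", fake_header, True)
extension_only = select_reader(READERS, "stack.ome.tif", fake_header, False)
```

In this toy model, `with_inspection` is `"OMETiffReader"` while `extension_only` falls through to `"TiffReader"`, which mirrors the symptom described above: the specific reader is never chosen unless the selector may inspect the file.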

In a separate PR we should add a CellProfiler setting to revert to the old functionality, which would essentially pass allow_open_image=False into reader requests. This would deliver the same results as the current release, so that anyone who wrote their pipeline to handle the incorrect metadata can still use those workflows.
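As a rough sketch of what that follow-up could look like, the setting might simply be translated into the keyword argument at the point where a reader is requested. The function name and the `use_legacy_reader_selection` preference below are hypothetical and do not exist in CellProfiler today; only `allow_open_image` comes from this PR.

```python
def reader_request_kwargs(use_legacy_reader_selection: bool) -> dict:
    """Map a hypothetical CellProfiler preference onto get_image_reader kwargs.

    Legacy mode means extension-only matching, i.e. allow_open_image=False.
    """
    return {"allow_open_image": not use_legacy_reader_selection}


# Default behaviour proposed in this PR: let the selector inspect the file.
default_kwargs = reader_request_kwargs(False)
# Opt-in compatibility mode for pipelines built around the old metadata.
legacy_kwargs = reader_request_kwargs(True)
```

The resulting dictionary could then be unpacked into the reader request, so the compatibility switch stays in one place.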