PolusAI / bfio

Interface to the Bioformats Java library
MIT License

Requirements versus PyImageJ, python-bioformats, AICSImageIO #19

Closed ctrueden closed 2 years ago

ctrueden commented 2 years ago

The Bio-Formats documentation lists three ways to interface Bio-Formats from Python. See here:

https://docs.openmicroscopy.org/bio-formats/6.8.0/developers/python-dev.html

Those three ways are:

The AICSImageIO and scyjava approaches use JPype. The python-bioformats approach uses python-javabridge, but the COBA team (Broad Imaging Platform & LOCI/Eliceiri lab) is currently investigating either retiring python-bioformats or migrating it to be built on scyjava also.

My question is: what were the motivations behind the development of bfio versus adopting one of these other existing projects?

Totally understandable if it was done naively, or before these other projects were mature enough. That happens all the time in tech of course. But maybe now could be a good time to step back and double check project requirements, to evaluate whether it might be possible to reconcile these overlapping libraries?

Nicholas-Schaub commented 2 years ago

Hey @ctrueden

This is going to be a wall of text because there is a lot of history here. I'm happy to set up a quick meeting to discuss if you're available/willing. In summary, bfio predates the existing functionality in AICSImageIO, predates PyImageJ, and was originally created as a wrapper around python-bioformats to circumvent/modify a number of the issues present in that package. We have discussed merging with AICSImageIO, and what is likely going to happen is that we are going to merge some of our code into their codebase, but we are going to maintain bfio separately for a variety of reasons I'll discuss below.

bfio was originally created as a wrapper around python-bioformats in early 2019 (existing as a folder in polus-plugins) to help our data science teams create interoperable plugins for an NIH funded cloud computing project called Polus, which includes the NIST open source project WIPP. WIPP uses .ome.tif as its standard file format, specifically using tiled chunks to improve performance. At the time, python-bioformats did not handle chunked reading/writing of data well, and its OME model metadata handling was not sufficient for what we needed. We attempted to use tifffile, which was excellent for reading and more performant, but we had similar issues saving data at that time. AICSImageIO did not have bioformats as a part of the source at that time, and had limited functionality relative to where it is now.

Over time, the biggest bottleneck for bfio was slow read/write times to/from .ome.tif, and we were considering creating a new zarr based file format. What we ended up doing is rewriting the codebase for bfio, which is when the code was migrated from polus-plugins to here. Part of that was creation of an optimized ome.tif class that modified code from tifffile to improve read/write times and also allow us to maintain metadata properly. When we did that, we also dissociated from python-bioformats and created our own bioformats bindings through jpype, since we found a large number of issues working with various file formats through python-bioformats. Essentially, python-bioformats had custom code for class selection that caused issues for a number of supported formats. What bfio does now is use bioformats as a catch-all, and we have custom ome tiff and ome zarr classes for reading/writing of data. If we do not have a custom implementation for a specific format, we try to load a file with bioformats. We explicitly try to avoid using Bioformats as a general rule, because we have found the overhead to be pretty high relative to our dedicated readers and writers.
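Editor's sketch of the catch-all pattern described above: dedicated readers are tried first, and Bio-Formats is used only when no dedicated reader matches. The class names, the suffix table, and `select_reader` are hypothetical and chosen for illustration; they are not bfio's actual API.

```python
from pathlib import Path


# Hypothetical reader classes, named for illustration only (not bfio's API).
class OmeTiffReader:
    name = "ome-tiff"


class OmeZarrReader:
    name = "ome-zarr"


class BioFormatsReader:
    name = "bioformats"


# Dedicated readers keyed by filename suffix (assumed suffixes).
DEDICATED_READERS = {
    ".ome.tif": OmeTiffReader,
    ".ome.tiff": OmeTiffReader,
    ".ome.zarr": OmeZarrReader,
}


def select_reader(path):
    """Return a dedicated reader class if one matches, else the Bio-Formats catch-all."""
    name = Path(path).name.lower()
    for suffix, reader in DEDICATED_READERS.items():
        if name.endswith(suffix):
            return reader
    # No dedicated implementation: fall back to the slower, general reader.
    return BioFormatsReader
```

With this layout, `select_reader("plate/img.ome.tif")` picks the dedicated TIFF reader, while a format such as `scan.czi` falls through to the Bio-Formats reader.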

We have been in discussions with the AICSImageIO team, and even considered merging this work with their repo. We are still going to create bindings for our OME Tiff reader/writer classes because they seem to be more performant than what they currently have. However, we have decided to continue maintaining bfio for a variety of reasons. One is that bfio 2.X is relatively polymorphic with numpy, and treats images as memory mapped arrays (something that AICSImageIO also does). This makes adaptation of an existing image processing algorithm easier because we can just replace a numpy array with a bfio object. Also, there is relatively little boilerplate code required to use bfio, lowering the burden on our developers to implement new algorithms.
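Editor's sketch of what "polymorphic with numpy" can mean in practice: an object that accepts numpy-style 2D slicing but only loads the tiles a request touches, so code written for an array also runs on an image too large for memory. `LazyTiledImage`, `load_tile`, and `window_sum` are hypothetical names, and the in-memory `full` array stands in for a tiled file on disk. This is not bfio's implementation.

```python
import numpy as np


class LazyTiledImage:
    """Array-like 2D image that loads one tile at a time (hypothetical, not bfio's class)."""

    def __init__(self, shape, tile_size, load_tile, dtype):
        self.shape = shape            # (rows, cols) of the full image
        self.tile_size = tile_size    # tiles are tile_size x tile_size (edge tiles may be smaller)
        self._load_tile = load_tile   # callable: (tile_row, tile_col) -> 2D array
        self.dtype = dtype

    def __getitem__(self, key):
        # Supports img[r0:r1, c0:c1] with unit step only, to keep the sketch short.
        row_slice, col_slice = key
        r0, r1, _ = row_slice.indices(self.shape[0])
        c0, c1, _ = col_slice.indices(self.shape[1])
        out = np.empty((max(r1 - r0, 0), max(c1 - c0, 0)), dtype=self.dtype)
        t = self.tile_size
        for tr in range(r0 // t, (r1 - 1) // t + 1):
            for tc in range(c0 // t, (c1 - 1) // t + 1):
                tile = self._load_tile(tr, tc)
                # Overlap of this tile with the requested window, in image coordinates.
                rs, re = max(r0, tr * t), min(r1, (tr + 1) * t)
                cs, ce = max(c0, tc * t), min(c1, (tc + 1) * t)
                out[rs - r0:re - r0, cs - c0:ce - c0] = tile[
                    rs - tr * t:re - tr * t, cs - tc * t:ce - tc * t
                ]
        return out


def window_sum(img, r0, r1, c0, c1):
    """An 'existing algorithm': works on a numpy array or any object with the same slicing."""
    return int(img[r0:r1, c0:c1].sum())


# Stand-in for a tiled file on disk: a small array cut into 4x4 tiles on demand.
full = np.arange(42).reshape(6, 7)
lazy = LazyTiledImage(
    full.shape, 4, lambda tr, tc: full[tr * 4:(tr + 1) * 4, tc * 4:(tc + 1) * 4], full.dtype
)
```

Here `window_sum(full, 1, 6, 2, 7)` and `window_sum(lazy, 1, 6, 2, 7)` go through the same code path; only the object passed in changes.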

So, in summary, we found python-bioformats to be a problematic implementation of bioformats, AICSImageIO was not where it is now when we first started (or we may have used that), and PyImageJ did not exist. If you take a look at our polus-plugins repo, you will actually see we have a large set of projects involved in scraping ImageJ Ops using PyImageJ, so we have pretty good knowledge of what it is capable of doing and are closely watching that (particularly the macro functionality that has an issue with the legacy window manager). We have kept a close eye on AICSImageIO and have been in contact with that team, and plan on contributing. However, at the end of the day, the developer usability of our package and the optimized tools specifically for formats in WIPP are what hold us back from jumping completely on board.

My 2 cents is that python-bioformats should go in the wastebasket if they are considering what to focus on. That is not for lack of appreciation of what they accomplished; when we first started working on bfio, it looked like the best approach from our end. I just find jpype to be a superior and more natural interface to Java, and I think any Python implementation using python-javabridge is going to carry a lot of technical debt. We have also been following scyjava, but I think any tool that directly interacts with bioformats is going to have subpar performance. The data we receive from NIH is generally large, and single images frequently are too large to fit into memory.

We have discussed bfio with the Bioformats team at some ngff meetings. I know they are aware of our work, and I think you will find that some of the issues here are from their team. I think it's a good suggestion to submit a PR to their repo. We have done relatively little advertising about bfio, but we will be changing that soon.

Nicholas-Schaub commented 2 years ago

I should also mention that one thing we tested when creating our own ome tiff reader/writer functions was the reference C++ implementation for OME Tiff. We found it problematic to build, and when we did get it working it ended up being slower than our Python implementation. A discussion with the maintainer of that project suggested that there was a lot of overhead from Bioformats, even in the C++ reference implementation. So, we have a strong preference for sticking with our Python code until we find something with similar performance. The improvements in performance were more than a factor of 2, and that was before we switched compression libraries, which gave us an additional bump in performance. Last we checked, the C++ reference implementation ran about 2x faster than Bioformats, and our Python implementation operates 10-20x faster than Bioformats. This is also one reason we will contribute our code to AICSImageIO, since that has a larger user base and we believe it might get more use and visibility there.

Nicholas-Schaub commented 2 years ago

@ctrueden I and my team would be interested in being a part of discussions for whatever ImageJ/Bioformats have planned moving forward. I think some kind of Bioformats integration will always be a part of bfio, so we are interested in contributing and/or providing input on whatever that might look like.

ctrueden commented 2 years ago

@Nicholas-Schaub Thank you so much for all the detailed explanations and history! It helps a lot to have context surrounding a project, where it came from, and where it is going. :smile:

I'm just going to reply to a few things you said in areas where additional discussion may be fruitful. My goal is just to get a reply down on e-paper before letting this sit for too long, and I hope I don't come off as terse or critical—I'm very supportive of your efforts!

> My 2 cents is that python-bioformats should go in the wastebasket

Yeah, agreed on all points—it was a great project at the time, but JPype is better now. That's why the COBA team is doing some exploratory work to replace it with scyjava inside CellProfiler. It should make it a lot easier to integrate additional Java-based technologies, particularly ImageJ2, into CellProfiler without hassle. (Right now, python-javabridge wants to start a JVM, and JPype wants to start a JVM, and they clash, so we are using multiprocess shenanigans to work around it, and it's complex and error-prone across platforms.)

> We have also been following scyjava, but I think any tool that directly interacts with bioformats is going to have subpar performance.

This statement I don't understand. Just in case it's not clear: the scyjava project is distinct from SciJava—sorry for the confusing naming there. When I say scyjava I'm talking about the Python project which is extremely general. In a nutshell, this project is simply jpype + jgo, i.e. JPype plus Maven capabilities, so that you can easily load remote artifacts at runtime. The PyImageJ project is built on top of scyjava, but you can use scyjava even if you don't care about the ImageJ/ImageJ2/Fiji/SCIFIO/ImgLib2 alphabet soup. You could use it to talk to Bio-Formats in the same way you are using JPype now, because it is still using JPype. The only difference is that you wouldn't need to ship your own JAR files—you can just reference the Bio-Formats JARs via their groupId:artifactId:version coordinates, and it will take care of the rest.
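Editor's sketch of the point about Maven coordinates: with scyjava, the Bio-Formats JARs are named by `groupId:artifactId:version` and resolved at JVM startup, so no JAR files need to ship with the package. The coordinates `ome:formats-gpl:6.8.0` are an assumption (the version is taken from the documentation URL earlier in this thread), and `maven_endpoint` and `open_reader` are hypothetical helper names. Running `open_reader` requires scyjava, a JDK, Maven, and network access on first use.

```python
def maven_endpoint(group_id, artifact_id, version):
    """Join Maven coordinates into the groupId:artifactId:version form."""
    return f"{group_id}:{artifact_id}:{version}"


# Assumed coordinates for the GPL-licensed Bio-Formats readers.
BIOFORMATS_ENDPOINT = maven_endpoint("ome", "formats-gpl", "6.8.0")


def open_reader(path):
    """Start a JVM with Bio-Formats on the classpath and open `path` with loci.formats.ImageReader."""
    import scyjava  # imported lazily so the helper above works without scyjava installed

    # Endpoints must be registered before the JVM starts.
    if BIOFORMATS_ENDPOINT not in scyjava.config.endpoints:
        scyjava.config.endpoints.append(BIOFORMATS_ENDPOINT)
    scyjava.start_jvm()

    ImageReader = scyjava.jimport("loci.formats.ImageReader")
    reader = ImageReader()
    reader.setId(str(path))
    return reader
```

Because scyjava sits on top of JPype, the returned `reader` is an ordinary JPype proxy and can be used the same way as one created from bundled JARs.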

About "subpar performance"—if you are talking about the OME-TIFF read/write times compared to your super-fast implementation, then sure. I assume you also have that problem when you call BF with JPype the way you are doing now, no? I'm not suggesting to replace anything optimized that you did, far from it. Just suggesting that you might find benefit from scyjava due to its jgo-related part.

> I and my team would be interested in being a part of discussions for whatever ImageJ/Bioformats have planned moving forward.

Cool! There are really two separate teams involved here:

> our Python implementation operates 10-20x faster than Bioformats

Hmm, I/O is I/O, and language choice doesn't affect the performance too much as far as I'm aware, so this suggests to me that there is a lot more optimization that could be done on the Bio-Formats side. Maybe the Bio-Formats developers could take some inspiration from what you did in Python. But I totally get why you wouldn't need to drive that forward yourself—your project is in Python and now you have a fast pure-Python solution, so there's not much incentive for you to worry about other language codebases!

Ahh, there is probably more goodness to discuss, but I have to run for the moment. If I missed answering any questions you asked above, just ping back and I'll respond again! 😅

Nicholas-Schaub commented 2 years ago

So...we have a pretty intimate understanding of scyjava because of how it is implemented for pyimagej. If I remember correctly, we had to do a bit of hackery to get bfio and scyjava to play well together, which required browsing the internals a bit. My comment about scyjava and bioformats alluded to the exact point that you made: that it would be used to interface with bioformats, and that presents a performance issue. Not because of scyjava, but because bioformats has performance issues that we experience in our own jpype implementation. This comment was made primarily in the context of "would we get rid of bfio to use/contribute to an alternative solution". I think we would, but bfio would wrap whatever that other interface with Bioformats would be under the hood. I'd much rather use another project's full implementation of Bioformats than have to do the grunt work completely on our own, which is why we would be interested in commenting/contributing on whatever that solution might be.

I would love to do a video call to hash things out. We have been turning ImageJ ops into plugins for our cloud computing framework, and because this covers several hundred ops, we have run into a number of issues with the way pyimagej does conversions and have had to create custom solutions using imglyb. I have one full-time developer working only on that stuff, and I'm sure there are things we could contribute back to pyimagej. One of our missions is to support open source projects like pyimagej.

I'll join the pyimagej gitter and will get my full-time developer to join as well. Maybe we can coordinate a meeting there with whoever is on that team.

Nicholas-Schaub commented 2 years ago

Our approach on this is going to be the following:

  1. Switch from bioformats-jar to scyjava
  2. Continue to maintain our custom readers/writers.