DIPjavaio to handle files with multiple images

crisluengo commented 7 months ago

In Bio-Formats, reader.getSeriesCount() will return the number of images in the file, and reader.setSeries(i) will configure the reader to start reading the image number i.

We could add a parameter to dip::ImageReadJavaIO():

FileInformation ImageReadJavaIO(
      Image& out,
      String const& filename,
      String const& interface = bioformatsInterface,
      dip::uint imageNumber = 0
);

to indicate which image to read, and with a default value of 0. It being at the end is ugly but won't break code that currently uses this function.
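To illustrate why a trailing defaulted parameter leaves existing callers untouched, here is a minimal sketch. The `sketch` namespace, the `Image` and `FileInformation` structs, and the `"bioformats"` string value are stand-ins made up for this example, not the real DIPlib definitions; only the shape of the signature follows the proposal above.

```cpp
#include <cstddef>
#include <string>

namespace sketch {

// Stand-ins for dip::Image and dip::FileInformation (assumptions, not the real types).
struct Image { std::size_t series = 0; };
struct FileInformation { std::string name; std::size_t imageNumber; };

// Placeholder value; the real constant lives in DIPjavaio.
std::string const bioformatsInterface = "bioformats";

// Same shape as the proposed signature: the new parameter comes last and has a default.
FileInformation ImageReadJavaIO(
      Image& out,
      std::string const& filename,
      std::string const& interface = bioformatsInterface,
      std::size_t imageNumber = 0
) {
   (void)interface;          // a real implementation would select the Java interface here
   out.series = imageNumber; // record which image was "read"
   return { filename, imageNumber };
}

} // namespace sketch
```

A call with two arguments, `sketch::ImageReadJavaIO(img, "cells.cif")`, still compiles and reads image 0, whereas selecting another image requires spelling out the interface argument as well, which is the ugliness mentioned above.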

dip::ImageReadTIFF() actually has a dip::Range parameter for the image number, and will concatenate all the images read. I'm not sure this is useful in the generic case of ImageReadJavaIO, which can deal with so many different file types. I'd rather write a new function that populates a std::vector of images. Right now I'm dealing with a CIF file that contains 50k tiny images, several GB altogether, and I don't think it's a good idea to try to read that in one go.

On the other hand, some of these multi-image file formats don't have an index that points to each image, so the reader has to pass through all preceding images before reaching the one you want (see for example TIFF). Calling a reader function for each image is then terrible. For these cases, you really want to initialize the reader once and return an iterator over images. But then the API starts to become quite complex...

Oh, OK, setSeries() should be fast. If so, opening the file could be slow? It's probably still more efficient to read many images in one go than calling dip::ImageReadJavaIO() for each image in a large file.

wcaarls commented 7 months ago

I assume calling dip::ImageReadJavaIO() 50k times has significant overhead, but you can test it. The easiest, API-wise, is to do the same as TIFF, adding an extra dimension to the returned image and forcing all read image series to have the same dimensionality. If this is undesired, returning a std::vector is an option. We can call it dip::ImageReadJavaIOArray(). Either way, the function would accept a dip::Range parameter.
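A rough sketch of the vector-returning variant, to make the idea concrete. Everything here is hypothetical: `Range` is a simplified stand-in for dip::Range (inclusive bounds, positive step only), `Image` is a placeholder struct, and `readOne` is an assumed callback that reads image number i from a file that is already open. Because each element is a separate image, the images are free to differ in size.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

namespace sketch {

// Placeholder for dip::Image; only the sizes matter for this example.
struct Image { std::size_t width; std::size_t height; };

// Simplified stand-in for dip::Range: inclusive bounds, positive step only.
struct Range { std::size_t start; std::size_t stop; std::size_t step; };

// Reads the images selected by `range` into a vector, one at a time.
std::vector<Image> ImageReadJavaIOArray(
      std::function<Image(std::size_t)> const& readOne,
      std::size_t seriesCount,
      Range range
) {
   if (range.step == 0) {
      throw std::invalid_argument("step must be positive");
   }
   if (range.start > range.stop || range.stop >= seriesCount) {
      throw std::out_of_range("range does not fit the number of images in the file");
   }
   // Count-based loop, so the index never steps past `stop`.
   std::size_t const n = (range.stop - range.start) / range.step + 1;
   std::vector<Image> out;
   out.reserve(n);
   for (std::size_t k = 0; k < n; ++k) {
      out.push_back(readOne(range.start + k * range.step));
   }
   return out;
}

} // namespace sketch
```

With a range of `{2, 8, 3}` this reads images 2, 5 and 8. The TIFF-style alternative would instead have to reject the batch whenever two images differ in size.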

Can you expand on your preference for returning a vector? Do you expect to read images with different dimensionalities/properties in other formats, but not in TIFF?

crisluengo commented 7 months ago

TIFF can also have images of different sizes. But I've never seen a TIFF file with 50k images in it. Usually they're related, they're either slices of a 3D image, or they're scaled versions of the same image (a pyramid), etc. We can handle these cases well with the current TIFF reading code in DIPlib.

This CIF file has 50k images of a single cell each. Each image has 6 channels and is tiny (~80 pixels square), but they're all cut to a different size depending on the size of the cell. bfconvert spent almost two hours this morning extracting images and writing a TIFF file for each. That's obviously not doable.

BTW, I didn't post this issue to have you do the work, I just wanted to record my thoughts and hopefully get some good suggestions.

crisluengo commented 7 months ago

@wcaarls Take a look at what I did so far: 28d2fd34eaa509949eff744d1a3a45633a144a1b

wcaarls commented 7 months ago

If the images are different sizes, you would indeed need to return a vector somehow. The code looks good! How is the overhead of reading 50k images like that?

If the overhead is too high, another option (although perhaps not the best one) is to introduce state to the interface, where you first open the image, then read however many images you want, and then close it.

crisluengo commented 7 months ago

This is indeed quite slow. Opening this file takes a second or so every time.

I think reading in a series of images as a vector, specified through a Range argument, will be the simplest solution. I really like the idea of a stateful reader (Bio-Formats itself is designed that way too) but that would probably be a much bigger effort to get right.

I'm thinking there are two options:

1. To BioFormatsInterface add functions Open, SetSeries, ReadSeries, and Close. The current Read would call Open, ReadSeries and Close in succession. The new C++ function to read multiple images could call these functions individually, and put the images into a std::vector.
2. To BioFormatsInterface add a function ReadMany, which reads images into some Java array object. The C++ interface would convert this Java array of images into a std::vector of images.

The issue with option 2 is that we'll be limited by the Java memory. In option 1, Java only reads one image at a time, so it won't be overwhelmed.
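Option 1 could look roughly like the sketch below. This is a C++ mock only: the real BioFormatsInterface is a Java class reached through JNI, and `MockInterface`, `openCount`, and the C++ `ReadMany` helper are names invented for this illustration. It shows the file being opened once per batch while only one image is in flight at any time.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

namespace sketch {

// Placeholder for an image; it only remembers which series it came from.
struct Image { std::size_t series; };

// Mock of the proposed interface (an assumption, not the real Java class).
class MockInterface {
public:
   void Open(std::string const& filename) {
      filename_ = filename;
      series_ = 0;   // a freshly opened file starts at the first image
      isOpen_ = true;
      ++openCount;   // lets the caller see how often the file was opened
   }
   void SetSeries(std::size_t series) {
      RequireOpen();
      series_ = series;
   }
   Image ReadSeries() const {
      RequireOpen();
      return Image{ series_ };
   }
   void Close() { isOpen_ = false; }

   // The existing single-image entry point, expressed with the new primitives.
   Image Read(std::string const& filename) {
      Open(filename);
      Image out = ReadSeries();
      Close();
      return out;
   }

   std::size_t openCount = 0;

private:
   void RequireOpen() const {
      if (!isOpen_) { throw std::logic_error("no file is open"); }
   }
   std::string filename_;
   std::size_t series_ = 0;
   bool isOpen_ = false;
};

// Hypothetical C++ side: one Open for the whole batch, images copied out one by one.
std::vector<Image> ReadMany(
      MockInterface& io, std::string const& filename, std::size_t first, std::size_t count
) {
   std::vector<Image> out;
   out.reserve(count);
   io.Open(filename);
   for (std::size_t k = 0; k < count; ++k) {
      io.SetSeries(first + k);
      out.push_back(io.ReadSeries());
   }
   io.Close();
   return out;
}

} // namespace sketch
```

The sketch does not close the file if a read throws; a real implementation would probably want a guard object for that.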

I have no idea which option is easier to implement... And I don't know if option 1 is most of the way towards the stateful reader?

wcaarls commented 6 months ago

I prefer option 1, which indeed is most of the way there to a stateful reader. The difference (and perhaps advantage) is that the user does not see the statefulness. To implement it, the reader object would become a member of the BioFormatsInterface class, such that it can be reused.

crisluengo commented 6 months ago

Another advantage would be that we can easily implement a dip::ImageReadJavaIOInfo().

DIPjavaio to handle files with multiple images #156