Open crisluengo opened 7 months ago
I assume calling dip::ImageReadJavaIO() 50k times has significant overhead, but you can test it. The easiest, API-wise, is to do the same as TIFF, adding an extra dimension to the returned image and forcing all read image series to have the same dimensionality. If this is undesired, returning a std::vector is an option. We can call it dip::ImageReadJavaIOArray(). In both cases, the function would accept a dip::Range parameter.
Can you expand on your preference for returning a vector? Do you expect to read images with different dimensionalities/properties in other formats, but not in TIFF?
TIFF can also have images of different sizes. But I've never seen a TIFF file with 50k images in it. Usually they're related, they're either slices of a 3D image, or they're scaled versions of the same image (a pyramid), etc. We can handle these cases well with the current TIFF reading code in DIPlib.
This CIF file has 50k images of a single cell each. Each image has 6 channels and is tiny (~80 pixels square), but they're all cut to a different size depending on the size of the cell. bfconvert spent almost two hours this morning extracting images and writing a TIFF file for each. That's obviously not doable.
BTW, I didn't post this issue to have you do the work, I just wanted to record my thoughts and hopefully get some good suggestions.
@wcaarls Take a look at what I did so far: 28d2fd34eaa509949eff744d1a3a45633a144a1b
If the images are different sizes, you would indeed need to return a vector somehow. The code looks good! How is the overhead of reading 50k images like that?
If the overhead is too high, another option (although perhaps not the best one) is to introduce state to the interface, where you first open the image, then read however many images you want, and then close it.
This is indeed quite slow. Opening this file takes a second or so every time.
I think reading in a series of images as a vector, specified through a Range argument, will be the simplest solution. I really like the idea of a stateful reader (Bio-Formats itself is designed that way too) but that would probably be a much bigger effort to get right.
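A minimal sketch of how a range argument could be turned into the list of series to read. SeriesRange and ResolveSeries are hypothetical names used only for illustration; they are not part of DIPlib. The convention that negative indices count from the end is modeled on dip::Range, whose exact behavior should be checked against the DIPlib documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a range argument; this is not dip::Range.
struct SeriesRange {
   long start;        // first series to read; negative values count from the end
   long stop;         // last series to read (inclusive); -1 is the last series
   std::size_t step;  // stride between series
   SeriesRange( long start = 0, long stop = -1, std::size_t step = 1 )
         : start( start ), stop( stop ), step( step ) {}
};

// Converts a range into explicit series indices for a file holding `count` series.
std::vector< std::size_t > ResolveSeries( SeriesRange range, std::size_t count ) {
   if( range.step == 0 ) {
      throw std::invalid_argument( "step must be positive" );
   }
   long n = static_cast< long >( count );
   long first = range.start < 0 ? range.start + n : range.start;
   long last = range.stop < 0 ? range.stop + n : range.stop;
   if(( first < 0 ) || ( first >= n ) || ( last < 0 ) || ( last >= n )) {
      throw std::out_of_range( "series index out of range" );
   }
   long step = static_cast< long >( range.step );
   std::vector< std::size_t > indices;
   if( first <= last ) {
      for( long i = first; i <= last; i += step ) {
         indices.push_back( static_cast< std::size_t >( i ));
      }
   } else {
      // A reversed range walks backwards through the file.
      for( long i = first; i >= last; i -= step ) {
         indices.push_back( static_cast< std::size_t >( i ));
      }
   }
   assert( !indices.empty() ); // `first` itself is always included
   return indices;
}
```

With this convention, a default-constructed SeriesRange selects every series in the file, and SeriesRange( 0, 99 ) selects the first hundred.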
I'm thinking there are two options:
1. In BioFormatsInterface, add functions Open, SetSeries, ReadSeries, and Close. The current Read would call Open, ReadSeries and Close in succession. The new C++ function to read multiple images could call these functions individually, and put the images into a std::vector.
2. In BioFormatsInterface, add a function ReadMany, which reads images into some Java array object. The C++ interface would convert this Java array of images into a std::vector of images.
The issue with option 2 is that we'll be limited by the Java memory. In option 1, Java only reads one image at a time, so it won't be overwhelmed.
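Option 1 could look roughly like the sketch below, written in plain C++ with a fake in-memory file so that it is self-contained. SeriesReader, Image, Read, and ReadMany here are hypothetical stand-ins, not the actual BioFormatsInterface or DIPlib code; the real Open would create the Bio-Formats reader through JNI.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for an image; real code would use dip::Image.
struct Image {
   std::size_t width;
   std::size_t height;
};

// Hypothetical stateful reader mirroring Open/SetSeries/ReadSeries/Close.
class SeriesReader {
   public:
      ~SeriesReader() { Close(); }
      void Open( std::string const& filename ) {
         // The expensive step: a real implementation would open the file here, once.
         filename_ = filename;
         series_ = std::vector< Image >{ Image{ 80, 75 }, Image{ 62, 90 }, Image{ 81, 81 } }; // fake data
         current_ = 0;
         open_ = true;
      }
      std::size_t SeriesCount() const { RequireOpen(); return series_.size(); }
      void SetSeries( std::size_t index ) {
         RequireOpen();
         if( index >= series_.size() ) {
            throw std::out_of_range( "no such series" );
         }
         current_ = index;
      }
      Image ReadSeries() const { RequireOpen(); return series_[ current_ ]; }
      void Close() { series_.clear(); open_ = false; }
   private:
      void RequireOpen() const {
         if( !open_ ) {
            throw std::logic_error( "reader is not open" );
         }
      }
      std::string filename_;
      std::vector< Image > series_;
      std::size_t current_ = 0;
      bool open_ = false;
};

// Single-image read: open, select, read, close in succession.
Image Read( std::string const& filename, std::size_t index = 0 ) {
   SeriesReader reader;
   reader.Open( filename );
   reader.SetSeries( index );
   Image out = reader.ReadSeries();
   reader.Close();
   return out;
}

// Multi-image read: the file is opened once, and only one image is read at a time.
std::vector< Image > ReadMany( std::string const& filename, std::vector< std::size_t > const& indices ) {
   SeriesReader reader;
   reader.Open( filename );
   std::vector< Image > images;
   images.reserve( indices.size() );
   for( std::size_t index : indices ) {
      reader.SetSeries( index );
      images.push_back( reader.ReadSeries() );
   }
   reader.Close();
   assert( images.size() == indices.size() );
   return images;
}
```

Because each element of the returned vector is its own image, the series may have different sizes, which the extra-dimension approach would not allow.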
I have no idea which option is easier to implement... And I don't know if option 1 is most of the way towards the stateful reader?
I prefer option 1, which indeed is most of the way there to a stateful reader. The difference (and perhaps advantage) is that the user does not see the statefulness. To implement it, the reader object would become a member of the BioFormatsInterface class, such that it can be reused.
Another advantage would be that we can easily implement a dip::ImageReadJavaIOInfo().
In Bio-Formats, reader.getSeriesCount() will return the number of images in the file, and reader.setSeries(i) will configure the reader to start reading the image number i.
We could add a parameter to dip::ImageReadJavaIO() to indicate which image to read, with a default value of 0. It being at the end is ugly but won't break code that currently uses this function.
dip::ImageReadTIFF() actually has a dip::Range parameter for the image number, and will concatenate all the images read. I'm not sure this is useful in the generic case of ImageReadJavaIO, which can deal with so many different file types. I'd rather write a new function that populates a std::vector of images. Right now I'm dealing with a CIF file that contains 50k tiny images, several GB altogether; I don't think it's a good idea to try to read that in one go. But on the other hand, some of these multi-image file formats don't have an index that points to each image, and the reader has to pass through all images before the one you want to read (see for example TIFF). So calling a reader function for each image is terrible. For these cases, you really want to initialize the reader, and return an iterator over images. But then the API starts to become quite complex...
Oh, OK, setSeries() should be fast. If so, opening the file could be slow? It's probably still more efficient to read many images in one go than calling dip::ImageReadJavaIO() for each image in a large file.
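For reference, the "initialize the reader, and return an iterator over images" idea mentioned above could be sketched as follows. This is only an illustration under assumed names (SequentialFile, ImageSequence, TotalPixels are all hypothetical, and Image stands in for dip::Image); it shows a forward-only sequence that works with a range-based for loop and keeps just one image in memory at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for an image; real code would use dip::Image.
struct Image {
   std::size_t width;
   std::size_t height;
};

// Hypothetical source without an index: images can only be decoded in file order.
class SequentialFile {
   public:
      explicit SequentialFile( std::vector< Image > images ) : images_( std::move( images )) {}
      bool HasNext() const { return next_ < images_.size(); }
      Image Next() {
         assert( HasNext() );
         return images_[ next_++ ];
      }
   private:
      std::vector< Image > images_; // fake file contents
      std::size_t next_ = 0;
};

// Forward-only view over the remaining images of a SequentialFile.
class ImageSequence {
   public:
      class Iterator {
         public:
            Iterator() = default; // end sentinel
            explicit Iterator( SequentialFile* file ) : file_( file ) { Advance(); }
            Image const& operator*() const { return current_; }
            Iterator& operator++() { Advance(); return *this; }
            bool operator!=( Iterator const& other ) const { return file_ != other.file_; }
         private:
            // Reads the next image, or turns this iterator into the end sentinel.
            void Advance() {
               if( file_ && file_->HasNext() ) {
                  current_ = file_->Next();
               } else {
                  file_ = nullptr;
               }
            }
            SequentialFile* file_ = nullptr;
            Image current_{ 0, 0 };
      };
      explicit ImageSequence( SequentialFile& file ) : file_( &file ) {}
      Iterator begin() { return Iterator( file_ ); }
      Iterator end() { return Iterator(); }
   private:
      SequentialFile* file_;
};

// Example consumer: each image is visited once, in file order.
std::size_t TotalPixels( SequentialFile& file ) {
   std::size_t total = 0;
   for( Image const& img : ImageSequence( file )) {
      total += img.width * img.height;
   }
   return total;
}
```

The sequence is single-pass: once an image has been consumed it cannot be revisited without reopening the file, which is part of what makes this API more complex than returning a vector.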