Test multi-plates support

tcompa commented 2 years ago

Upon parsing of the raw images we are currently identifying all the plates which are available in a given input folder. In current examples there is only one plate, but in principle there could be more. This means that several steps have to be plate-dependent: for instance we need to specify lists of channels/wells/sites for each plate. Later on, during workflow execution, each task should read the lists of channels/wells/sites for the correct plate.

Question: is this really needed? Is there a multi-plate use case where we really want Fractal to handle all of them at the same time?

If there is not a strong motivation for keeping this feature, we propose a simplification where create_zarr_structure only handles a single plate. A simple (optional) argument can be used to select a specific plate, if needed (that is, when there are several plates in the same folder). If this argument is not provided and there is more than one plate, create_zarr_structure would raise an exception and fail.
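
A minimal sketch of this proposed behavior, assuming the plate IDs found in the input folder are already available as a list of strings. The function name select_plate and its plate argument are hypothetical and are not part of create_zarr_structure.

```python
from typing import Iterable, Optional


def select_plate(plates: Iterable[str], plate: Optional[str] = None) -> str:
    """Return the single plate to process, or raise if the choice is ambiguous.

    Hypothetical helper: `plates` are the plate IDs parsed from one input folder,
    `plate` is the optional user-provided selection.
    """
    found = sorted(set(plates))
    if not found:
        raise ValueError("No plates found in the input folder")
    if plate is not None:
        # Explicit selection: it must match one of the parsed plates.
        if plate not in found:
            raise ValueError(f"Plate {plate!r} not found; available: {found}")
        return plate
    if len(found) > 1:
        # No selection and several plates: fail instead of guessing.
        raise ValueError(f"Multiple plates found ({found}); select one explicitly")
    return found[0]
```

With a single plate in the folder the argument can be omitted; with several plates, omitting it fails early rather than silently processing one of them.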

Note that in this way we would lose the automatic parallelization over plates, but this seems a very minor issue since we would be mostly parallelizing things across wells (or even sites, if possible).

jluethi commented 2 years ago

That's a good discussion point. I see the additional complexity and we probably don't want to test many datasets that are multi-plate (even bigger test sets). But we will eventually have multi-plate datasets for bigger screens that are running.

I don't see multi-plate support as our critical use-case for building initial workflows, but I think we should keep that support. If it's helpful, we can create a small multi-plate test set for it, so we know this workflow doesn't break.

What do you think @gusqgm ? In the Pelkmans lab, multi-plate experiments are more of an exception than a norm, but they happen. The Liberalis may have more multi-plate workflows already? And especially given the goals of the platform, it should handle this.

Now regarding the implementation: Will the system handle the case where there are multiple plates, thus multiple zarr files at the moment? Your suggestion for a simplified implementation with a multi-plate flag would make sense, we may anyway need to handle inputs slightly differently there. E.g. often multi-plate images may be stored in separate folders. And I don't think we'd have many jobs that need to parallelize over plates, as long as jobs can parallelize over the wells within multiple plates.

gusqgm commented 2 years ago

Hey @tcompa and @jluethi , I think that having a small multi-plate test dataset to keep track that this does not break is important. Multi-plate acquisition is done quite often in the lab, either in the shape of a screening experiment, multiplexing, or even for the sake of control plates, and so this is something that Fractal will definitely need to be able to address.

Regarding the implementation, I would agree that having one zarr file per plate is sound. It would be useful, however, to add information to each of these files so that a user has a direct overview of which acquisition (plate) has been processed and what is there to be visualized from the data.

tcompa commented 2 years ago

Quick use-case question:

When loading several plates from the same folder, is it safe to assume that the same channels are present for all plates? Or should this check be plate-dependent?

jluethi commented 2 years ago

When loading several plates, they will be coming from different folders. And then be parsed to different Zarr files.

Good question whether we can assume that every plate has the same channels. I'd say that's a reasonable assumption. If a user wants to process multiple plates that have different channels, they can just make multiple Fractal projects. The idea of a Fractal project with a given workflow can only really work for multiple plates if those plates have the same channels. Otherwise, how does the pipeline handle settings in tasks like channel selections in illumination correction or channel selection in segmentation when those channels only exist for a subset?

=> Tl;dr: Yes, we can build it like that. We can add in a check that throws an error otherwise
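
A minimal sketch of such a check, assuming the channels of each plate have already been parsed into a list of strings. The function name check_same_channels and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def check_same_channels(channels_by_plate: Dict[str, List[str]]) -> List[str]:
    """Return the channel list shared by all plates, or raise if any plate differs.

    Hypothetical helper: keys are plate IDs, values are the channels parsed
    for that plate. Channel order within a plate is ignored.
    """
    reference = None
    for plate, channels in sorted(channels_by_plate.items()):
        current = sorted(set(channels))
        if reference is None:
            # The first plate (in sorted order) defines the expected channels.
            reference = current
        elif current != reference:
            raise ValueError(
                f"Plate {plate!r} has channels {current}, expected {reference}"
            )
    if reference is None:
        raise ValueError("No plates provided")
    return reference
```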

tcompa commented 2 years ago

And then be parsed to different Zarr files.

Agreed, each plate goes to a different zarr.

When loading several plates, they will be coming from different folders.

I'm confused. The scope of this issue is whether we should be able to load different plates from the same folder (first sentence in first comment: Upon parsing of the raw images we are currently identifying all the plates which are available in a given input folder). This is the feature which is already present in create_zarr_structure (although it needs a bit of cleaning up), and my question was about removing this feature.

If images for different plates live in different folders, we need to change create_zarr_structure a bit, such that it receives a list of input folders (rather than just one folder), and then this change needs to correctly propagate through tasks (I think it is already working in the current version, but let's double check it). No problem with this change, of course, but let's clarify it.

EDIT: Of course discussing the multiplate support is more interesting than knowing where the images come from, so we should clarify the minor issue of image source first and then keep the discussion open about multiplate testing ;)

jluethi commented 2 years ago

Thanks for flagging this. I didn't read that carefully. I'm not aware of scenarios where there are files of multiple plates in the same folder. The common scenario I am aware of consists of multiple folders, each corresponding to a plate. @gusqgm Are you aware of multi-plate data being in a single folder? Otherwise, that part of the parsing could indeed be removed.

gusqgm commented 2 years ago

I have been checking around with some people in the lab. Even if the user uses a multiplate reader (a.k.a. that robot that has the barcode reader), in the end we have a unique folder pointing to one unique plate folder, and it seems that it is never the case that the Yokogawa software saves all of the plates within the same folder.

However, it is important to stress that this discussion hovers around the isolated case of Yokogawa, i.e. 1 microscope setup. So far I do not have enough intel for the other systems around. That said, it might help us in the future to be able to handle both.

If anyways the users specify the folder paths where the images are to be processed, then it should not matter whether these image folders (aka plates) are living within the same folder or not, correct?

jluethi commented 2 years ago

If anyways the users specify the folder paths where the images are to be processed, then it should not matter whether these image folders (aka plates) are living within the same folder or not, correct?

We should design the system such that the arrangement of multiple folders with respect to each other doesn't matter to the system. For each plate, it will receive a single path pointing to the folder containing that plate's images. I'd say we start with the "plates aren't in the same folder" use case, as it covers all our known Yokogawa setups. If future microscopes have different file structures, we can worry about parsing them once we have example data for them that we need to process :)

gusqgm commented 2 years ago

Ok, let me rephrase here, as I got lost in the definition of plate (which is being defined here as a collection of images and metadata for a particular experiment), sorry for that.

So far, no recording with any of our current microscopes ends up as a single folder containing many different collections of images+meta for many recording rounds. Each plate's data lives in a unique plate folder, which has a path the user can point to. Worst case scenario here would be that many plate folders would live within the same experiment folder, in which case either the user would manually point to each one of the plate folders as datasets, or Fractal would be able to figure it out by looking inside the folder above and seeing a list of valid plate folders inside.

I hope this is clear now!

tcompa commented 2 years ago

I think we now agree, but let me also clarify it from our side. For us a plate is the unique ID of whatever is before the well, in the image filenames - parsed as in https://github.com/fractal-analytics-platform/mwe_fractal/issues/48#issuecomment-1134563060.

All images in /whatever/path/210305NAR005AAN_210416_164828_*.tif belong to plate 210305NAR005AAN and go to 210305NAR005AAN.zarr;
All images in /whatever/path/220304_172545_220304_175557_*.tif belong to plate RS220304172545 and go to RS220304172545.zarr;
All images in /whatever/path/20200812-CardiomyocyteDifferentiation14-Cycle1_*.png belong to plate 20200812-CardiomyocyteDifferentiation14-Cycle1 and go to 20200812-CardiomyocyteDifferentiation14-Cycle1.zarr.

The current plan (not so complex, but let's make it explicit) is that create_zarr_structure can receive a list of (say) three folders, and within each folder there will be images corresponding to a single plate. Then it will create three zarr files. If the three plates living in three different folders have the same ID, we would need to add some additional suffix to the zarr filenames (is this a relevant use-case?).

Worst case scenario here would be that many plate folders would live within the same experiment folder, in which case either the user would manually point to each one of the plate folders as datasets, or Fractal would be able to figure it out by looking inside the folder above and seeing a list of valid plate folders inside.

The first behavior (manually providing a list of folders) should be covered by what I just suggested, and in that case the relative path of folders (e.g. whether they are all in the same folder) becomes irrelevant. The second behavior (the input is a root folder with several subfolders, each one corresponding to a plate) can also be implemented quite easily, but I'd propose that we currently support only one of the two.
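
For illustration, the second behavior could be reduced to the first one by a small discovery step that turns a root folder into a list of plate folders. This is only a sketch under assumptions: the function name find_plate_folders is hypothetical, and "a valid plate folder" is taken to mean an immediate subfolder with at least one .tif or .png file.

```python
from pathlib import Path
from typing import List, Sequence


def find_plate_folders(
    root: str, extensions: Sequence[str] = (".tif", ".png")
) -> List[Path]:
    """List immediate subfolders of `root` that contain at least one image file.

    Hypothetical helper: the validity criterion (presence of an image file)
    is an assumption, not an existing Fractal rule.
    """
    plate_folders = []
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue
        has_images = any(
            f.is_file() and f.suffix.lower() in extensions for f in sub.iterdir()
        )
        if has_images:
            plate_folders.append(sub)
    return plate_folders
```

The returned list could then be passed on exactly like a manually provided list of folders, so only one code path would need to handle multiple plates.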

tcompa commented 2 years ago

Unrelated questions:

Do we ever need to treat multiple acquisitions in the same plate? At the moment we are not supporting this, and we always have a single acquisition.
Is the id value in omero metadata (https://ngff.openmicroscopy.org/latest/#omero-md) the same id value as in plate metadata (https://ngff.openmicroscopy.org/latest/#plate-md)? If not, what is the "OMERO id"? Is it always equal to 1, for us?

jluethi commented 2 years ago

If the three plates living in three different folders have the same ID, we would need to add some additional suffix to the zarr filenames (is this a relevant use-case?).

I don't think it's a case that should come up, because that would potentially also be confusing for the user in the first place. Workaround for the moment: Add a suffix (e.g. _1, _2) if that happens. If we ever hit a case where this should do something different, we can worry about it then :)
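
A minimal sketch of that workaround, assuming one plate ID per input folder, given in the same order as the folders. The function name zarr_names is hypothetical, and numbering every repeated ID from _1 is one possible reading of the suggestion.

```python
from collections import Counter
from typing import List


def zarr_names(plate_ids: List[str]) -> List[str]:
    """Map plate IDs to zarr names, adding _1, _2, ... to IDs that repeat.

    Hypothetical helper: IDs that occur only once keep their plain name.
    """
    totals = Counter(plate_ids)  # how often each ID occurs overall
    seen: Counter = Counter()    # how often each ID has occurred so far
    names = []
    for plate_id in plate_ids:
        if totals[plate_id] > 1:
            seen[plate_id] += 1
            names.append(f"{plate_id}_{seen[plate_id]}.zarr")
        else:
            names.append(f"{plate_id}.zarr")
    return names
```

Note that this sketch does not guard against a plate whose real ID already ends in such a suffix.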

The first behavior (manually providing a list of folders) should be covered by what I just suggested, and in that case the relative path of folders (e.g. whether they are all in the same folder) becomes irrelevant.

Let's implement this behavior first. That is the very generalizable case that does not depend on naming conventions of folder hierarchies etc.

Do we ever need to treat multiple acquisitions in the same plate? At the moment we are not supporting this, and we always have a single acquisition.

Yes! This is what we call "multiplexing", when we acquire multiple so-called "cycles" or acquisitions. In practice, this will mean the user has a folder per acquisition. We would want to parse them into the same plate, as channels that are named differently.

Is the id value in omero metadata (https://ngff.openmicroscopy.org/latest/#omero-md) the same id value as in plate metadata (https://ngff.openmicroscopy.org/latest/#plate-md)? If not, what is the "OMERO id"? Is it always equal to 1, for us?

OMERO IDs would become relevant if we decide to use an OMERO server for image storage (see https://github.com/fractal-analytics-platform/mwe_fractal/issues/72). My understanding is that this ID would then relate to some OMERO server information. For the time being, that can always be set to 1 and we can change that should we actually use an OMERO server backend.

tcompa commented 2 years ago

If the three plates living in three different folders have the same ID, we would need to add some additional suffix to the zarr filenames (is this a relevant use-case?).

I don't think it's a case that should come up, because that would potentially also be confusing for the user in the first place. Workaround for the moment: Add a suffix (e.g. _1, _2) if that happens. If we ever hit a case where this should do something different, we can worry about it then :)

Done in https://github.com/fractal-analytics-platform/mwe_fractal/commit/b3fd38ec5aba96404a8badcb10d0237b4735ebc0.

The first behavior (manually providing a list of folders) should be covered by what I just suggested, and in that case the relative path of folders (e.g. whether they are all in the same folder) becomes irrelevant.

Let's implement this behavior first. That is the very generalizable case that does not depend on naming conventions of folder hierarchies etc.

Partially done in https://github.com/fractal-analytics-platform/mwe_fractal/commit/b3fd38ec5aba96404a8badcb10d0237b4735ebc0: the multi-folder feature is correctly implemented in create_zarr_structure but not in fractal_cmd.py, as we need to understand the best way to transfer some information (namely the path of raw images) from the former to the latter.

Do we ever need to treat multiple acquisitions in the same plate? At the moment we are not supporting this, and we always have a single acquisition.

Yes! This is what we call "multiplexing", when we acquire multiple so-called "cycles" or acquisitions. In practice, this will mean the user has a folder per acquisition. We would want to parse them into the same plate, as channels that are named differently.

Multiplexing is not yet supported in any part of Fractal, but let's discuss it in a different issue when needed.

jluethi commented 2 years ago

I created a small test dataset here: /data/active/fractal/3D/PelkmansLab/CardiacMultiplexing/Multiplate_2x2_singleWell

It contains 2 folders: plate1 and plate2

Each folder contains a single well with 2x2 sites. The data is originally from the same plate; we're just pretending it's from separate plates. The metadata has been faked to be correct for both plates and is in each plate folder, in case we want to test with metadata parsing later as well.

tcompa commented 2 years ago

Thanks @jluethi. Could you also spell out the intended behavior of Fractal on this dataset?

My guess is that we aim at having a single zarr file with a number of channels equal to num_ch_plate1 + num_ch_plate2, and that the channels of the two plates will simply constitute the channels of the single zarr folder. Am I on the right track? Is there something more subtle?

jluethi commented 2 years ago

Great question. My expectation here would actually be that each plate is saved as its own OME-Zarr file, as the OME-Zarr HCS spec describes plates.

That would cover the "multi-plate" case, if it actually concerns multiple distinct physical plates.

Another topic is "multiplexing", i.e. multiple folders of images that belong to the same physical plate. There, the behavior you describe above (=> all channels are separate, but belong to the same OME-Zarr file) should apply. We can discuss this case further in this milestone. I will create a respective issue & also create a test dataset for this.
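
To make the multiplexing case concrete, here is a minimal sketch of how the channels of several acquisitions could be combined into one list for a single plate. The function name merge_cycle_channels and the "cycle_channel" naming scheme are assumptions; the thread only says that the channels should be named differently.

```python
from typing import Dict, List


def merge_cycle_channels(channels_by_cycle: Dict[str, List[str]]) -> List[str]:
    """Combine the channels of all acquisitions (cycles) into one channel list.

    Hypothetical helper: each channel name is prefixed with its cycle name,
    so identical channels from different cycles stay distinct.
    """
    merged = []
    for cycle, channels in channels_by_cycle.items():
        for channel in channels:
            merged.append(f"{cycle}_{channel}")
    return merged
```

The merged list has as many entries as the sum of the per-cycle channel counts, which matches the num_ch_plate1 + num_ch_plate2 expectation raised earlier in the thread.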

jluethi commented 6 months ago

Our converters are not designed for multi-plate handling. The rest of Fractal shouldn't care much, as tasks that run per image don't know what plate the image is on. Developing converters that load multiple plates can be a new issue when it comes up.

fractal-analytics-platform / fractal-tasks-core

Test multi-plates support #49