fractal-analytics-platform / fractal-tasks-core

Main tasks for the Fractal analytics platform
https://fractal-analytics-platform.github.io/fractal-tasks-core/
BSD 3-Clause "New" or "Revised" License

Handling multiplexing datasets with complex or partial inputs #35

Closed: jluethi closed this issue 9 months ago

jluethi commented 2 years ago

In a basic implementation of multiplexing that will work for many standard cases, we expect every cycle (round of acquisition, probably each cycle would be a resource in Fractal parlance) to consist of images of the same regions and each acquisition to have the same number of images per channel.

In that case, we can just match FOV 1 in cycle 2 to FOV 1 in cycle 1 etc. => easy matching by field number as long as the data is complete.
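The index-based matching for the standard case can be sketched as follows (a minimal illustration; the function name and the "F001"-style field labels are assumptions, not the fractal-tasks-core API):

```python
def match_by_index(cycle1_fovs, cycle2_fovs):
    """Pair FOVs across cycles by their field number alone.

    Raises if the two cycles do not contain the same field numbers,
    i.e. the data is incomplete and index matching is not safe.
    """
    if set(cycle1_fovs) != set(cycle2_fovs):
        raise ValueError("Cycles do not contain the same field numbers")
    return {fov: fov for fov in cycle1_fovs}

pairs = match_by_index(["F001", "F002"], ["F002", "F001"])
# pairs == {"F001": "F001", "F002": "F002"}
```

The hard failure on mismatched field numbers is the point: as soon as the data is incomplete, this trivial strategy no longer applies and one of the cases below takes over.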

But there are some more complicated cases. This issue will try to list the relevant categories and we can discuss here which of those will be handled directly within Fractal or what mitigation strategies could be applied.

Summary: I can think of 3 complex input cases. I think we should discuss how we handle case 1, and declare cases 2 & 3 out of scope. @gusqgm @MaksHess Have I missed any cases? Are there important examples of cases 2 & 3 that we should really be supporting? @tcompa What's your opinion on case 1?


1. The cycles have the same number of fields of view, but the indices don't match & there are multiple folders for some cycles

Reason: An acquisition at the microscope (e.g. the new Apricot microscope in the Pelkmans lab) failed and the cycle now consists of 2 rounds of acquisitions, both belonging to the same cycle. (A simpler sub-case may be: a single folder per cycle, but FOV numbers don't match. But I can't think of a setting where this would arise.)

Test case: By @MaksHess, being prepared

Mitigation strategies:
a) The user needs to prepare the data to match expectations. That means creating a new folder with the combined images (e.g. linking images into the new folder using ln) and renaming them to fit the numbering system of cycle 1. The user also provides a pandas dataframe with the relevant metadata that fits the renamed images (see https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/14).
b) Partial adaptation by the user, but Fractal does half the work: the user needs to provide a single input folder (e.g. using the strategy above). That probably means renaming the (linked/copied) files, because otherwise multiple sites are F001. The user also provides a metadata table fitting the renamed files. But the user would not have to worry about matching the field of view numbers, because Fractal takes care of this (by matching ROI locations of cycle 1 with ROI locations of later cycles and mapping cycle 2 FOV X to cycle 1 FOV Y).
c) Fractal handles this natively: going further with the logic in b, Fractal also takes 1 to n folders as the input for each cycle, parses the metadata file for each input folder, combines the metadata (safely) and then does the same matching as in b to match site numbers.
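Strategies b) and c) both hinge on matching FOVs across cycles by their stage position rather than their field number. A minimal sketch of that matching step (function names, units, and the rounding tolerance are illustrative assumptions, not the fractal-tasks-core implementation):

```python
def match_by_position(cycle1, cycle2, decimals=1):
    """Map cycle-2 FOV names to cycle-1 FOV names via stage positions.

    cycle1/cycle2: dicts mapping FOV name -> (x_um, y_um).
    Positions are rounded to `decimals` so tiny metadata jitter
    does not break the exact-key lookup.
    """
    def key(pos):
        return (round(pos[0], decimals), round(pos[1], decimals))

    index = {key(pos): name for name, pos in cycle1.items()}
    mapping = {}
    for name, pos in cycle2.items():
        ref = index.get(key(pos))
        if ref is None:
            raise ValueError(f"No cycle-1 FOV at position {pos}")
        mapping[name] = ref
    return mapping

cycle1 = {"F001": (0.0, 0.0), "F002": (416.0, 0.0)}
cycle2 = {"F001": (416.0, 0.0), "F002": (0.0, 0.0)}  # indices swapped
assert match_by_position(cycle1, cycle2) == {"F001": "F002", "F002": "F001"}
```

Note that this only works if the positions are exact matches up to the rounding tolerance, which is the constraint discussed later in the thread (same .mes file for all cycles).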

Personally, I'd prefer if Fractal only needs to handle a, because it keeps the tasks & input structures simpler. We could write some example scripts or jupyter notebooks to handle this. @tcompa Based on the current dataset structure, how feasible is it to go with route c? I'd be mostly worried about how we handle different resources belonging to the same cycle. But given that we're still defining resources, maybe having 1-n resources per cycle wouldn't actually be a big deal?


2. Not the same number of FOVs per cycle

Reason: A user changed the experimental design in the middle of their multiplexing experiment, or the microscope crashed, leading to a partial acquisition.

My first reaction would be that we don't support this use case (at least at this stage).

Mitigation: A user could create blank image data if just a few images are missing (this is how we handled such cases in TissueMaps) & provide the metadata table like above in b => this can be worked around on the user side. Alternatively, a user may continue acquiring the missing FOVs after a microscope crash, turning this into case 1 again.
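The blank-image mitigation could be sketched like this on the user side (a hedged illustration; it writes .npy files to keep the sketch self-contained, whereas a real workaround would write TIFFs matching the microscope's naming scheme and image shape):

```python
import os
import tempfile

import numpy as np

def write_blank_fov(path, shape=(2160, 2560), dtype=np.uint16):
    """Save an all-zeros image as a stand-in for a missing acquisition."""
    blank = np.zeros(shape, dtype=dtype)
    np.save(path, blank)
    return blank

# Hypothetical missing site "F003": create a placeholder so the cycle
# has the same FOV count as cycle 1.
path = os.path.join(tempfile.gettempdir(), "blank_F003.npy")
img = write_blank_fov(path, shape=(16, 16))
assert img.sum() == 0
```

The accompanying metadata table (as in strategy b) would then list the placeholder's position so position-based matching still works.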

3. FOVs do not have the same locations

Reason: Fields of view may have small shifts. I don't understand yet how this case arises. Can you elaborate @MaksHess? My first reaction is that we would not support this. I also don't see an easy way to work around this from the user side.

An alternative approach to tackle 2 & 3 would be for Fractal to be fully general here, where each cycle has its own ROI definition (and potentially its own coordinate space). In that setting, different cycles could have varying numbers of FOVs, and the positions of those FOVs could also vary. This would make parsing a bit complicated, because we don't know in the first cycle how big the dask array for the well eventually needs to be. And it would make downstream processing very complicated, because any physical location may have actual image data from 0 to n different channels.

MaksHess commented 2 years ago

Thanks for summarizing @jluethi! Here are some thoughts on the individual points. I've also added an example that should cover our use case (individual re-imaged sites with inconsistent filenames) to /data/active/fractal/3D/PelkmansLab/ZebrafishMultiplexingAdvanced. The code showing how I process multiplexing experiments one cycle at a time is in abbott/Maks/deoploy/compress_cv8k_cycle_cluster.py.

1. Inconsistent field names & re-imaged wells

The problem arises because of issues we currently have with the microscope (i.e. some embryos are not imaged properly and need to be re-imaged). The workaround we use is deleting all the affected wells (and/or individual embryos) from the .mes file and running the acquisition again. The resulting files cannot be matched based on filenames, so I mitigated the problem by indexing the positions by (well, x-y-position). This only works if the same .mes file is used, i.e. one cannot run search-first a second time, but I think that's a reasonable constraint on a multiplexing experiment. Although I'm hoping that Yokogawa will be able to mitigate this specific issue at some point, during my experiment the microscope also crashed, which meant I had to run the second part of the acquisition again (also leading to inconsistent filenames). Due to the way I indexed the sites, this issue was also covered.
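The (well, x-y-position) indexing described here could be sketched like this (field names, units, and the rounding tolerance are illustrative assumptions, not the actual code in compress_cv8k_cycle_cluster.py):

```python
def position_index(records, decimals=1):
    """Build a filename lookup keyed by (well, x, y) instead of field name.

    records: iterable of (filename, well, x_um, y_um) tuples, e.g. parsed
    from the microscope metadata. Raises on duplicate positions within a
    well, which would indicate a site imaged twice in the same cycle.
    """
    index = {}
    for fname, well, x, y in records:
        key = (well, round(x, decimals), round(y, decimals))
        if key in index:
            raise ValueError(f"Duplicate site at {key}: {fname}")
        index[key] = fname
    return index

# Re-imaged site with an inconsistent filename still lands on the same key,
# because the same .mes file pins the position.
idx = position_index([("r2_F007.tif", "B03", 0.0, 0.0)])
assert idx[("B03", 0.0, 0.0)] == "r2_F007.tif"
```

Matching a later cycle against cycle 1 then reduces to looking up each of its (well, x, y) keys in the cycle-1 index.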

2. Not the same number of FOVs per cycle

Here's an example of how this could come up: if I have 90% embryo retention per cycle and want to save time on the acquisition of later cycles, I might delete individual sites or whole wells during an experiment. Or I add QC cycles on some wells at the end of an experiment. Again, indexing by well & x-y-position can handle all those cases.

3. FOVs do not have the same locations

This was an issue Shayan brought up: he seems to have some embryos that move slightly during the experiment, which is why he might want to re-run search-first after a couple of cycles. Given that this would add a lot of complexity and we don't yet know how often it happens, I think we can safely ignore it for now.

Regarding the mitigation strategies you proposed: I personally have a strong preference for c (i.e. the possibility of having 1-n folders where the matching of sites happens based on the x-y-positions in the metadata found in MeasurementData.mlf). This is what I've implemented for our current experiments, and it solves both 1 & 2, which gives the user more flexibility during an experiment while keeping complexity reasonable (the only constraint is one search-first run at the beginning of an experiment that defines the positions that can be imaged).

jluethi commented 2 years ago

Thanks @MaksHess for the feedback!

This only works if the same .mes file is used

For context: The .mes files are microscope files that control which positions are acquired. So if the same .mes files are used, the acquired positions are exact matches in well, x & y position, right? I would also limit the scope to this for the moment.

1. I don't think such time savings alone are worth the added complexity in the processing architecture for the moment. Yes, parsing could probably handle this fairly well without large workarounds. But how do we know which ROIs have actual data in which cycle? Do we just have 0s where we have no actual data? Do we still make measurements on all-0 regions? Which processing steps assume that they always have the same channels available? => Parsing is simple, downstream handling could get more complicated. I would suggest we get into this as part of how we handle multiplexing registration, where we may have to tackle the same questions. If we find a good answer to them, we can then also allow more complex inputs to be parsed. Until then, I'd focus on covering part 1.

2. Good, let's be aware that multiple independent search-first runs may (eventually) be of interest, but not approach that right now. Depending on what we figure out for the issues in 2, that may also make this easier or harder :)

I personally have a strong preference for c

Whether we can implement this custom parsing depends on two things from my perspective:
1) Architecture complexity: Can we make the more diverse inputs fit our architecture, e.g. having multiple resources per cycle (while still knowing they belong to the same cycle)? [Side note: the same could apply to time data from long acquisitions that may come as different folders.] => I'll discuss this with @tcompa
2) Time to implement: Depending on the answer to 1), it may be much faster (& more flexible) to implement strategy a) with some notebooks that allow users to preprocess data. But if c) is not too much extra effort, we can look into building this. The actual implementation is less of a Fractal infrastructure job though, as it's fairly custom to a given set of acquisition preferences, metadata structures etc. So we'll have to see where it fits in the priorities and whether it's something we implement together (@MaksHess & me) or whether it's part of building the architecture flexibility for 1).

jluethi commented 2 years ago

Quick note: From an architecture side, it should be feasible to provide multiple inputs per cycle. The task would need to get quite a bit more complex, but it's anyway based on a list of input directories + some metadata.

Conclusion: Let's build the generic multiplexing input first, then see if we can write a special parsing task that would also handle such inputs & do the parsing.

gusqgm commented 2 years ago

Thank you @jluethi and @MaksHess for the nice discussion points!

I had a tab with some notes open a few days ago but, stupidly enough, could not find it again, so I'll restart with some updated thoughts here.

1. Are we also considering the cases where the microscope .mlf file records an acquisition error for particular FOVs and therefore skips them? This is something that can happen with search-first (SF) acquisitions, and it has been observed here from time to time, apparently related to 384-well plates. The errors in the .mlf file may e.g. come from a failed autofocus, leading to messages like 'AF Error' in the line where the next FOV information should be. In those cases the user might also want to reimage the wells on particular cycles / all of them.

2. One way this can happen: a user images with SF and, on a particular cycle, the signal of a channel was not strong enough, so search-first with low magnification discards this FOV for the second pass. Ultimately the data for different cycles will have different numbers of FOVs, and the only way to match them is by looking at their positions in space. Indeed, the point mentioned by @jluethi is important, and we need to know how to deal with having no recorded FOV for a particular cycle when extracting features. How about just adding camera noise? Drogon usually adds camera noise for all SF-discarded FOVs by default, so that the well never has 0s. But in my opinion we could leave the 0s for the FOVs that were never acquired in any cycle and only add camera noise for the FOVs where only particular cycles did not work. It is possibly not the ultimate solution, though.
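The two fill strategies under discussion (plain 0s vs synthetic camera noise) could be sketched as follows; the background mean/std are made-up placeholders, not Drogon's actual values:

```python
import numpy as np

def fill_missing_fov(shape, strategy="zeros", mean=100.0, std=5.0, rng=None):
    """Generate placeholder pixel data for an unacquired FOV.

    strategy="zeros": true zeros (compresses well, no fake signal).
    strategy="camera_noise": Gaussian background noise, clipped to the
    uint16 range, mimicking an empty camera frame.
    """
    rng = rng or np.random.default_rng(0)
    if strategy == "zeros":
        return np.zeros(shape, dtype=np.uint16)
    if strategy == "camera_noise":
        noise = rng.normal(mean, std, size=shape)
        return np.clip(noise, 0, 65535).astype(np.uint16)
    raise ValueError(f"Unknown strategy: {strategy}")

zeros = fill_missing_fov((4, 4))
noisy = fill_missing_fov((4, 4), strategy="camera_noise")
```

Either variant would slot into the same place in the parsing step; the choice between them is the open question debated below.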

3. So far I have not seen a case where there are large shifts between cycles.

All in all, I think the generic multiplexing input is a logical first step to pursue! Eventually we need to be able to handle error messages from the metadata, as this could also aid us in dealing with unimaged FOVs, and potentially in finding better strategies for feature extraction in the presence of missing information (see fractal-analytics-platform/fractal-tasks-core#38).

jluethi commented 2 years ago

Thanks for the additional input @gusqgm. Quick feedback:

1) The variant of case 1 you're describing is also covered in further detail in the issue you mention at the end: https://github.com/fractal-analytics-platform/fractal-tasks-core/issues/38 If we manage to do matching by position, not by FOV index, I think we should have this issue covered. How we handle the absence of FOVs in some cycles then leads into point 2.

2) Yes, the broader question of how to deal with missing data. Running repeated search-first instead of relying on a .mes file that is rerun for every cycle could also produce varying site numbers. But that likely also goes into part 3, where the positions of FOVs no longer necessarily match over cycles, right? I.e. one FOV may be at xy position (120, 150) while the one in the next round would be at position (125, 145)? That, for me, is the very general case of imaging arbitrary parts of a well each cycle and somehow aggregating them. I think that case may be too general to tackle easily, and I would start with enforcing that FOV positions match, even if some FOVs may sometimes be missing. That can be achieved by using the same measurement file for multiple cycles, no? Handling arbitrary acquisitions of subsets of wells about which we know nothing (no consistency between FOVs enforced) becomes very hard, and I would not tackle this now.
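For the shifted-position case mentioned here (e.g. (120, 150) in one cycle vs (125, 145) in the next), exact position keys no longer work and matching would need a distance tolerance. A hypothetical nearest-neighbour sketch (not a Fractal API, and brute-force rather than a spatial index; names and the max_shift value are assumptions):

```python
import math

def match_with_tolerance(cycle1, cycle2, max_shift=10.0):
    """Map cycle-2 FOV names to the nearest cycle-1 FOV within max_shift.

    cycle1/cycle2: dicts of FOV name -> (x, y). Raises if a cycle-2 FOV
    has no cycle-1 FOV within the allowed shift.
    """
    mapping = {}
    for name2, (x2, y2) in cycle2.items():
        best, best_d = None, max_shift
        for name1, (x1, y1) in cycle1.items():
            d = math.hypot(x1 - x2, y1 - y2)
            if d <= best_d:
                best, best_d = name1, d
        if best is None:
            raise ValueError(f"No cycle-1 FOV within {max_shift} of {name2}")
        mapping[name2] = best
    return mapping

c1 = {"F001": (120.0, 150.0)}
c2 = {"F001": (125.0, 145.0)}  # the shifted example from the comment above
assert match_with_tolerance(c1, c2) == {"F001": "F001"}
```

This illustrates why the case adds complexity: the tolerance has to be smaller than the FOV spacing, and a shift larger than that is ambiguous without registration.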

On the "what do we do with background" question: there is a broader discussion to be had, but I'm strongly against just putting in random noise, for 3 reasons. First, it massively reduces compression performance. Second, we'd be pretending that there is information that doesn't actually exist. Third, there is no benefit of having that noise vs. just 0s that I'm aware of (e.g. I tested model performance with & without background subtraction for organoid segmentation, and we don't see any real differences). => Let's open a separate issue if someone wants to discuss this further. Otherwise, I'd default to saving 0s as the default state of the Zarrs.
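The "0s as the default state" idea, sketched with a plain numpy array standing in for the OME-Zarr well array (shapes and values are illustrative): allocate the well pre-filled with zeros and write only the FOVs that were actually acquired. (With Zarr specifically, a fill_value of 0 additionally means that never-written chunks read back as zeros without taking up storage.)

```python
import numpy as np

# Well array indexed as (fov, y, x); everything starts as true zeros.
well = np.zeros((2, 4, 4), dtype=np.uint16)

# Hypothetical acquired data: only FOV 0 was imaged in this cycle.
acquired = {0: np.full((4, 4), 500, dtype=np.uint16)}
for i, img in acquired.items():
    well[i] = img  # missing FOVs (here FOV 1) stay all-zero
```

Downstream tasks can then treat an all-zero region as "no data for this cycle" without any special noise model.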

jluethi commented 9 months ago

Let's open specific issues after the refactor of the task API & the additional flexibility of what to run workflows on: https://github.com/fractal-analytics-platform/fractal-server/issues/792