borevitzlab / timestreamlib

DEPRECATED. Please use the current version of the TimeStream tools at https://gitlab.com/appf-anu/pyts2.
GNU General Public License v3.0

Pot de-randomisation stage #27

Closed chuong closed 10 years ago

chuong commented 10 years ago

The situation is that I have a CSV file that maps plants and their replicates to pot indexes (PotID) across the left and right chambers. I have to group pot images of the same plant into a single column. For now I will write this process as a separate script, but we have to think about a suitable way to integrate it into the pipeline.

Below is an extract from the CSV. The format may change in the future.

EcotypeID_473K,EcotypeID_250K,Name,Replication,Condition,Chamber,Tray,PotID,snp,0=non-Col-0 Allele,1=Col-0 Allele
91,91,jea,1,Coastal,5,03C3,53,C,1,
91,91,jea,2,Coastal,5,11C3,213,C,1,
2290,2290,ste-3,1,Coastal,5,06B2,107,C,1,
2290,2290,ste-3,2,Coastal,5,14B2,267,C,1,
5832,5832,app1-16,1,Coastal,5,01B3,8,G,0,
5832,5832,app1-16,2,Coastal,5,09B3,168,G,0,
5837,5837,bor_1,1,Coastal,5,01C5,15,C,1,
5837,5837,bor_1,2,Coastal,5,09C5,175,C,1,
6008,6008,Duk,1,Coastal,5,01C1,11,C,1,
6008,6008,Duk,2,Coastal,5,10C1,191,C,1,
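To illustrate the grouping, here is a minimal sketch of reading such a CSV and collecting all replicates of a plant together (keyed on `Name`), so that the pots of one plant can later be placed in a single column. The function name `group_replicates` is hypothetical, and the sample data is trimmed to the first eight columns of the extract for brevity.

```python
import csv
import io
from collections import defaultdict

# Sample rows from the extract above, trimmed to eight columns for brevity.
CSV_TEXT = """\
EcotypeID_473K,EcotypeID_250K,Name,Replication,Condition,Chamber,Tray,PotID
91,91,jea,1,Coastal,5,03C3,53
91,91,jea,2,Coastal,5,11C3,213
2290,2290,ste-3,1,Coastal,5,06B2,107
2290,2290,ste-3,2,Coastal,5,14B2,267
"""

def group_replicates(csv_text):
    """Map plant Name -> sorted list of (Replication, Chamber, Tray, PotID)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["Name"]].append(
            (int(row["Replication"]), row["Chamber"], row["Tray"], int(row["PotID"]))
        )
    # Sort each plant's pots by replication number for a stable column order.
    return {name: sorted(pots) for name, pots in groups.items()}

groups = group_replicates(CSV_TEXT)
```

The replicates of each plant (here in different trays of the same chamber, in general in different chambers) end up under one key, which is exactly the information the de-randomisation step needs.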

Joelgranados commented 10 years ago

The question of where to put this is an interesting one. I encountered a similar problem with the "output features to csv" pipeline component. I think the best thing to do here is to modify the pipeline class and add a way to define things that get executed once the whole pipeline has finished. We can put the derandomizer and the feature calculations there. What do you think?
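The idea of "things that get executed once the whole pipeline has finished" could look roughly like the sketch below. All names (`Pipeline`, `add_finaliser`) are illustrative, not timestreamlib's real API.

```python
# Hypothetical sketch: a pipeline with per-image components plus a list of
# "finalisers" that run exactly once, after the whole stream is processed.
class Pipeline:
    def __init__(self, components):
        self.components = components   # run on every image
        self.finalisers = []           # run once at the end

    def add_finaliser(self, func):
        self.finalisers.append(func)

    def run(self, images):
        results = []
        for img in images:
            for comp in self.components:
                img = comp(img)
            results.append(img)
        # The derandomizer or the CSV feature writer would be plugged in here.
        for func in self.finalisers:
            func(results)
        return results

collected = []
p = Pipeline([lambda x: x * 2])           # stand-in per-image component
p.add_finaliser(lambda res: collected.append(sum(res)))
out = p.run([1, 2, 3])
```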

Joelgranados commented 10 years ago

Mail shared offline with chuong:

One problem with putting such processing inside the pipeline is that it involves two input timestreams (left and right chamber) simultaneously, at the same timestamp.
The replicates of a plant are generally not in the same chamber.

One thing we can do is to create a pipeline stage that partially performs de-randomisation wherever possible, and then join the partial de-randomisation information using another program.
However, I find this a bit clumsy. The de-randomisation should not be part of the current image processing pipeline; it is more of a visualisation and data management step.
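The two-stream requirement described above can be sketched as a join on timestamps: only frames present in both chambers at the same time can be paired. The dict-based streams and the `join_streams` helper are hypothetical stand-ins for the real TimeStream objects.

```python
# Illustrative sketch: pair frames from two chambers by shared timestamp.
def join_streams(left, right):
    """Yield (timestamp, left_frame, right_frame) for timestamps present in both.

    `left` and `right` are dicts mapping timestamp -> frame; stand-ins for
    real timestream objects.
    """
    for ts in sorted(left.keys() & right.keys()):
        yield ts, left[ts], right[ts]

left = {"2014-08-05_10:00": "L1", "2014-08-05_10:30": "L2"}
right = {"2014-08-05_10:00": "R1", "2014-08-05_11:00": "R2"}
pairs = list(join_streams(left, right))
```

A de-randomisation stage would consume these pairs; frames missing from one chamber are simply skipped, which is one reason the join feels awkward inside the single-stream pipeline.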

Joelgranados commented 10 years ago

I think we should still put this in a pipeline component. We need to think of a way to configure the execution so that it can receive 2 (or more?) input streams. Here is how it might be executed: pipeline_demo.py --derandomize-config=derandomize.yml inputstream1 inputstream2...

1) In pipeline_demo we have 2 types of execution: a) "normal", with just one timestream, and b) "multiple", with multiple timestreams. We might want to add more types as we move forward.

2) The yaml configuration can define what individual pipeline is run on each timestream. I'm guessing that you would need to run the tray and pot detectors on each pipeline before doing the derandomization. These two (tray and pot detectors) should look for the saved pot positions in the _data file to avoid calculating them twice.

3) The de-randomization component: we can still code this as a pipeline component. Just create a component that expects two timestreams, works with them to de-randomize, and is called from the "multiple" execution type.

4) The derandomization component should know what to do with each stream and where to put the results, and everything should be configured from the yaml file.
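A hypothetical shape for the `derandomize.yml` file in the invocation above might be the following; every key here is illustrative, not an existing timestreamlib option.

```yaml
# Hypothetical derandomize.yml sketch; keys are illustrative only.
execution: multiple            # "normal" (one stream) or "multiple"
streams:
  - name: chamber-left
    components: [undistort, colorcarddetect, traydetect, potdetect]
  - name: chamber-right
    components: [undistort, colorcarddetect, traydetect, potdetect]
derandomize:
  mapping: pot_mapping.csv     # PotID -> plant mapping, as in the CSV extract
  output: derandomized-ts      # where the regrouped timestream is written
```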

What do you think? IMO this is good not only for the derandomization but also for other purposes, like statistics (area, perimeter, ...) calculation.

chuong commented 10 years ago

Hi Joel,

The problem is where to execute derandomisation and how it interacts with the rest of the processing pipeline. I believe we should not combine derandomisation into the same running level as the other operations in the current processing pipeline. Instead, we need to create a higher-level pipeline which contains:

  1. Image processing pipeline: undistortion, color correction, tray/pot detection, segmentation, feature extraction. This is the current processing pipeline.
  2. Data analysis pipeline: derandomisation and other data analyses. Derandomisation using ground-truth information applies to corrected images, segmented images, and the corresponding extracted feature data. After derandomisation, the extracted data can be further analysed to get the desired information. Furthermore, machine learning techniques can be used to derandomise so that phenotypes can be identified.
  3. Data visualisation pipeline: plotting and GUI interface.
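The three-level structure proposed above can be sketched as three callables chained by a top-level runner. Everything here is illustrative (the real stages operate on images and feature tables, not strings); the point is only the separation of levels.

```python
# Minimal sketch of the proposed higher-level pipeline; all names are
# illustrative, not timestreamlib's real API.
def image_processing(raw_frames):
    # Level 1: undistortion, colour correction, tray/pot detection,
    # segmentation, feature extraction. Faked here as a simple transform.
    return [f.upper() for f in raw_frames]

def data_analysis(processed):
    # Level 2: derandomisation and further feature analysis.
    return sorted(processed)

def visualisation(analysed):
    # Level 3: plotting / GUI; here just a summary string.
    return ",".join(analysed)

def run_all(raw_frames):
    # The higher-level pipeline chains the three lower-level pipelines.
    return visualisation(data_analysis(image_processing(raw_frames)))

result = run_all(["b1", "a2"])
```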

Joelgranados commented 10 years ago

Some stuff that we discussed in the 05-Aug-2014 meeting (Kevin, Chuong, Joel).

1) This needs to be separate from the general pipeline execution, because of the interactive nature of the use case.

2) We want to run this after most of the timestream has been created.

3) We want an interactive way for the user to show us where the plants are and how to derandomize them. This can be achieved with a GUI that writes the plant ID on top of chamber images and allows the user to change the information.

4) Assuming that all plants have a unique ID, we can derandomize within an experiment but also across various experiments.

Did I forget anything?

Joelgranados commented 10 years ago

All four points are implemented in https://github.com/Joelgranados/timestreamlib/tree/next