BodenmillerGroup / gc3apps

Repository of all GC3Pie applications and utilities for bblab
GNU General Public License v3.0

`gcp_pipelines` does not work with pipelines using grouping #4

Open votti opened 5 years ago

votti commented 5 years ago

Problem:

Pipelines can use grouping, which requires certain images to be processed together. This is reflected in the group.json file that can be printed from CellProfiler via the --print-groups command. In that file each ImageGroup is an entry in the top-level list, and each ImageGroup in turn has a list of its associated ImageSets. Example with 2 groups (group 1: image 1; group 2: images 2 and 3):

[[{"Metadata_date": "20180527", "Metadata_plate": "102"}, [1]], [{"Metadata_date": "20180531", "Metadata_plate": "104"}, [2, 3]]]

(This is equivalent to the example_grouping example from the test data.)

Currently, the code only checks the number of entries in this JSON and assumes that the number of groups equals the number of images (https://github.com/BodenmillerGroup/gc3apps/blob/2c92a8d36cef49e389692d5182c8ca066f1cbaf4/gc3apps/gcp_pipeline.py#L117).
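To make the mismatch concrete, here is a minimal sketch (not the code in gcp_pipeline.py) that parses the grouping output quoted above and compares the number of groups with the number of image sets. The function name `count_groups_and_images` is made up for illustration, and the sketch assumes every entry has the `[metadata, [image set numbers]]` shape shown in the example.

```python
import json

# Grouping output quoted in this issue: 2 groups, 3 image sets.
GROUPS_JSON = """
[[{"Metadata_date": "20180527", "Metadata_plate": "102"}, [1]],
 [{"Metadata_date": "20180531", "Metadata_plate": "104"}, [2, 3]]]
"""


def count_groups_and_images(groups_json):
    """Return (number of groups, number of image sets).

    Hypothetical helper; assumes each entry is [metadata, [image set numbers]].
    """
    groups = json.loads(groups_json)
    n_groups = len(groups)
    # Counting image sets means summing the length of each group's list,
    # not counting the top-level entries.
    n_images = sum(len(image_sets) for _metadata, image_sets in groups)
    return n_groups, n_images


n_groups, n_images = count_groups_and_images(GROUPS_JSON)
print(n_groups, n_images)  # 2 3
```

Taking only the length of the top-level list yields 2 here, while the pipeline actually contains 3 image sets.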

Thus, for this example, gcp_pipeline assumes the pipeline contains only 2 images and processes images 1-2. When parallelizing, the job is split into two batches that process images 1-1 and 2-2. In both cases image 3 is never processed.

The correct behaviour would be: without parallelization, process images 1-3; with parallelization, make at most 2 batches and process images 1-1 and 2-3.
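One way to get this behaviour, sketched here under stated assumptions rather than taken from the actual fix: build one image range per group, then merge neighbouring groups when fewer batches are requested. The name `group_batches` and its `max_batches` parameter are hypothetical, and the sketch assumes image set numbers are contiguous and ascending across groups, as in the example above.

```python
import json

GROUPS_JSON = """
[[{"Metadata_date": "20180527", "Metadata_plate": "102"}, [1]],
 [{"Metadata_date": "20180531", "Metadata_plate": "104"}, [2, 3]]]
"""


def group_batches(groups, max_batches=1):
    """Return (first, last) image set ranges that never split a group.

    Hypothetical helper. Assumes image set numbers are contiguous and
    ascending across groups, so merging neighbouring groups gives one range.
    """
    if max_batches < 1:
        raise ValueError("max_batches must be at least 1")
    # One (first, last) range per non-empty group.
    ranges = [(min(sets), max(sets)) for _metadata, sets in groups if sets]
    if not ranges:
        return []
    # There can never be more batches than groups.
    n_batches = min(max_batches, len(ranges))
    batches = []
    for i in range(n_batches):
        # Contiguous chunk of groups assigned to batch i.
        chunk = ranges[i * len(ranges) // n_batches:(i + 1) * len(ranges) // n_batches]
        batches.append((chunk[0][0], chunk[-1][1]))
    return batches


groups = json.loads(GROUPS_JSON)
print(group_batches(groups, max_batches=1))  # [(1, 3)]
print(group_batches(groups, max_batches=2))  # [(1, 1), (2, 3)]
```

The sketch balances batches by number of groups, not by number of images. Balancing by image count would be a reasonable alternative when group sizes vary widely.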

sparkvilla commented 5 years ago

I just modified the code and tested it against example_csv and example_grouping. It should be OK now for both of them. You can find the results under /mnt/output/20190306_csv_par and /mnt/output/20190306_grouping_par. I will commit and wait for @smaffiol to push.