ScalableCytometryImageProcessing / SCIP

Scalable Cytometry Image Processing (SCIP) is an open-source tool that implements an image processing pipeline on top of Dask, a distributed computing framework written in Python. SCIP performs projection, illumination correction, image segmentation and masking, and feature extraction.
https://scalable-cytometry-image-processing.readthedocs.io/en/latest/
GNU General Public License v3.0

Smart indexing on Dask Dataframe #20

Closed MaximLippeveld closed 2 years ago

MaximLippeveld commented 3 years ago

Setting the index on our Dask dataframe could be useful for us. For instance, samples from different patients could be collected at many timepoints during follow-up (e.g. a blood test every week). Setting the dataframe index to this timepoint column sorts the data by timepoint, which allows us to quickly select timepoints for downstream analysis.

Setting the index also repartitions the data so that all rows with the same index value end up in the same partition. If the index is set to patient id, for instance, we can compute analyses for all data per patient using map_partitions.

The index column should be set by the user in a config setting.

MaximLippeveld commented 2 years ago

This proposal is not generic enough. We will not be doing much downstream analysis within SCIP anyway.

Setting the index is now also deferred to the final collection phase for better runtimes.