Scalable Cytometry Image Processing (SCIP) is an open-source tool that implements an image processing pipeline on top of Dask, a distributed computing framework written in Python. SCIP performs projection, illumination correction, image segmentation and masking, and feature extraction.
Calling to_parquet or to_csv on a Dask DataFrame creates one file per partition. This is useful for intermediate checkpointing of the pipeline state, but not for exporting the data, which should go to a single large file. If the features dataframe fits in a single node's memory, we can simply collect it into a pandas DataFrame and export from there. If it does not fit, we have to append to the output file batch by batch. Need to look into how this works.