ScalableCytometryImageProcessing / SCIP

Scalable Cytometry Image Processing (SCIP) is an open-source tool that implements an image processing pipeline on top of Dask, a distributed computing framework written in Python. SCIP performs projection, illumination correction, image segmentation and masking, and feature extraction.
https://scalable-cytometry-image-processing.readthedocs.io/en/latest/
GNU General Public License v3.0
7 stars, 0 forks

Option to persist feature dataframes to disk (data format yet to be decided; likely SQLite and Feather) #21

Closed MaximLippeveld closed 3 years ago

MaximLippeveld commented 3 years ago

Calling to_parquet or to_csv on a Dask DataFrame writes one file per partition by default. This is good for intermediate checkpointing of the pipeline state, but not for exporting the data, which should end up in one large file. If the features dataframe fits in a single node's memory, we can just collect it into a pandas DataFrame and export from there. If it doesn't fit, we have to append to the output file batch by batch. Need to look into how this works.
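
A minimal sketch of the batch-by-batch option, assuming SQLite as the output format and pandas' `to_sql` with `if_exists="append"` as the append mechanism. The helper names `export_batches_to_sqlite` and `iter_partitions` and the table name `features` are hypothetical and not part of SCIP; this is one possible approach, not the implementation the project settled on.

```python
import sqlite3
from typing import Iterable

import pandas as pd


def export_batches_to_sqlite(
    batches: Iterable[pd.DataFrame], path: str, table: str = "features"
) -> int:
    """Append each batch to one SQLite table and return the row count written.

    Hypothetical helper: the name and the default table name are assumptions.
    """
    n_rows = 0
    con = sqlite3.connect(path)
    try:
        for batch in batches:
            # if_exists="append" creates the table on the first batch and adds
            # rows on later ones, so only one batch is held in memory at a time.
            batch.to_sql(table, con, if_exists="append", index=False)
            n_rows += len(batch)
        con.commit()
    finally:
        con.close()
    return n_rows


def iter_partitions(ddf):
    """Yield the partitions of a Dask DataFrame as pandas DataFrames, one by one."""
    for i in range(ddf.npartitions):
        # Each partition is computed on its own instead of collecting everything.
        yield ddf.partitions[i].compute()
```

With a Dask DataFrame `ddf`, this could be called as `export_batches_to_sqlite(iter_partitions(ddf), "features.sqlite")`. When the dataframe does fit in memory, `ddf.compute()` followed by a regular pandas writer covers the simpler case.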