angelolab / ark-analysis

Integrated pipeline for multiplexed image analysis
https://ark-analysis.readthedocs.io/en/latest/
MIT License

Save `cluster_counts_size_norm` as it is being updated #992

Closed HPiyadasa closed 1 year ago

HPiyadasa commented 1 year ago

This is Candace on Hadeesha's account again. @cliu72

Is your feature request related to a problem? Please describe.

After the Pixie refactoring, `cluster_counts_size_norm` is no longer saved to a feather file until the very end of the notebook. This is a problem for users with large datasets who cannot run the entire notebook in one sitting: because `cluster_counts_size_norm` isn't saved, there is no way to map cells to cell clusters without running the entire notebook again.

Describe the solution you'd like

Save `cluster_counts_size_norm` to the feather file after clustering and metaclustering are done (and before the end of the notebook).

alex-l-kong commented 1 year ago

@cliu72 we should add better fallback recovery for this to the cell clustering notebook as well. Here's one way we could do it:

  1. If cluster_counts_size_norm_path already exists, load the file at the beginning instead of calling create_c2pc_data again. Same with weighted_cell_channel_path.
  2. In cluster_cells, if cell_pysom.cell_data already has a cell_som_cluster column attached, return the data as is (and don't re-run the clustering step as we currently do). Add an explicit feather.write_dataframe command at the end of this cell.
  3. Same as step 2, but for cell_consensus_cluster and the cell_meta_cluster column: if the column is already attached, return the data as is. Add an explicit feather.write_dataframe command at the end of this cell.
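
The three steps above can be sketched roughly as follows. This is only an illustration of the load-if-exists and skip-if-already-clustered pattern, not the actual ark-analysis code: `load_or_compute` and `needs_clustering` are hypothetical helper names, and the `load`/`save` callables are placeholders for whatever the notebook uses to read and write feather files.

```python
from pathlib import Path
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def load_or_compute(
    path: str,
    compute: Callable[[], T],
    load: Callable[[str], T],
    save: Callable[[T, str], None],
) -> T:
    """Return the checkpoint at `path` if it exists, else compute and save it.

    Hypothetical helper illustrating step 1; not part of ark-analysis.
    """
    if Path(path).exists():
        # A previous run already produced this file, so reuse it
        # instead of repeating the expensive computation.
        return load(path)
    result = compute()
    # Write immediately so an interrupted session can resume from here.
    save(result, path)
    return result


def needs_clustering(columns: Iterable[str], cluster_col: str) -> bool:
    """Steps 2 and 3: clustering is only needed if the label column is absent."""
    return cluster_col not in columns
```

In the notebook, `save` would presumably be `feather.write_dataframe` and `compute` a function wrapping the `create_c2pc_data` call, while `needs_clustering(cell_data.columns, "cell_som_cluster")` would guard the clustering cell; those wirings are assumptions here.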

Let me know if we're missing anything.

cliu72 commented 1 year ago

@alex-l-kong This looks good to me!

cliu72 commented 1 year ago

@alex-l-kong Oh actually, one thought: I think it'd be good to add an explicit feather.write_dataframe after create_c2pc_data for the size-normalized data (below where we already have that for the unnormalized data). In Hadeesha's experience, create_c2pc_data can take a while, so it'd be good to have that file saved before cluster_cells. We can then overwrite the file at the end of cluster_cells.
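
A minimal sketch of that save-early-then-overwrite order, using only the standard library so it runs anywhere. `write_table`, `read_table`, and `run_with_checkpoints` are hypothetical names, and CSV stands in for feather; the real notebook would call `feather.write_dataframe` at the two marked points.

```python
import csv


def write_table(rows, path):
    """Hypothetical stand-in for feather.write_dataframe: overwrite `path`.

    Assumes `rows` is a non-empty list of dicts sharing the same keys.
    """
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def read_table(path):
    """Read the table back as a list of dicts (all values are strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def run_with_checkpoints(rows, assign_cluster, path):
    # Checkpoint 1: save right after the slow preprocessing step,
    # before any clustering is attempted.
    write_table(rows, path)
    # Clustering step (placeholder): attach a label to every row.
    clustered = [{**row, "cell_som_cluster": assign_cluster(row)} for row in rows]
    # Checkpoint 2: overwrite the same file, now with the cluster labels.
    write_table(clustered, path)
    return clustered
```

If the clustering step fails or the session ends partway through, the file written at checkpoint 1 is still on disk, so the slow preprocessing does not have to be repeated.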