angelolab / ark-analysis

Integrated pipeline for multiplexed image analysis
https://ark-analysis.readthedocs.io/en/latest/
MIT License
70 stars 25 forks source link

Make sure overwrite functionality for pixel clustering propagates into helper functions #1058

Closed alex-l-kong closed 11 months ago

alex-l-kong commented 11 months ago

What is the purpose of this PR?

The overwrite flag wasn't correctly dropping the "pixel_som_cluster" column, which will cause problems during assignment since it will attempt to use that column as an actual feature when mapping onto the SOM.

How did you implement your changes

Pass the overwrite flag into run_pixel_som_assignment.

alex-l-kong commented 11 months ago

Looks good.

Out of curiosity though, where was this causing an error? From a quick glance at the code, I couldn't figure out where the error would come up.

In assign_som_clusters (https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/cluster_helpers.py#L260-L278), it looks like only the column names that were in the normalization/weights file would be used. Like the functions normalize_data only uses columns from the normalization values, and generate_som_clusters only uses columns from the weights. In generate_som_cluseters:

            cluster_labels.append(map_data_to_nodes(
                self.weights.values.astype(np.float64),
                external_data.loc[
                    i:min(i + 99, external_data.shape[0]), weights_cols
                ].values.astype(np.float64)

Because of the weight_cols, it seems to me that even if the 'pixel_som_cluster' column was still in the data file, it would be ignored.

You're right that leaving in the SOM cluster column shouldn't inherently cause problems. Still not not entirely sure why this was an issue with Patricia's run in particular, it's nice to have as an extra check.

After taking another look at the process it seems there's another problem: when the user needs to overwrite the data in pixel_mat_data, it shouldn't go through an additional round of normalization each time, which is what's actually happening at the moment. We'll need to propagate this further into the actual PixelSOMCluster and PixieSOMCluster classes so we don't renormalize if the overwrite flag is set to True.

alex-l-kong commented 11 months ago

Good catch! Perhaps I'm losing it, but I think that the logic is a bit off.

If overwrite=True, then we do not need to re-normalize, so normalize_data in assign_som_clusters should be False right? Right now, since we're passing normalize_data=overwrite, normalize_data would be True.

Good catch, got a bit lost in the horrors of control flow logic negation.