angelolab / ark-analysis

Integrated pipeline for multiplexed image analysis
https://ark-analysis.readthedocs.io/en/latest/
MIT License

ValueError: pixel_som_clustering.generate_som_avg_files #1111

Closed dcomimmo closed 3 months ago

dcomimmo commented 5 months ago

Hey there,

I've been trying to run pixie locally on 37 FOVs with 22 channels, but I get a ValueError when calling the pixel_som_clustering.generate_som_avg_files function. It seems that it expects at least 100 clusters, while my data generates only 96. Is there a way to change the default settings, or a workaround to write the average expression file?

Error is raised based on the following version:

Metadata-Version: 2.1
Name: ark-analysis
Version: 0.6.6
Python 3.11.5
/opt/miniconda3/envs/ark_env/lib/python3.11/site-packages/ark/phenotyping/pixel_cluster_utils.py:346: UserWarning: Provided num_fovs_subset=100 but only 37 FOVs in dataset, subsetting just the 37 FOVs
  warnings.warn(
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[15], line 13
      2 pixel_som_clustering.cluster_pixels(
      3     fovs,
      4     channels,
   (...)
      9     batch_size=batch_size
     10 )
     12 # generate the SOM cluster summary files
---> 13 pixel_som_clustering.generate_som_avg_files(
     14     fovs,
     15     channels,
     16     base_dir,
     17     pixel_pysom,
     18     data_dir=pixel_data_dir,
     19     pc_chan_avg_som_cluster_name=pc_chan_avg_som_cluster_name
     20 )

File /opt/miniconda3/envs/ark_env/lib/python3.11/site-packages/ark/phenotyping/pixel_som_clustering.py:340, in generate_som_avg_files(fovs, channels, base_dir, pixel_pysom, data_dir, pc_chan_avg_som_cluster_name, num_fovs_subset, seed, overwrite)
    337 # compute average channel expression for each pixel SOM cluster
    338 # and the number of pixels per SOM cluster
    339 print("Computing average channel expression across pixel SOM clusters")
--> 340 pixel_channel_avg_som_cluster = pixel_cluster_utils.compute_pixel_cluster_channel_avg(
    341     fovs,
...
    399     )
    401 # now compute the means using the count column
    402 sum_count_totals[channels] = sum_count_totals[channels].div(sum_count_totals['count'], axis=0)

ValueError: Averaged data contains just 96 clusters out of 100. Average expression file not written. Consider increasing your num_fovs_subset value.
cliu72 commented 4 months ago

Hi @dcomimmo! This is a strange error. pixel_som_clustering.generate_som_avg_files determines the number of clusters it expects by looking at the weights file, which is generated during training (https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/pixel_som_clustering.py#L356). Usually, if the SOM outputs fewer than 100 clusters during training, the weights file should have fewer than 100 rows (in this case, 96), and therefore this error shouldn't arise. The error suggests that training returned 100 clusters, but upon inference you're only getting 96 clusters, which is strange. Can you check pixel_som_weights.feather and make sure there are really 100 rows? You can use pd.read_feather to open that file.
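For example, something like this (the file name and location assume the defaults from the notebook; adjust the path to your own base_dir):

import pandas as pd

# Path assumes the default notebook output layout; adjust to your base_dir
weights = pd.read_feather('base_dir/pixel_som_weights.feather')
print(weights.shape[0])  # number of rows = number of trained SOM clusters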

To work around this error, you can manually change the number of clusters in the Jupyter notebook. The xdim and ydim parameters control the number of clusters in the SOM (the total number of clusters is xdim*ydim; both default to 10, hence 100 clusters by default). Changing the number of clusters from 100 to 96 shouldn't materially change the clustering results.

pixel_pysom = pixel_som_clustering.train_pixel_som(
    fovs,
    channels,
    base_dir,
    subset_dir=pixel_subset_dir,
    norm_vals_name=norm_vals_name,
    som_weights_name=pixel_som_weights_name,
    xdim=12, #ADD THIS: 12 * 8 = 96 clusters
    ydim=8, #ADD THIS
    num_passes=1,
    seed=42
)

The last thing I would suggest is pulling the most recent version of the codebase on main and reinstalling the package with pip install . in the ark-analysis folder. There were some recent changes to the code that may have impacted this.

@alex-l-kong have you seen this error before?

janinemelsen commented 4 months ago

Hi! I am encountering the exact same issue while trying to increase the total number of clusters to 13x13: ValueError: Averaged data contains just 165 clusters out of 169. Average expression file not written. Consider increasing your num_fovs_subset value. The pixel pysom feather file does contain 169 rows in my case.

Could this be related to issue #1036? I will pull the most recent version again to see whether that solves it :)

cliu72 commented 4 months ago

@alex-l-kong It seems it's possible for the SOM training function to return 100 nodes, but then for the mapping function to return fewer than that during inference. I took a quick glance at the C code in pyFlowSOM and I think it's fine. I think we should change the functionality of pixel_cluster_utils.compute_pixel_cluster_channel_avg. Right now, we are using the dimensions of the weights to determine what the number of clusters should be: https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/pixel_som_clustering.py#L356 https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/pixel_cluster_utils.py#L394-L401

This leads to errors when training returns more nodes than are assigned during inference. One solution is to manually keep track of how many total clusters are assigned during inference. I'm open to your thoughts on other solutions.
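For example, a rough sketch of that idea (hypothetical helper, not the current API):

import pandas as pd

# Hypothetical: derive the cluster count from the labels actually assigned
# during inference, rather than from the number of rows in the weights file
def num_assigned_clusters(pixel_data: pd.DataFrame,
                          cluster_col: str = 'pixel_som_cluster') -> int:
    return pixel_data[cluster_col].nunique()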

janinemelsen commented 4 months ago

Just to confirm: in my hands the problem is not solved with the newest version. I consistently get this error now, even when returning only 100 nodes.

cliu72 commented 4 months ago

Thanks @janinemelsen! While we're working on a solution, a quick fix is to either decrease the number of nodes, or manually comment out these lines: https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/pixel_cluster_utils.py#L394-L401 and rebuild the package (pip install the local version with this change made). If you do this, though, make sure that the num_fovs_subset parameter in pixel_som_clustering.generate_som_avg_files is at least the number of your total FOVs. The default is 100, so if you have fewer than 100 FOVs, all of them are used to generate the average file. If you have more than 100 FOVs, set num_fovs_subset to the number of FOVs; this makes the function take longer, but only then can you ensure that all clusters are represented in the average file.
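For example (this assumes the variable names from the notebook; only num_fovs_subset is new relative to the call in the original traceback):

# Set num_fovs_subset to the total FOV count so every FOV contributes
# to the average expression file
pixel_som_clustering.generate_som_avg_files(
    fovs,
    channels,
    base_dir,
    pixel_pysom,
    data_dir=pixel_data_dir,
    pc_chan_avg_som_cluster_name=pc_chan_avg_som_cluster_name,
    num_fovs_subset=len(fovs)
)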

@alex-l-kong can we work on a fix? Can you think of any recent changes that would cause this to be a problem now and not before?

janinemelsen commented 4 months ago

Hi! I am only including one FOV, and after changing the code this error occurred:

/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py:259: UserWarning: Pixel SOM already trained on specified markers
  warnings.warn('Pixel SOM already trained on specified markers')
/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/pixel_cluster_utils.py:348: UserWarning: Provided num_fovs_subset=100 but only 1 FOVs in dataset, subsetting just the 1 FOVs
  warnings.warn(
Traceback (most recent call last):
  File "/exports/immu-staallab/Janine/pixelcluster.py", line 293, in <module>
    pixel_cc = pixel_meta_clustering.pixel_consensus_cluster(
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/pixel_meta_clustering.py", line 140, in pixel_consensus_cluster
    pixel_cc.generate_som_to_meta_map()
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py", line 646, in generate_som_to_meta_map
    self.input_data[self.meta_col] = self.cc.predict_data(self.input_data[self.columns])
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py", line 566, in predict_data
    return self.cluster_(n_clusters=self.bestK).fit_predict(data)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 1124, in fit_predict
    return super().fit_predict(X, y)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/base.py", line 791, in fit_predict
    self.fit(X)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 979, in fit
    return self._fit(X)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 1094, in _fit
    self.labels_ = _hc_cut(self.n_clusters_, self.children_, self.n_leaves_)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 737, in _hc_cut
    raise ValueError(
ValueError: Cannot extract more clusters than samples: 100 clusters where given for a tree with 93 leaves.
alex-l-kong commented 4 months ago

@cliu72 I'll open a PR for this by tomorrow.

cliu72 commented 4 months ago

@janinemelsen What is max_k set to in the Jupyter notebook?

alex-l-kong commented 4 months ago

@janinemelsen I'm still working on testing, but if you'd like to check out a beta version of the PR to address this, you can look at #1124. The function you're running into issues with should no longer require a fixed number of clusters to be present.

The branch name to check out is min_som_cluster (command is git checkout min_som_cluster).

janinemelsen commented 4 months ago

Ahh @cliu72, my max_k was indeed set too high, that was stupid. Thanks! My pixels are now mapping. @alex-l-kong I will check it out.

I really appreciate your help, pixie is awesome:)

cliu72 commented 4 months ago

@janinemelsen No problem! Just for a little more context: max_k controls the number of metaclusters that you start with (the actual number of metaclusters is decided using the GUI). max_k should reflect the "true" number of biological phenotypes you would expect in your data, so it should be far fewer than 100. The actual number of SOM clusters is decided by the xdim and ydim parameters in the train_pixel_som function (the number of SOM clusters is xdim*ydim); these SOM clusters are then metaclustered into max_k metaclusters. The Nature Communications paper has more details if you want some more info!
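As a quick sanity check (illustrative numbers only, not recommendations):

xdim, ydim = 10, 10          # SOM grid -> xdim * ydim = 100 SOM clusters
max_k = 20                   # starting number of metaclusters (biological phenotypes)
assert max_k <= xdim * ydim  # consensus clustering fails if max_k exceeds the cluster count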

janinemelsen commented 4 months ago

Yes, got it! My 'problem' is that I have 167 markers (resulting in a lot of different phenotypes), and a low-frequency population often clusters together with other populations. So I was thinking of increasing the number of SOM clusters, and then also increasing max_k, simply to see whether it would even be possible to cluster those pixels separately.

janinemelsen commented 4 months ago

@alex-l-kong it's working for me now :)

janinemelsen commented 4 months ago

Okay, maybe not... I do get clustering results (# clusters=100, k=50), but they don't make any sense.


Could this be related to the fragmentation error?

/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py:293: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  external_data_norm['pixel_som_cluster'] = som_labels
/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/pixel_cluster_utils.py:348: UserWarning: Provided num_fovs_subset=100 but only 1 FOVs in dataset, subsetting just the 1 FOVs
  warnings.warn(
/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py:646: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

alex-l-kong commented 4 months ago

@janinemelsen data fragmentation doesn't relate to actual data integrity, so nothing is getting compromised. It's simply a warning that the frame in memory isn't stored in an optimal manner, which can happen if the data is big enough, your computer is on the low end in terms of memory (16 GB), and/or if your computer has a lot of processes running.

I'm more concerned about a couple of other things:

  1. It appears only 1 FOV was subsetted during the aggregation step; how many FOVs do you have in total? Results generally can't be conclusive if the SOM is trained on a single FOV.
  2. You set k=50 metaclusters, which seems too high. The data I've worked with only requires 20 or 30. I think you may want to try something like a 20x20 SOM grid (400 SOM clusters) and then 30 metaclusters. @cliu72 open to more suggestions here.
janinemelsen commented 4 months ago

@alex-l-kong I have two FOVs, but the data is really big (MACSima data with 167 markers), so I am now analyzing them one by one, just to see whether the clustering looks fine.

The weird thing is that a few weeks ago I was able to run this smoothly, even with 55 metaclusters.

janinemelsen commented 3 months ago

Problem solved! To speed up the process I decided to train the SOM on a lower proportion of pixels and increased the number of parallel pixels. Apparently that number was set too high, causing these issues. I am back on track now with a beautiful pixel cluster mask :)

cliu72 commented 3 months ago

Should be solved by https://github.com/angelolab/ark-analysis/pull/1124