Closed dcomimmo closed 3 months ago
Hi @dcomimmo! This is a strange error. `pixel_som_clustering.generate_som_avg_files` determines the number of clusters it expects by looking at the weights file, which is generated during training (https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/pixel_som_clustering.py#L356). Usually, if the SOM outputs fewer than 100 clusters during training, the weights file should have fewer than 100 rows (in this case, 96), so this error shouldn't arise. The error suggests that training returned 100 clusters, but inference is assigning only 96 (which is strange). Can you check `pixel_som_weights.feather` and make sure there really are 100 rows? You can use `pd.read_feather` to open that file.
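A quick way to run that check (a sketch: `has_expected_som_nodes` is a hypothetical helper, and the marker columns below are placeholders, not your real channels):

```python
import pandas as pd

def has_expected_som_nodes(weights: pd.DataFrame, xdim: int = 10, ydim: int = 10) -> bool:
    # The weights file should contain one row per trained SOM node,
    # so a 10x10 SOM should yield exactly 100 rows.
    return len(weights) == xdim * ydim

# In practice, load the real file first, e.g.:
# weights = pd.read_feather("pixel_som_weights.feather")
fake_weights = pd.DataFrame({"CD3": [0.1] * 96, "CD8": [0.2] * 96})
print(has_expected_som_nodes(fake_weights))  # False: only 96 rows, not 100
```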
To work around this error, you can manually change the number of clusters in the Jupyter notebook. The `xdim` and `ydim` parameters control the number of clusters in the SOM (the total number of clusters is xdim*ydim; the default is 10 for both, hence 100 clusters by default). Changing the number of clusters from 100 to 96 shouldn't materially change the clustering results.
```python
pixel_pysom = pixel_som_clustering.train_pixel_som(
    fovs,
    channels,
    base_dir,
    subset_dir=pixel_subset_dir,
    norm_vals_name=norm_vals_name,
    som_weights_name=pixel_som_weights_name,
    xdim=12,  # ADD THIS
    ydim=8,   # ADD THIS
    num_passes=1,
    seed=42
)
```
The last thing I would suggest trying is pulling the most recent version of the codebase on main and re-installing the package with `pip install .` in the ark-analysis folder. There were some recent changes to the code that may have affected this.
@alex-l-kong have you seen this error before?
Hi! I am encountering the exact same issue while trying to increase the total number of clusters to 13x13: `ValueError: Averaged data contains just 165 clusters out of 169. Average expression file not written. Consider increasing your num_fovs_subset value.` The pixel pysom feather file does contain 169 rows in my case.
Could this be related to issue #1036? I will try to pull the most recent version again to see whether that solves it :)
@alex-l-kong It seems it's possible for the SOM training function to return 100 nodes, but for the mapping function to return fewer than that during inference. I took a quick glance at the C code in pyFlowSOM and I think it's fine. I think we should change the functionality of `pixel_cluster_utils.compute_pixel_cluster_channel_avg`. Right now, we use the dimensions of the weights to determine what the number of clusters should be:
https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/pixel_som_clustering.py#L356
https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/pixel_cluster_utils.py#L394-L401
This leads to errors when training returns more nodes than are actually assigned during inference. One solution is to manually keep track of how many total clusters are assigned during inference. Open to your thoughts on other solutions.
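That idea can be sketched as follows (a hypothetical helper, not part of ark; it assumes the assigned pixel data carries a `pixel_som_cluster` column, as seen in the warnings later in this thread):

```python
import pandas as pd

def n_assigned_clusters(pixel_data: pd.DataFrame, cluster_col: str = "pixel_som_cluster") -> int:
    # Count the clusters that actually received pixels during inference,
    # instead of trusting the row count of the training weights file.
    return int(pixel_data[cluster_col].nunique())

# Toy example: 6 pixels mapped onto only 3 distinct SOM clusters
assigned = pd.DataFrame({"pixel_som_cluster": [1, 2, 2, 7, 7, 7]})
print(n_assigned_clusters(assigned))  # 3
```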
Just to confirm: in my hands the problem is not solved with the newest version. I consistently get this error now, even when returning only 100 nodes.
Thanks @janinemelsen! While we're working on a solution, a quick fix is to either decrease the number of nodes, or manually comment out these lines: https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/pixel_cluster_utils.py#L394-L401 and re-build the package (pip install the local version with this change made). If you do this, though, make sure that the `num_fovs_subset` parameter in `pixel_som_clustering.generate_som_avg_files` equals your total number of FOVs. The default is 100, so if you have fewer than 100 FOVs, all of them are used to generate the average file. If you have more than 100 FOVs, set `num_fovs_subset` to the number of FOVs; the function will take longer, but only then can you ensure that all clusters are represented in the average file.
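That rule can be written down as a small helper (a sketch; `choose_num_fovs_subset` is a hypothetical name, not part of ark):

```python
def choose_num_fovs_subset(n_fovs: int, default: int = 100) -> int:
    # With fewer FOVs than the default, all FOVs are used anyway;
    # with more, cover every FOV so every cluster can appear in the average file.
    return n_fovs if n_fovs > default else default

print(choose_num_fovs_subset(37))   # 100 (all 37 FOVs get used regardless)
print(choose_num_fovs_subset(250))  # 250
```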
@alex-l-kong can we work on a fix? Can you think of any recent changes that would cause this to be a problem now and not before?
Hi! I am only including one FOV, and after changing the code this error occurred:
```
/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py:259: UserWarning: Pixel SOM already trained on specified markers
  warnings.warn('Pixel SOM already trained on specified markers')
/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/pixel_cluster_utils.py:348: UserWarning: Provided num_fovs_subset=100 but only 1 FOVs in dataset, subsetting just the 1 FOVs
  warnings.warn(
Traceback (most recent call last):
  File "/exports/immu-staallab/Janine/pixelcluster.py", line 293, in <module>
    pixel_cc = pixel_meta_clustering.pixel_consensus_cluster(
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/pixel_meta_clustering.py", line 140, in pixel_consensus_cluster
    pixel_cc.generate_som_to_meta_map()
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py", line 646, in generate_som_to_meta_map
    self.input_data[self.meta_col] = self.cc.predict_data(self.input_data[self.columns])
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py", line 566, in predict_data
    return self.cluster_(n_clusters=self.bestK).fit_predict(data)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 1124, in fit_predict
    return super().fit_predict(X, y)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/base.py", line 791, in fit_predict
    self.fit(X)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 979, in fit
    return self._fit(X)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 1094, in _fit
    self.labels_ = _hc_cut(self.n_clusters_, self.children_, self.n_leaves_)
  File "/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/sklearn/cluster/_agglomerative.py", line 737, in _hc_cut
    raise ValueError(
ValueError: Cannot extract more clusters than samples: 100 clusters where given for a tree with 93 leaves.
```
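The final ValueError above can be reproduced outside ark: it comes straight from scikit-learn when agglomerative clustering is asked for more clusters than there are samples (here, SOM nodes). A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

nodes = np.random.rand(93, 5)  # pretend only 93 SOM nodes survived inference
try:
    # Asking for 100 clusters from 93 samples triggers the same error
    AgglomerativeClustering(n_clusters=100).fit_predict(nodes)
except ValueError as err:
    print(err)  # "Cannot extract more clusters than samples: ..."
```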
@cliu72 I'll open a PR for this by tomorrow.
@janinemelsen What is `max_k` set to in the Jupyter notebook?
@janinemelsen I'm still working on testing, but if you'd like to check out a beta version of the PR to address this, you can look at #1124. The function you're running into issues with should no longer require a fixed number of clusters to be present.
The branch name to check out is min_som_cluster (command: `git checkout min_som_cluster`).
ahh @cliu72 my max_k was set too high indeed, that was stupid. Thanks! My pixels are now mapping. @alex-l-kong I will check it out.
I really appreciate your help, pixie is awesome:)
@janinemelsen No problem! Just for a little more context: `max_k` controls the number of metaclusters that you want to start with (the actual number of metaclusters is decided using the GUI). `max_k` should reflect the "true" number of biological phenotypes you would expect in your data (so it should be far less than 100). The actual number of SOM clusters is decided by the `xdim` and `ydim` parameters in the `train_pixel_som` function (the number of SOM clusters is xdim*ydim); these SOM clusters are then metaclustered into `max_k` metaclusters. The Nature Communications paper has more details if you want more info!
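The two-level scheme described above can be sketched in a few lines (illustrative only: pixie's real metaclustering uses consensus clustering and this just stands in for the grouping step; the 22-channel random matrix is a placeholder for per-SOM-cluster average expression):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

xdim, ydim, max_k = 10, 10, 20
n_som_clusters = xdim * ydim               # 100 SOM clusters from a 10x10 grid
som_avg = np.random.rand(n_som_clusters, 22)  # avg expression per SOM cluster

# Group the 100 SOM clusters into max_k metaclusters
meta_labels = AgglomerativeClustering(n_clusters=max_k).fit_predict(som_avg)
print(len(set(meta_labels)))  # 20
```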
Yes, got it! My 'problem' is that I have 167 markers (resulting in a lot of different phenotypes), and a low-frequency population often clusters together with other populations. So I was thinking of increasing the number of SOM clusters, and then also increasing the max_k, simply to see whether it would even be possible to cluster those pixels separately.
@alex-l-kong for me it's working now :)
Okay, maybe not... I do get clustering results (# clusters=100, k=50), but they don't make any sense:
Could this be related to the fragmentation error?
```
/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py:293: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
  external_data_norm['pixel_som_cluster'] = som_labels
/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/pixel_cluster_utils.py:348: UserWarning: Provided num_fovs_subset=100 but only 1 FOVs in dataset, subsetting just the 1 FOVs
  warnings.warn(
/exports/immu-staallab/Janine/image_env/lib/python3.10/site-packages/ark/phenotyping/cluster_helpers.py:646: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
```
@janinemelsen data fragmentation doesn't relate to actual data integrity, so nothing is getting compromised. It's simply a warning that the frame in memory isn't stored in an optimal manner, which can happen if the data is big enough, your computer is on the low end in terms of memory (16 GB), and/or if your computer has a lot of processes running.
I'm more concerned about a couple other things:
k=50 metaclusters, which seems too high. The data I've worked with only requires 20 or 30. I think you may want to try something like a 20x20 SOM grid (or 200 SOM clusters) and then 30 metaclusters. @cliu72 open to more suggestions here.
@alex-l-kong I have two FOVs, but the data is really big (MACSima data) with 167 markers, so I am now analyzing them one by one, just to see whether the clustering looks fine.
The weird thing is that a few weeks ago, I was able to run this smoothly, even with 55 metaclusters:
Problem solved! To speed up the process, I had decided to train the SOM on a lower proportion of pixels and increased the number of parallel pixels. Apparently that number was set too high, causing these issues. I am back on track now with a beautiful pixel cluster mask :)
Should be solved by https://github.com/angelolab/ark-analysis/pull/1124
Hey there,
I've been trying to run pixie locally on 37 FOVs with 22 channels, but I get a ValueError when calling the `pixel_som_clustering.generate_som_avg_files` function. It seems that it expects at least 100 clusters, while my data generates some 96 clusters. Is there a way to change the default settings, or a workaround here to write the average expression file? The error is raised based on the following version: