angelolab / ark-analysis

Integrated pipeline for multiplexed image analysis
https://ark-analysis.readthedocs.io/en/latest/
MIT License
69 stars 25 forks source link

Don't require user to have all pixel SOM clusters represented during assignment #1124

Closed alex-l-kong closed 3 months ago

alex-l-kong commented 4 months ago

What is the purpose of this PR?

During SOM clustering, it is possible in some cases that not all unique clusters discovered during training will be present in the dataset. In these cases, it's too strict for the SOM cluster mapping function to require all the SOM clusters to be represented.

How did you implement your changes

Add an additional parameter to the SOM cluster assignment function that is defaulted to not require all the SOM clusters to be present. This can be explicitly set to revert back to the original format.

The underlying averaging function is also adjusted accordingly.

cliu72 commented 4 months ago

@alex-l-kong I don't think this is the best solution to the problem. The reason that this check is even in there in the first place is because when we have cohorts of like 2000 FOVs, it would take forever to generate the average file. So we decided to use a subset of all FOVs to generate the average file (the thought is that a subset is a representative sample of all clusters). However, when we take a subset of all FOVs, it's possible that some rare clusters are not included in that subset. For example, if there is a cluster that is in like 1000 FOVs, but not in the 100 subsetted FOVs, that cluster would not be in the average file. This would lead to downstream problems, since that cluster number does exist in some FOVs, but it didn't get a metacluster assignment. Therefore, we thought that adding a test to make sure all clusters are represented would be a good way to get around this issue. If not all clusters are represented in the average file, then the number of FOVs subsetted needs to be increased.

The issue arising now is that if we specify 100 SOM clusters, maybe only 98 SOM clusters are actually assigned across ALL FOVs. So in the average file, there are 98 SOM clusters. This is a different situation than if there are truly 100 SOM clusters assigned across all FOVs, but only 98 of them are in the subset of FOVs used for the average file generation. So the fundamental issue is that when we are checking for the number of clusters, we are using the weights file. However, the weights file can specify 100 clusters, when across all FOVs there are only 98 clusters. That's why the solution that I suggested in the issue is to keep track of the total number of clusters during the mapping step. Open to other suggestions though.

alex-l-kong commented 4 months ago

@cliu72 thanks, I just realized that on my end too and am in the process of modifying it.

alex-l-kong commented 4 months ago

@cliu72 I still think the require_all_som_clusters argument could still be useful in some niche first scenario cases however. It was a bit of an annoyance to a few lab members in the past. We can allow the function to take None as a parameter but hide that functionality and default to requiring an explicit value (which will be the total number of SOM clusters seen).

cliu72 commented 4 months ago

@alex-l-kong I think it's fine to have this as a temporary solution, but if we track the total number of SOM clusters seen and use that to make sure the average file is written correctly, this error should never come up. If it does, there is another problem.

alex-l-kong commented 4 months ago

@alex-l-kong I think it's fine to have this as a temporary solution, but if we track the total number of SOM clusters seen and use that to make sure the average file is written correctly, this error should never come up. If it does, there is another problem.

Ok I did a quick test by removing one SOM cluster, I think this does the trick now.