Resolution in random_trees method

Aeg22 commented 1 year ago

First, thank you for the tool! I have found it to work quite well in accordance with clinical notes.

I am using random_trees with my data to obtain a hierarchical structure of subclusters for later use with phyloplot2. The issue I am experiencing is that there are always 8 subclusters regardless of which sample I run. For all samples, this seems like too high a subcluster resolution, which hampers downstream analysis. Are there parameters I can tweak to fine-tune the number of clusters? I have set tumor_subcluster_pval = 0.001, but this still results in 8 subclusters.

Potentially related, I also noticed another user has frequently seen 8 subclusters using random_trees. Could there be a reason for this related to the method?

If there is no way to influence the number of clusters, would it be appropriate to re-plot the data using the 8 random_trees subclusters after manually trimming the cluster names (e.g. changing 1.1.1.1 and 1.1.1.2 both to 1.1.1)? This would be time-consuming across samples, and I am not sure it would be the most accurate approach.
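The name trimming described above does not have to be done by hand. Below is a minimal sketch in Python, assuming the subcluster labels are plain dot-separated paths exactly as written in this post (e.g. 1.1.1.1). The function names and the cell-to-label mapping are hypothetical and not part of infercnv; if the real labels carry a prefix that itself contains dots, the depth would need adjusting.

```python
from collections import Counter


def trim_label(label, max_depth):
    """Keep at most max_depth dot-separated levels of a subcluster label."""
    parts = label.split(".")
    return ".".join(parts[:max_depth])


def trim_assignments(cell_to_cluster, max_depth):
    """Apply trim_label to every cell's subcluster label."""
    return {cell: trim_label(label, max_depth)
            for cell, label in cell_to_cluster.items()}


# Hypothetical assignments; real ones would come from the random_trees output.
cells = {
    "cell_a": "1.1.1.1",
    "cell_b": "1.1.1.2",
    "cell_c": "1.1.2.1",
    "cell_d": "1.2.1.1",
}

trimmed = trim_assignments(cells, max_depth=3)
sizes = Counter(trimmed.values())  # cells per merged subcluster
```

Trimming this way only merges sibling leaves of the existing tree, so it inherits whatever splits random_trees made higher up; it does not test whether the merged groups are actually distinct.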

Aeg22 commented 1 year ago

I tested with a subset of 70 cells from a visibly homogeneous sample, and there I see 6 subclusters instead of 8. However, this still seems far too high given the low number of cells and the similarity in features.

infercnv.pdf

I have also seen examples where a random_trees subcluster contains a single cell.

GeorgescuC commented 1 year ago

Hi @Aeg22,

Unfortunately, this is a limitation of how the random trees subclustering implementation works and, because of other technical limitations (mainly very long run times as data size increases), it is not a method we have worked to improve further. Instead, we have added a new subclustering option using the Leiden algorithm, which is significantly better and faster. The downside is that there is no direct hierarchy between the clusters. However, you can most likely run a basic hierarchical clustering on one cell sampled from each subcluster in the HMM results, or on the average/median of each subcluster's residual expression, to recover a hierarchy between the subclusters. The Leiden subclustering has many options that affect the resolution, one of them being the fittingly named leiden_resolution parameter.

Regards, Christophe.
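The suggestion of clustering a per-subcluster median to recover a hierarchy could look roughly like the sketch below. It is written in plain Python for illustration only, although infercnv itself is an R package. The choice of Euclidean distance and average linkage, the function names, and the toy expression values are all assumptions; none of this is infercnv code or output.

```python
import math
from itertools import combinations
from statistics import median


def subcluster_medians(expr, labels):
    """Collapse each subcluster to its per-gene median profile.

    expr:   dict cell -> list of per-gene values (same gene order for all cells)
    labels: dict cell -> subcluster name
    """
    groups = {}
    for cell, values in expr.items():
        groups.setdefault(labels[cell], []).append(values)
    return {name: [median(col) for col in zip(*rows)]
            for name, rows in groups.items()}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering; returns merges as (group_a, group_b, distance)."""
    clusters = [(name,) for name in sorted(profiles)]

    def dist(c1, c2):
        # Average pairwise distance between the members of two groups.
        total = sum(euclidean(profiles[a], profiles[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((a, b, dist(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Toy residual-expression values for six cells and two genes (made up).
expr = {
    "c1": [1.0, 1.0], "c2": [1.2, 1.0],  # subcluster A
    "c3": [1.1, 1.1], "c4": [1.3, 1.1],  # subcluster B
    "c5": [0.5, 0.5], "c6": [0.5, 0.7],  # subcluster C
}
labels = {"c1": "A", "c2": "A", "c3": "B", "c4": "B", "c5": "C", "c6": "C"}

profiles = subcluster_medians(expr, labels)
merges = average_linkage(profiles)  # A and B merge first, then C joins them
```

The ordered list of merges is the recovered hierarchy: subclusters that merge early are the most similar. On real data, a standard hierarchical clustering routine applied to the same median profiles would be the more practical choice.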

broadinstitute / infercnv

Resolution in random_trees method #498