caravagnalab / rcongas

clusters reproducibility and size #26

Closed · mmfalco closed this issue 1 year ago

mmfalco commented 1 year ago

I have run CONGAS on several cancer samples and overall I am quite happy with the sensitivity of the regions identified as subclonal and with the subclones it finds. However, there are some samples where CONGAS finds clusters composed of only 1 cell (sorry, but I can't share the data). Moreover, the identity of the cell clusters sometimes changes between runs (I thought that the seed parameter would solve this), and of course it also changes when you vary the number of clusters you test. This seems to be influenced by the small subclusters found, hence the importance of filtering small clusters. For example, this is the clustering result for 2 consecutive runs of the same dataset with the same parameters:

inference <- best_cluster(input_rcongas, clusters = 1:5, model = "MixtureGaussian",
                          param_list = list(theta_shape = theta_vals[1], theta_rate = 1),
                          steps = 500, lr = 0.01, MAP = TRUE, seed = 3,
                          normalize_by_segs = F, method = "ICL")

    run2
run1  c1  c2  c3  c4
  c1 314  13   0   2
  c2 294   7   0   0
  c3   1 221   0   0
  c4   0   1  48   3

This is the comparison after running again, this time with clusters = 1:10:

    run3
run1  c1  c2  c3  c4  c5  c6  c7
  c1 326   3   0   0   0   0   0
  c2   0 299   1   0   0   1   0
  c3   1   1 197   0  23   0   0
  c4   0   0   0  51   0   0   1

    run3
run2  c1  c2  c3  c4  c5  c6  c7
  c1 311 296   1   0   0   1   0
  c2  14   7 197   1  23   0   0
  c3   0   0   0  47   0   0   1
  c4   2   0   0   3   0   0   0

As you can see, the emergence of a small subcluster in the second run has caused clones c1 and c2 from run1 to collapse into a single clone in run2. The validity of the clones detected in run1 seems to be confirmed by the third run (which finds the same clones) now that it has room to find these small clusters. As you can see, in run3 we have c6 and c7 composed of 1 cell each. Is there anything you would suggest to increase reproducibility? Would it make sense to add a parameter to limit the minimum size of the clusters? Or maybe it could also be interesting to filter "undesired" clusters a posteriori from the results, and to be able to use plot_gw_cna_profiles without having to rerun the analysis.
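
For reference, a minimal sketch of how such run-to-run comparison tables can be built in base R. The label vectors here are hypothetical placeholders; how you extract per-cell cluster assignments from a fit depends on your rcongas version:

labels_run1 <- c("c1", "c1", "c2", "c3")       # hypothetical: one cluster label per cell, run 1
labels_run2 <- c("c1", "c2", "c1", "c2")       # hypothetical: same cells, labels from run 2
table(run1 = labels_run1, run2 = labels_run2)  # cross-tabulation like the tables above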

Militeee commented 1 year ago

Hi @mmfalco, thanks again for using our tool. This hypersensitivity to small clusters is actually something I am aware of and that we are trying to solve in the new version (which, if you are interested, you can find under the branch categorical; for now it provides a better estimate of the actual CNV values, which is a bit of a weak point of CONGAS). We have a function to filter post hoc:

rcongas_object_filtered <- filter_clusters(rcongas_object, ncells, abundance)

where ncells is the minimum number of cells and abundance is the minimum mixture proportion a cluster must reach to be kept.

Usually, in my experience, those small clusters come from peculiar cells (like doublets or cells with a high mitochondrial %), so you might also think of filtering them beforehand. Nevertheless, I agree that we should fix this at the algorithmic level (filtering high-residual cells). The best way to get a robust result is probably to run best_cluster 5 times and then choose the run with the best BIC; another suggestion is to set a higher number of steps, like 1000 (and maybe a smaller lr, so you have a better chance of reaching a stable minimum).

S.
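
A minimal sketch of that multi-run strategy, reusing the parameters from the call quoted above. The score accessor get_best_score() is hypothetical; substitute whatever your rcongas version exposes for the ICL/BIC of the selected model:

fits <- lapply(1:5, function(s) {
  best_cluster(input_rcongas, clusters = 1:5, model = "MixtureGaussian",
               param_list = list(theta_shape = theta_vals[1], theta_rate = 1),
               steps = 1000,                # more optimization steps, as suggested
               lr = 0.005,                  # smaller learning rate for a more stable minimum
               MAP = TRUE, seed = s,        # vary the seed across restarts
               normalize_by_segs = F, method = "ICL")
})
scores <- sapply(fits, get_best_score)      # hypothetical accessor; lower IC is better
inference <- fits[[which.min(scores)]]      # keep the best of the 5 restarts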

mmfalco commented 1 year ago

Thanks, I will try the newer version! BTW, I tried the filter_clusters() function and got the following error:

> filter_clusters(inference,ncells = 10, abundance = 0.03)

! Filtering 1 cluster due to low cell counts or abudance
ℹ Reculcating cluster assignement and renormalizing posterior probabilities
! Filtering 2 clusters due to low cell counts or abudance
ℹ Reculcating cluster assignement and renormalizing posterior probabilities
! Filtering 3 clusters due to low cell counts or abudance
ℹ Reculcating cluster assignement and renormalizing posterior probabilities
! Filtering 3 clusters due to low cell counts or abudance
ℹ Reculcating cluster assignement and renormalizing posterior probabilities
! Filtering 5 clusters due to low cell counts or abudance
ℹ Reculcating cluster assignement and renormalizing posterior probabilities
! Filtering 4 clusters due to low cell counts or abudance
ℹ Reculcating cluster assignement and renormalizing posterior probabilities
! Filtering 3 clusters due to low cell counts or abudance
ℹ Reculcating cluster assignement and renormalizing posterior probabilities
Error in calculate_information_criteria(x$inference$models, x, method,  : 
  argument "normalize_by_segs" is missing, with no default

mmfalco commented 1 year ago

Moreover, when I manually modify the filter_clusters function so that the internal function recalculate_information_criteria gets the normalize_by_segs = F argument, it seems that filter_clusters only uses abundance for filtering, whereas it should use whichever is greater between the number of cells and the percentage:

> get_clusters_size(congas_obj)
 c1  c2  c3  c4  c5 
147 120  75   2   1 
> congas_obj<-filter_clusters(congas_obj,ncells = 80, abundance = 0.03)
! Filtering 1 cluster due to low cell counts or abudance
ℹ Reculcating cluster assignement and renormalizing posterior probabilities
! Filtering 2 clusters due to low cell counts or abudance
ℹ Reculcating cluster assignement and renormalizing posterior probabilities
> get_clusters_size(congas_obj)
 c1  c2  c3 
150 120  75 
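
For reference, a minimal sketch of the filtering rule discussed here (drop clusters whose size falls below the larger of the two thresholds), assuming get_clusters_size() returns a named vector of cell counts per cluster, as the printed output above suggests:

ncells    <- 80
abundance <- 0.03
sizes     <- get_clusters_size(congas_obj)        # e.g. c(c1 = 147, c2 = 120, c3 = 75, c4 = 2, c5 = 1)
threshold <- max(ncells, abundance * sum(sizes))  # the stricter of the two criteria (here, 80 cells)
keep      <- names(sizes)[sizes >= threshold]     # clusters that survive the filter

Note that with these thresholds the rule would keep only c1 and c2, also dropping c3 (75 < 80 cells), unlike the output shown above; that discrepancy is exactly the bug being reported.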
Militeee commented 1 year ago

Hey, so the function currently cuts at the higher of the two, or at least it was supposed to work that way. If you think other logical combinations would be more useful, we can add them. I'm on my way to correcting the bug (or, if you'd like to and you already made it work, you can send a pull request).

Militeee commented 1 year ago

Bug fixed! If you need further help, just ask.