HelenaLC / CATALYST

Cytometry dATa anALYsis Tools
66 stars 30 forks source link

plotDiffHeatmap : ambiguity when multiple clusterings could match #381

Open SamGG opened 7 months ago

SamGG commented 7 months ago

IIUC, when multiple columns of cluster_codes could match the levels of cluster_id, the first one is selected silently. https://github.com/HelenaLC/CATALYST/blob/f3e294ed9a4d3f300feb994bb381df6a6b2c8309/R/plotDiffHeatmap.R#L140-L151

IMO, if sum(same) > 1 a warning must be raised show the selected column/clustering, or, an error must be raised to force the user to select the clustering column (k argument).

HelenaLC commented 6 months ago

Thanks for this! I agree that this is a weak point. I think the most suitable place for this would actually be in .check_k here, independent of plotDiffHeatmap. Specifically, all functions accepting a k argument typically call k <- .check_k(sce, k) and then use cluster_ids(sce, k) to retrieve the corresponding assignments. So, .check_k could include a warning when multiple clusterings match k, as you suggested - agreed? I'll try to implement this soon, and thank you kindly again.

SamGG commented 6 months ago

Thanks for your feedback. Here are my thoughts, probably not crystal clear.

  1. The vapply code I cited above smartly matches the levels of the y diffcyt object with the levels of the x CATALYST object. This goal could not be addressed by .check_k because .check_k works only x.
  2. In fact, I think I raised the current issue because I wrongly tried to hack the CATALYST object in order to manage multiple independent clustering. Having reviewed the .check_k and cluster_ids functions, it's now clear to me that it is not possible to add a column to cluster_codes. cluster_codes and cluster_ids are intimately linked together. The aims of cluster_codes is to store the hierarchical merging of high a resolution clustering into lower resolution clusterings. There could be only one single high resolution clustering in cluster_codes. If I want to add a clustering (a high resolution clustering, e.g. the result of kmeans), I need to replace cluster_ids AND cluster_codes together. That's what you wrote in section 8.2 of the vignette, but I read it too quickly.

So, I think that:

  1. there is no need to change .check_k.
  2. there should be no reason to get a more than one result in the vapply code, but it would be good to raise an error in the case there is more than one result.
  3. we could try to improve the vignette by a) first stating the link/complementary between cluster_ids and cluster_codes and b) giving a simpler example such as kmeans or encompassing the current code in a function that returns a simple vector of numerical values.

We can discuss this offline if you want. Sorry for my misunderstanding.