Modify clustering script to store all results in output SCE

allyhawkins commented 2 years ago

Please provide some background on the proposed additions or changes.

As part of the core pipeline, we will want to output an SCE object with clustering assignments included in the colData slot. Based on what we know about clustering, there is no one-size-fits-all approach, so we will want the capability to include results from multiple types of clustering algorithms (e.g., k-means and graph-based) that have been tested over a range of parameters.

What are the changes that you are proposing?

Here, we will need to create an R script that takes as input a normalized SCE with reduced dimensions already calculated. The script will then perform both graph-based and k-means clustering across a range of parameters. We will also calculate cluster purity and silhouette width and check the stability of the clusters, storing all of this information in the SCE object so that it can be used for plotting when creating an output HTML report (a separate issue). The output file should be an SCE object with all clustering stats appended.

Please describe the proposed solution.

The current code used in 04-clustering.Rmd can be converted to an R script that requires the file path to the SCE object and a seed as input. The output should be the modified SCE object.

What potential "gotchas" do we know of?

Another thing we might consider is adding an argument to pick the type of clustering. By default we should perform both clustering methods, but we might want to give users the ability to choose which type of clustering they would like.
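
A minimal sketch of how the script inputs discussed above (SCE file path, seed, and an optional clustering type that defaults to both methods) could be parsed. The flag names `--sce`, `--seed`, and `--cluster_type`, the allowed values, and the function name `parse_cluster_args` are all hypothetical and are not taken from the repository; base R is used here only to keep the example self-contained.

```r
# Parse "--flag value" pairs from the command line.
# All flag names and allowed values below are hypothetical.
parse_cluster_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  # Default: run both clustering methods unless the user narrows the choice
  opts <- list(sce = NULL, seed = NULL, cluster_type = "both")

  if (length(args) %% 2 != 0) {
    stop("Arguments must be given as '--flag value' pairs.")
  }
  idx <- seq_len(length(args) / 2)
  flags <- sub("^--", "", args[2 * idx - 1])
  values <- args[2 * idx]

  unknown <- setdiff(flags, names(opts))
  if (length(unknown) > 0) {
    stop("Unknown option(s): ", paste(unknown, collapse = ", "))
  }
  opts[flags] <- as.list(values)

  # Both the SCE path and the seed are required inputs
  if (is.null(opts$sce)) stop("--sce (path to the SCE .rds file) is required.")
  if (is.null(opts$seed)) stop("--seed is required.")
  opts$seed <- suppressWarnings(as.integer(opts$seed))
  if (is.na(opts$seed)) stop("--seed must be an integer.")

  # Restrict the clustering type to a known set of choices
  opts$cluster_type <- match.arg(opts$cluster_type, c("both", "graph", "kmeans"))
  opts
}

# Intended use inside the script:
# opts <- parse_cluster_args()
# set.seed(opts$seed)
# sce <- readRDS(opts$sce)
```

Keeping "both" as the default preserves the behavior proposed above while still letting a user restrict the run to a single method.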

jashapiro commented 2 years ago

I am not sure we necessarily need to include k-means clustering in this output; I do not think it is widely used in production. The discussion in OSCA describes it only in the context of reducing computation by performing k-means with a large k, followed by other clustering; see https://bioconductor.org/books/3.14/OSCA.basic/clustering.html#in-two-step-procedures.

More generally, I am not sure that we will want to store everything in the SCE object. The comparisons among clusters could potentially get pretty large and complicated. For example, we will have a table of silhouette results from each type of clustering, a table of purities, etc. While we could store all of those in the SCE object, there are no defined storage locations for them, and I worry that it will add complexity beyond what we really need.

I wonder if we might want to store only the cluster assignments, and leave the cluster evaluation to a separate script/report?

allyhawkins commented 2 years ago

> More generally, I am not sure that we will want to store everything in the SCE object. The comparisons among clusters could potentially get pretty large and complicated. For example, we will have a table of silhouette results from each type of clustering, a table of purities, etc. While we could store all of those in the SCE object, there are no defined storage locations for them, and I worry that it will add complexity beyond what we really need.

First, we already have a function that adds a summary table to the metadata of the SCE, so the silhouette widths and purities are already combined into one table, and we have been storing that table in the metadata. That said, I like the idea of separating out the stats. One thought is that we could incorporate this into the clustering script and have an option to also output the stats as a separate object.

allyhawkins commented 2 years ago

Updating this issue to reflect comments that were discussed in today's research focus meeting:

- The clustering script mentioned here will not be included in the core module. Instead, it will be a separate script that takes as input the type of clustering to perform (Louvain, walktrap, or 2-way k-means), the range of parameters to test, and the increments at which to test that range.
- The output will include an SCE object with the cluster assignments from each of these results in the colData, the clustering statistics in a separate data frame (as discussed in #73), and an HTML report showing the results of the different clustering methods.
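
To make the range-and-increment idea above concrete, here is a rough sketch that loops over a parameter range, stores one column of assignments per parameter value, and collects per-cell statistics in a separate data frame. It is an illustration under stated assumptions, not the project's implementation: a random matrix stands in for the SCE's reduced dimensions, a plain data frame stands in for colData, `stats::kmeans` and `cluster::silhouette` are swapped in as readily available stand-ins for whatever clustering and evaluation functions the script ends up using, and the column naming scheme (`kmeans_<k>`) is hypothetical. Graph-based methods would follow the same pattern with the number of nearest neighbors as the varied parameter.

```r
library(cluster)  # provides silhouette(); a recommended package bundled with most R installs

set.seed(2021)

# Stand-in for reducedDim(sce, "PCA"): 60 cells x 5 principal components of random data
pcs <- matrix(
  rnorm(60 * 5),
  nrow = 60,
  dimnames = list(paste0("cell", 1:60), paste0("PC", 1:5))
)

# Hypothetical user inputs: the range of k to test and the increment
k_values <- seq(from = 2, to = 6, by = 2)

# Stand-in for colData(sce): one column of assignments per parameter value
assignments <- data.frame(row.names = rownames(pcs))
stats_list <- list()
cell_dist <- dist(pcs)

for (k in k_values) {
  col_name <- paste0("kmeans_", k)
  clusters <- unname(kmeans(pcs, centers = k, nstart = 5)$cluster)
  assignments[[col_name]] <- factor(clusters)

  # Per-cell silhouette widths, kept outside the assignments table
  sil <- silhouette(clusters, cell_dist)
  stats_list[[col_name]] <- data.frame(
    cell = rownames(pcs),
    cluster_type = "kmeans",
    k = k,
    cluster = clusters,
    silhouette_width = as.numeric(sil[, "sil_width"])
  )
}

# One long data frame of statistics across all tested parameters
cluster_stats <- do.call(rbind, stats_list)
rownames(cluster_stats) <- NULL
```

Keeping the statistics in long format (one row per cell per parameter value) means results from additional clustering types can be appended with `rbind` without changing the columns.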

Before tackling this issue, we should address changes to the clustering-related functions and plotting functions and create the template HTML report that will be rendered, and then return to this.

cbethell commented 2 years ago

Upon further discussion with Ally, we decided to reformat this issue a bit and have it reflect the progress of an initial clustering calculations script.

The idea here would be that the clustering R script would perform all of the relevant clustering calculations and save the results to the SCE object, allowing us to supply the SCE object with the results as input to the clustering template notebook (a subsequent issue). This means that we are separating the calculations from the actual plotting/display of the calculated results.
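
A small sketch of that separation, assuming the results are handed off as files: the calculation step writes its outputs to disk, and the notebook only reads them back. The function name `save_clustering_outputs`, the file formats, and the toy objects are all hypothetical; a plain list stands in for the SCE object, and the optional statistics table reflects the separate data frame discussed earlier in this thread.

```r
# Calculation step: write results to disk so the notebook never recomputes them.
# Function name and file formats are hypothetical.
save_clustering_outputs <- function(sce, cluster_stats, sce_file, stats_file) {
  saveRDS(sce, sce_file)
  write.table(cluster_stats, stats_file, sep = "\t", quote = FALSE, row.names = FALSE)
  invisible(c(sce = sce_file, stats = stats_file))
}

# Toy stand-ins with made-up values, for illustration only
sce_stand_in <- list(colData = data.frame(louvain_10 = factor(c(1, 1, 2))))
stats_stand_in <- data.frame(
  cluster_type = "louvain",
  nearest_neighbors = 10,
  n_clusters = 2
)

out <- save_clustering_outputs(
  sce_stand_in,
  stats_stand_in,
  sce_file = tempfile(fileext = ".rds"),
  stats_file = tempfile(fileext = ".tsv")
)

# Reporting step (e.g., the template notebook): read only, no clustering code
sce_in <- readRDS(out[["sce"]])
stats_in <- read.delim(out[["stats"]])
```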

allyhawkins commented 2 years ago

Just wanted to note that we also want to output a data frame with the statistics within the same script and Snakefile rule, but I think adding that might be a separate issue.

allyhawkins commented 2 years ago

Closed by #191

AlexsLemonade / scpca-downstream-analyses

Modify clustering script to store all results in output SCE #67