AlexsLemonade / OpenScPCA-analysis

An open, collaborative project to analyze data from the Single-cell Pediatric Cancer Atlas (ScPCA) Portal

Add function for cluster stability #779

Closed. sjspielman closed this 1 month ago

sjspielman commented 1 month ago

Closes #773

This PR adds a function to bootstrap clusters and calculate ARI for a given number of reps. I ended up writing a function that mostly wraps calculate_clusters(), thereby letting that function handle argument checking (on the first bootstrap iteration). This function differs from the other evaluation functions in that it takes a vector of clusters (hence, I check that it's not a data frame; I do that b/c, as I learned today, is.vector(df$column) can be FALSE, e.g. for a factor column). The function returns a data frame of ARI results and clustering parameters, as returned by calculate_clusters().
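For readers following along, here is a minimal sketch of the bootstrap-and-ARI idea described above. It is not the code from this PR: `adjusted_rand_index()` and `bootstrap_ari_sketch()` are hypothetical names, the ARI is computed by hand in base R, and `kmeans()` stands in for `calculate_clusters()` so the example is self-contained.

```r
# Adjusted Rand index between two label vectors, computed in base R.
adjusted_rand_index <- function(labels_a, labels_b) {
  stopifnot(length(labels_a) == length(labels_b))
  contingency <- table(labels_a, labels_b)
  # Count pairs of items that share a cell, a row, or a column
  pairs_cells <- sum(choose(contingency, 2))
  pairs_rows <- sum(choose(rowSums(contingency), 2))
  pairs_cols <- sum(choose(colSums(contingency), 2))
  expected <- pairs_rows * pairs_cols / choose(length(labels_a), 2)
  max_index <- (pairs_rows + pairs_cols) / 2
  # Degenerate case, e.g. both partitions place everything in one cluster
  if (max_index == expected) {
    return(1)
  }
  (pairs_cells - expected) / (max_index - expected)
}

# Hypothetical sketch: resample rows with replacement, recluster each
# resample, and compare against the original labels of the sampled rows.
# kmeans() is an assumption standing in for calculate_clusters().
bootstrap_ari_sketch <- function(x, clusters, replicates = 20, centers = 3, seed = NULL) {
  stopifnot(
    is.matrix(x),
    !is.data.frame(clusters),
    nrow(x) == length(clusters)
  )
  if (!is.null(seed)) {
    set.seed(seed)
  }
  ari <- vapply(seq_len(replicates), function(i) {
    idx <- sample(nrow(x), nrow(x), replace = TRUE)
    resampled <- kmeans(x[idx, , drop = FALSE], centers = centers, nstart = 10)$cluster
    adjusted_rand_index(clusters[idx], resampled)
  }, numeric(1))
  # One row per replicate, alongside the clustering parameter used
  data.frame(replicate = seq_len(replicates), ari = ari, centers = centers)
}

# Usage on simulated data with three well-separated groups
set.seed(2024)
x <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)
clusters <- kmeans(x, centers = 3, nstart = 10)$cluster
stability <- bootstrap_ari_sketch(x, clusters, replicates = 10, centers = 3, seed = 2024)
```

ARI values near 1 across replicates would suggest the original assignments are stable under resampling; the real function returns whatever parameter columns calculate_clusters() reports rather than the single `centers` column used here.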

Note also that I updated examples across function docs to use a seed; we want to encourage seeds! (I'll also note that at one point I had the cute idea of actually providing cluster_df to this stability function and grabbing cluster parameters directly from the df, but I changed my mind, mainly because we won't necessarily know all the parameter columns that could be in that df (because of cluster_args), and users may have added their own.)

sjspielman commented 1 month ago

Should be ready for another look! Note also that I had to keep the pc_name argument since it's not one of the arguments passed into calculate_clusters() (it could be, but I'd prefer to extract the matrix, if needed, once rather than on each iteration).

sjspielman commented 1 month ago

Thought of one more edge case: the nrow/length check between the matrix and clusters will pass even if clusters is not a vector but, e.g., a data frame with the same number of columns as the matrix. This is really unlikely to happen, but it can't hurt to catch. I updated here if you want to have a look: https://github.com/AlexsLemonade/OpenScPCA-analysis/pull/779/commits/c38ff0a4ba4f81b9b0199b60542865094360b97d
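A small illustration of why a length-based check alone can let a data frame through, assuming the check compares `nrow()` of the matrix with `length()` of the clusters. The object names and the `is_valid_clusters()` helper are hypothetical and are not taken from the linked commit.

```r
# A matrix with 3 rows (cells) and 4 columns (PCs)
pc_matrix <- matrix(rnorm(12), nrow = 3, ncol = 4)

# A data frame with 3 columns and 5 rows: clearly not one label per cell
clusters_df <- data.frame(a = 1:5, b = 1:5, c = 1:5)

# length() of a data frame counts its columns, so this comparison is TRUE
length_check_passes <- nrow(pc_matrix) == length(clusters_df)

# Hypothetical guard that rejects data frames before comparing lengths
is_valid_clusters <- function(clusters, n_cells) {
  !is.data.frame(clusters) && length(clusters) == n_cells
}

df_accepted <- is_valid_clusters(clusters_df, nrow(pc_matrix))
vector_accepted <- is_valid_clusters(c(1, 2, 1), nrow(pc_matrix))
```

Here `length_check_passes` is TRUE even though the data frame is the wrong shape, while the guard rejects the data frame and accepts the plain vector.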

jashapiro commented 1 month ago

> Thought of one more edge case: the nrow/length check between the matrix and clusters will pass even if clusters is not a vector but, e.g., a data frame with the same number of columns as the matrix. This is really unlikely to happen, but it can't hurt to catch. I updated here if you want to have a look: c38ff0a

I can imagine this failing in other ways you don't expect that would otherwise be fine (any object with an attribute other than names will fail is.vector, not just factors; for example, I think a list of clusters would actually work in the function), so I personally would not have bothered with this. If people do horrible things that result in failures down the line, we can't always stop them.
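To make the `is.vector()` behavior concrete, here is a short base R illustration. The object names are made up for the example; the results follow from the documented rule that `is.vector()` returns TRUE only for vectors with no attributes other than names.

```r
plain <- c(1L, 2L, 1L)
named <- c(a = 1L, b = 2L, c = 1L)
as_factor <- factor(plain)
with_attr <- structure(plain, note = "custom attribute")
as_list <- list(1L, 2L, 1L)

checks <- c(
  plain = is.vector(plain),                      # TRUE
  named = is.vector(named),                      # TRUE: names are allowed
  factor = is.vector(as_factor),                 # FALSE: levels and class attributes
  with_attr = is.vector(with_attr),              # FALSE: any non-names attribute
  list = is.vector(as_list),                     # TRUE: the default mode "any" includes lists
  data_frame = is.vector(data.frame(x = plain))  # FALSE: class and row.names attributes
)
```

So an `is.vector()` guard rejects factors and other attributed objects that might otherwise work, while still letting lists through.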