Implement the https://satijalab.org/seurat/reference/findsubcluster

Yunuuuu commented 1 year ago

Now, quickSubCluster will always sub-cluster all the groups, this commit enables the sub-clustering in a or more specified groups.

LTLA commented 1 year ago

Instead of making a new function, why not just add an extra argument (e.g., restrict.to= or something) to quickSubCluster? This would avoid the duplication of all the other code.

Yunuuuu commented 1 year ago

Hi @LTLA, I intended to add an argument, but the returned value by quickSubCluster now is a atomic vector or a List object based on the argument "simplify", and I think the value returned by findSubCluster should be a factor like clusterCells or a SingleCellExperiment object with the sub-clustering label added to the original object, I don't know if this is a good choice?

LTLA commented 1 year ago

I don't really see the problem here. The user can just factor the output to get a factor, if that's what they want.

It's a bit trickier to decide what to do when simplify=FALSE, but my intuition would be to do something like:

    all.sce <- list()
    all.clusters <- as.character(groups)
    by.group <- split(seq_along(groups), groups)

    if (!is.null(restricted)) { # maybe check that 'restricted' is a subset of 'names(by.group)'
        by.group <- by.group[as.character(restricted)]
    }

    for (i in names(by.group)) {
        y <- x[,by.group[[i]]]
        if (normalize) {
            y <- logNormCounts(y, exprs_values=assay.type)
        }

        y <- prepFUN(y)
        if (ncol(y) >= min.ncells) {
            clusters <- clusterFUN(y)
            clusters <- sprintf(format, i, clusters)
        } else {
            clusters <- rep(i, ncol(y)) 
        }

        y$subcluster <- clusters
        all.sce[[i]] <- y
        all.clusters[by.group[[i]]] <- clusters
    }

Basically, the reported SEs are restricted to those in restricted, but the full-length clustering vector is still returned.

Is there any reason why this wouldn't work for your use-case?

Yunuuuu commented 1 year ago

This is fine, I just wanted to keep the order of the original cell. but I don't know whether it is intensive necessary to keep the original order? I prefer to keeping the order since the data is too large to keep all the SCE object (time consuming) and I always save the cluster label data to analyze for another clustering round. Only in this way, I can use colLabel(my_sce) <- readRDS("the path of my saved cluster labels") directly in a new R session.

LTLA commented 1 year ago

I'm not entirely sure what you mean by "keep the order". Does my_sce contain all the cells? A subset of cells?

I would suggest that you modify your PR so that quickSubCluster becomes the relevant function, as suggested above. Then it may be easier for me to understand the difference between the current output format and your desired format.

Yunuuuu commented 1 year ago

Sorry for the delayed respond. Yes, my_sce contain all the cells. After double checking the code in https://github.com/MarioniLab/scran/pull/105#issuecomment-1340783277, I found this is what I want, the order of all.clusters is the same with the column order in my_sce. I'll modify the quickSubCluster code and re-pull request. Thanks for your help @LTLA

Yunuuuu commented 1 year ago

I finally modified .quick_sub_cluster as follows:

.quick_sub_cluster <- function(x, groups, restricted = NULL, normalize = TRUE,
                               prepFUN = NULL, min.ncells = 50, clusterFUN = NULL, BLUSPARAM = NNGraphParam(),
                               format = "%s.%s", assay.type = "counts", simplify = FALSE) {
    if (normalize) {
        alt.assay <- "logcounts"
    } else {
        alt.assay <- assay.type
    }

    if (is.null(prepFUN)) {
        prepFUN <- function(x) {
            # Putting this check here to avoid skipping
            # user-specified modifications.
            if (ncol(x) < 2L) {
                return(x)
            }

            # For consistency with quickCluster().
            dec <- modelGeneVar(x, assay.type = alt.assay)
            top <- getTopHVGs(dec, n = 500, prop = 0.1)
            denoisePCA(x, dec, subset.row = top, assay.type = alt.assay)
        }
    }
    if (is.null(clusterFUN)) {
        clusterFUN <- function(x) {
            clusterRows(reducedDim(x, "PCA"), BLUSPARAM)
        }
    }
    # just save the levels of the group factor and we can insert the new levels
    # in the position of the current operated group.
    subcluster.levels <- levels(groups)
    all.clusters <- as.character(groups)
    if (is.null(subcluster.levels)) {
        subcluster.levels <- unique(all.clusters)
    }
    by.group <- split(seq_along(all.clusters), all.clusters)
    if (!is.null(restricted)) {
        if (!all(restricted %in% names(by.group))) {
            stop("all groups specified in `restricted` should be in the groups")
        }
        by.group <- by.group[as.character(restricted)]
    }
    all.sce <- vector("list", length(by.group))
    names(all.sce) <- names(by.group)

    for (i in names(by.group)) {
        y <- x[, by.group[[i]]]
        if (normalize) {
            y <- logNormCounts(y, exprs_values = assay.type)
        }

        y <- prepFUN(y)
        if (ncol(y) >= min.ncells) {
            clusters <- clusterFUN(y)
            clusters <- sprintf(format, i, clusters)
        } else {
            clusters <- rep(i, ncol(y))
        }

        y$subcluster <- clusters
        all.sce[[i]] <- y
        all.clusters[by.group[[i]]] <- clusters

        # insert the new cluster labels into the original position
        operated.cluter.idx <- which(i == subcluster.levels)
        subcluster.levels <- append(
            subcluster.levels,
            sort(unique(clusters)),
            after = operated.cluter.idx
        )
        subcluster.levels <- subcluster.levels[-operated.cluter.idx]
    }
    all.clusters <- factor(all.clusters, subcluster.levels)
    if (simplify) {
        all.clusters
    } else {
        output <- as(all.sce, "List")
        metadata(output) <- list(index = by.group, subcluster = all.clusters)
        output
    }
}

And I also modified the documents for the param and return section

#' @param restricted the groups to be sub-clustered, should be a subset of
#' \code{groups}. If \code{NULL}, will use all groups.

#' If \code{simplify=TRUE}, the factor of subcluster identities is returned
#' following the format defined in \code{format}. (Unless the number of cells in
#' the parent cluster is less than \code{min.cells} or the parent cluster is not
#' specified in \code{restricted}, in which case the parent cluster's name is
#' used directly.). the order of the factor is always matched with the column
#' order of \code{x}, so we can save the subcluster results easily if \code{x}
#' is a \code{SingleCellExperiment} by \code{x$subcluster <-
#' subcluster_results}.

Yunuuuu commented 1 year ago

Can this be used? If so, I'll commit it

LTLA commented 1 year ago

Thanks for the effort. In the end, I just ended up implementing it myself to save everyone some time. You can try it out on the latest master, or by installing 1.27.1 whenever it comes out.

For future reference, here's some review comments on your code:

Line 131 seems to assume factor input for group, which is not true in general.
Line 173 enforces factor output, which is a breaking change.
Changes to the code style and formatting (automatically from Rstudio, I presume) are distracting.
Changes to the 1-sentence-1-line format of the plain-text documentation is distracting and complicates the diff.
Lines 165-171 are excessively complicated for insertion.

Yunuuuu commented 1 year ago

Thanks a lot for reviewing the code, the latest commit has provide a simple way to do this, it has already been good enough.

MarioniLab / scran

Implement the https://satijalab.org/seurat/reference/findsubcluster #105