colossal-compsci / tfboot

R package for bootstrapping motifbreakR results
https://colossal-compsci.github.io/tfboot/

use built-in parallel and remove code for future map #6

Closed stephenturner closed 1 year ago

stephenturner commented 1 year ago

Deprecate the parallel_motifbreakR function, and use built-in parallelization instead.
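
As a rough sketch of what the built-in route might look like: `motifbreakR::motifbreakR()` takes a `BPPARAM` argument from BiocParallel, so the worker count can be set there rather than through a furrr wrapper. The helper name `pick_workers()` is hypothetical and not part of tfboot. The final call is left commented out because it assumes BiocParallel and motifbreakR are installed and that `myprosnps` and `motifs` exist as in the vignette.

```r
# Hypothetical helper (not part of tfboot): choose a worker count,
# leaving one core free, as parallel_motifbreakR() did by default.
pick_workers <- function(cpus = NULL) {
  maxcpus <- parallel::detectCores() - 1
  # detectCores() can return NA on some platforms; fall back to one worker.
  if (is.na(maxcpus) || maxcpus < 1) maxcpus <- 1
  if (is.null(cpus) || cpus > maxcpus) cpus <- maxcpus
  as.integer(max(1, cpus))
}

workers <- pick_workers()

# Assumed usage, with objects named as in the vignette:
# bp <- BiocParallel::MulticoreParam(workers = workers)
# mb <- motifbreakR::motifbreakR(myprosnps, pwmList = motifs, BPPARAM = bp)
```

`MulticoreParam` relies on forking, so on Windows a `BiocParallel::SnowParam(workers = workers)` back end would likely be the closer equivalent.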

Remove from codebase:

```r
#' Future map over motifbreakR
#'
#' @param grl A list of GRanges objects.
#' @param cpus The number of CPUs you want to run the analysis with. Defaults to the maximum number of cores, minus 1.
#' @param ... Further arguments passed to [motifbreakR::motifbreakR].
#'
#' @return motifbreakR results as a list, one for each gene.
#' @export
#'
#' @examples
parallel_motifbreakR <- function(grl, cpus=NULL, ...) {
  stopifnot(inherits(grl, "list"))
  stopifnot(inherits(grl[[1]], "GRanges"))
  genome.package <- unique(unlist(lapply(grl, function(x) x@genome.package)))
  stopifnot(length(genome.package)==1L)
  maxcpus <- parallel::detectCores()-1
  if (is.null(cpus) || cpus>maxcpus) cpus <- maxcpus
  message(sprintf("Parallelizing using %s CPUs", cpus))
  f <- function(x, ...) {
    suppressMessages(loadNamespace(genome.package))
    suppressMessages(loadNamespace("MotifDb"))
    motifbreakR::motifbreakR(x, ...)
  }
  future::plan(future::multisession, workers=cpus)
  furrr::future_map(grl, function(x) f(x, ...))
}
```
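
If a softer transition than outright deletion were wanted, one option is a short shim that warns and points callers at the built-in route. This is only a sketch under that assumption: the issue itself asks for the code to be removed, and the message wording here is illustrative. It uses base R's `.Deprecated()`, which signals a warning of class `deprecatedWarning`.

```r
# Sketch of a deprecation shim; tfboot itself may simply delete the function.
parallel_motifbreakR <- function(grl, cpus = NULL, ...) {
  .Deprecated(
    new = "motifbreakR::motifbreakR",
    msg = paste(
      "parallel_motifbreakR() is deprecated;",
      "use motifbreakR::motifbreakR() with its BPPARAM argument instead."
    )
  )
  # The shim does no work; callers are expected to switch to the built-in route.
  invisible(NULL)
}
```

Calling the shim returns `NULL` invisibly after emitting the warning, so existing scripts fail loudly at the next step rather than silently running the old furrr path.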

Remove from vignette:

We could speed up the process by splitting the regions up by gene then running motifbreakR on each one of them. First let's use the tfboot `split_gr_by_id()` function to split this GenomicRanges object by ID (here, `"gene_id"`).

Note that we have one element in this list for each gene, and they're all of class `GRanges`, as expected.

```{r}
myprosnps_list <- split_gr_by_id(myprosnps, split_col="gene_id")
length(myprosnps_list)
unique(lapply(myprosnps_list, class))
```

Now, let's use the tfboot `parallel_motifbreakR()` function to run the same code in parallel across these genes, using one core per gene. This uses the furrr package to map the `motifbreakR()` function over this list of promoter region SNPs. By default this will use all but one of the available CPUs on your machine. See the help for `?parallel_motifbreakR` for info on how to change this.

```{r}
mb_pois_list <- parallel_motifbreakR(myprosnps_list, pwmList=motifs)
```

This doesn't save much time with only `r length(mygenes)` genes. However, we can expand this analysis to all `r length(unique(prosnps$gene_id))` genes having SNPs on chromosome 38 for this sample.

```{r}
prosnps_list <- split_gr_by_id(prosnps, split_col="gene_id")
length(prosnps_list)
head(prosnps_list, 2)
mb_res_all <- parallel_motifbreakR(prosnps_list, pwmList=motifs)
```