Nested parallelization in `create_SE` via `map_df`

ChristopherEeles commented 1 year ago

In create_SE and probably elsewhere, the map_df function is called inside of already parallelized calls, resulting in nested parallelization which does not respect the number of workers set via BiocParallel::register.

Options for nested parallelization with BiocParallel can be controlled by passing a list of back-end objects to the BPPARAM parameter in bplappy. It is probably a good idea to make the default only one level of parallelization unless explicitly set by the user via BPPARAM.

It also seems a bit excessive to have so many nested parallelized calls and maybe some of them can be removed?

To reprex, pull my repo and run:

library(gDR)
library(gDRcore)
library(data.table)
library(qs)
library(SummarizedExperiment)
library(BumpyMatrix)
library(MultiAssayExperiment)
library(BiocParallel)

# -- Configure parallelization
# unset environment variable as work around for gDRcore zzz.R bug
.Options$MutlicoreParam <- NULL
nthread <- 4L
BiocParallel::register(MulticoreParam(workers=nthread))

# -- Script configuration
data_dir <- file.path("procdata")
identifiers <- "gDR"
input_file <- file.path(data_dir, "griner_preprocessed_response.csv")

# -- Data processing
#' We must skip data import, since we never have the full set of raw data
#' for our dose-response experiments
#'
#' Generally, we only have our prenormalized response values present
mathews_griner <- fread(input_file)

#' Update names to match PharmacoGx or gDR standard identifiers
gDR_style <- c(
    # PharmacoGx
    "Value"="viability",
    # gDR
    "RowName"="DrugName",
    "ColName"="DrugName_2",
    "RowDose"="Concentration",
    "ColDose"="Concentration_2",
    "RowTarget"="drug_moa",
    "ColTarget"="drug_moa_2",
    "BlockId"="Barcode",
    "AssayId"="Plate"
)
PGx_style <- c(
    # PharmacoGx
    "Value"="viability",
    "RowName"="treatment1id",
    "ColName"="treatment2id",
    "RowDose"="treatment1dose",
    "ColDose"="treatment2dose",
    "RowTarget"="treatment1_mechanism",
    "ColTarget"="treatment2_mechanism",
    "Replicate"="tech_rep",  # assume technical replicates, unless otherwise stated
    # gDR
    "BlockId"="Barcode",
    "AssayId"="Plate"
)
rename_cols <- if (identifiers == "PGx") PGx_style else gDR_style
setnames(mathews_griner, names(rename_cols), unname(rename_cols))

#' Add mandatory missing values to our dataset
if (identifiers == "PGx") {
    mathews_griner[, duration := 72L]  # assume 72 hrs exposure
    mathews_griner[, sampleid := "TMD8"]  # cell-line retrieved from publication
    mathews_griner[, tissueid := "Lymphoma"]  # tissue based on cell-line
    mathews_griner[, division_time := 24L]  # add dummy division time based on average mammalian cell-cycle
}
if (identifiers == "gDR") {
    mathews_griner[, Gnumber := as.character(rleid(DrugName))]
    mathews_griner[, Gnumber_2 := as.character(rleid(DrugName_2))]
    mathews_griner[, Duration := 72L]  # assume 72 hrs exposure
    mathews_griner[, CellLineName := "TMD8"]  # cell-line retrived from publication
    mathews_griner[, clid := as.character(rleid(CellLineName))]
    mathews_griner[, Tissue := "Lymphoma"]  # tissue based on cell-line
    mathews_griner[, ReferenceDivisionTime := 24L]  # add dummy division time based on average mammalian cell-cycle
}
mathews_griner[
    Concentration == 0,
    (c("DrugName", "drug_moa")) := "untreated"
]
mathews_griner[
    Concentration_2 == 0,
    (c("DrugName_2", "drug_moa_2")) := "untreated"
]

#' Update the gDR identifiers to match our data
if (identifiers == "PGx") {
    reset_env_identifiers()  # populates the environment with defaults
    gdr_to_pgx_identifiers <- list(
        "cellline"="sampleid",
        "cellline_tissue"="tissueid",
        "cellline_ref_div_time"="division_time",
        "drug"="treatment1id",
        "drug2"="treatment2id",
        "drug_name"="treatment1id",
        "drug_name2"="treatment2id",
        "drug_moa"="treatment1_mechanism",
        "drug_moa2"="treatment2_mechanism",
        "concentration"="treatment1dose",
        "concentration2"="treatment2dose",
        "duration"="duration",
        "barcode"="Barcode"
    )
    for (i in seq_along(gdr_to_pgx_identifiers))
        set_env_identifier(names(gdr_to_pgx_identifiers)[i], gdr_to_pgx_identifiers[[i]])
}
se <- gDRcore:::create_SE(
    as.data.frame(mathews_griner),
    nested_confounders="Barcode",
    readout="viability"
)

This is what htop looks like after like 1 minute:

You can see it is touching all 24 thread despite setting the number of workers to 4. And it eventually crashes with OOM.

Best, Chris

bczech commented 1 year ago

Hi @ChristopherEeles ,

nested parallelization has been removed from the code.

Best, Bartek

ChristopherEeles commented 1 year ago

Thanks @bczech

gdrplatform / gDRcore

Nested parallelization in `create_SE` via `map_df` #26