athchen / beer

Bayesian enrichment estimation in R

BEER - BiocParallel advice #8

Closed · martinezvbs closed this issue 3 months ago

martinezvbs commented 3 months ago

Hi,

I am trying to use BEER on a matrix; see below:

class: PhIPData 
dim: 40671 27 
metadata(2): NCBI Fragment
assays(3): counts logfc prob
rownames(40671):
rowData names(0):
colnames(27): 1 2 ... 26 27
colData names(4): sample condition type group
beads-only name(2): beads
  1. Before creating the PhIPData object, I removed all rows where the total count was equal to 0 (going from ~600K rows to ~40K; see the sketch below).
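For reference, a minimal sketch of that zero-count filter, assuming seminoma_counts is the raw peptide-by-sample counts matrix:

# Drop peptides (rows) whose total count across all samples is zero
seminoma_counts <- seminoma_counts[rowSums(seminoma_counts) > 0, ]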

In R, I was running the following for the differential analysis:

seminoma_data <- PhIPData(counts = seminoma_counts, sampleInfo = seminoma_metadata, metadata = Notes)
seminoma_differential <- runEdgeR(seminoma_data, BPPARAM = BiocParallel::SerialParam(),
                                  assay.names = c(logfc = "logfc", prob = "prob"))
seminoma_differential <- brew(seminoma_differential, assay.names = assay_locations, 
                              BPPARAM = BiocParallel::SerialParam())

However, every time I try to run the above, R stops working (I also tried MulticoreParam for BiocParallel). I migrated to an R server (30 cores per task and plenty of memory) and used the following code, but it still crashes.

library(BiocParallel)
register(MulticoreParam(30))

In this case, what would be good to change? I was thinking of reducing the number of rows; however, I would like to try something else before trimming the counts.

My system: R 4.3.2 / BEER 1.6.0 / PhIPData 1.10.0

Thanks!

athchen commented 3 months ago

I think 40K is still a large number of rows, especially for brew. I'm less sure about runEdgeR, since that is based on the edgeR package, but if that command is also crashing, then 40K might just be too large a dataset. The parallelization happens on a per-sample basis, so if both commands are crashing, that suggests you may want to break your 40K dataset into smaller ones (around 3,000 rows or fewer is what we've tested, but you could try something like 10K). As you mentioned, you could reduce the rows by setting a counts threshold, using an outlier detection method, or simply partitioning the dataset; a sketch of the partitioning approach is below. Hope this helps!
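For what it's worth, a minimal sketch of the partitioning approach, assuming seminoma_data is the PhIPData object above; the chunk size, the lapply loop, and the use of brew's default assay names are illustrative assumptions, not something tested on this dataset:

library(beer)
library(BiocParallel)

# Split the ~40K peptides into row-index chunks of roughly 3,000 each
chunk_size <- 3000
row_chunks <- split(seq_len(nrow(seminoma_data)),
                    ceiling(seq_len(nrow(seminoma_data)) / chunk_size))

# Run BEER on each chunk; subsetting a PhIPData object by rows keeps all
# samples (including the beads-only columns) in every chunk
chunk_results <- lapply(row_chunks, function(idx) {
  brew(seminoma_data[idx, ], BPPARAM = SerialParam())
})

# Recombine the per-chunk results into one object (rbind as for other
# SummarizedExperiment-derived objects; check the combined metadata afterwards)
seminoma_brew <- do.call(rbind, chunk_results)

If runEdgeR is also crashing, the same per-chunk pattern could be applied to it as well.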