antpiron / RedRibbon

A fast rank-rank hypergeometric overlap R package
GNU General Public License v3.0

Parallelization and C stand alone libraries #4

Open MithilG opened 3 weeks ago

MithilG commented 3 weeks ago

Hello RedRibbon developers, thank you very much for the RedRibbon software. I was able to analyze most of my epigenomics datasets with it.

However, I have some large datasets, with 9-13 million epigenomic bins, which I cannot analyze even on high-performance computing facilities with a maximum run time of 96 hours.

I was wondering whether I could use a parallelization method, mcmapply(), for the quadrant function by editing the R script rrho.r in the R/ folder. The quadrant() function can be edited with the following code:

enrich.ret <- parallel::mcmapply(enrichment, a = coord[1], b = coord[2], mc.cores = mcores)

After running the two R scripts in the R/ folder, I can run the RedRibbon() function, but the quadrant() function relies on the C standalone libraries, as it gives me the following error:

Error in .Call("rrho_r_rectangle_min_ea", as.integer(i), as.integer(j),  : 
  C symbol name "rrho_r_rectangle_min_ea" not in load table

To circumvent the problem, I generated the dynamic shared library and loaded it with dyn.load("rectangle_min_ea.so") in RStudio, but then I face the following issue:

rectangle_min_ea.so: undefined symbol: rrho_initial_population_func
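
The first error above means that `.Call()` could not find the named symbol among the shared libraries currently loaded in the R session. A minimal sketch of how one might check this from R, using only base functions (`is.loaded()`, `requireNamespace()`); it assumes, without confirmation from the maintainers, that the installed RedRibbon package loads its compiled code when its namespace is loaded, and it is guarded so that it also runs where the package is not installed:

```r
# Symbol name copied verbatim from the error message above.
sym <- "rrho_r_rectangle_min_ea"

# Is the symbol resolvable in the current session right now?
loaded_before <- is.loaded(sym)

# Assumption: loading the installed package namespace (instead of running
# the files in R/ one by one) also loads the package's compiled library.
if (requireNamespace("RedRibbon", quietly = TRUE)) {
  loaded_after <- is.loaded(sym)
} else {
  loaded_after <- NA  # package not installed here, nothing to check
}

print(c(before = loaded_before, after = loaded_after))
```

If the symbol only becomes resolvable after the namespace is loaded, reinstalling the modified package as a whole is likely a simpler route than calling `dyn.load()` on individual object files.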

I am not a C user, so resolving this issue is daunting for me: I cannot find the requested libraries in the src/ folder.

Could you comment on whether this is how one should proceed (by loading the C libraries), or whether there is any provision to parallelize the script?

Thank you very much in advance!

antpiron commented 2 weeks ago

Hi,

I have never tried with that many features, but it should work reasonably well without permutation on a standard machine with sufficient memory. On a machine with many cores (>48c/96t), the permutation should also be runnable.

The enrichment part is fast, and parallelizing it will not result in significantly faster execution.

For very large datasets (if you do not have access to a machine with many cores), I would advise deactivating the computation of the adjusted P-value (permutation = FALSE), as in this demo code:
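
As a side note on the `mcmapply()` call quoted in the question: `mcmapply()` iterates over the elements of its vector arguments, so passing two scalars such as `coord[1]` and `coord[2]` yields a single function call and nothing to distribute. A minimal sketch with base R's `parallel` package; `toy_enrichment`, `coords_a` and `coords_b` are hypothetical stand-ins, not RedRibbon internals:

```r
library(parallel)

# Hypothetical stand-in for an enrichment step (not the package's function).
toy_enrichment <- function(a, b) a * b

# Hypothetical coordinate vectors.
coords_a <- 1:4
coords_b <- 5:8

# Forking is not available on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Scalar arguments: the function is called exactly once.
single <- mcmapply(toy_enrichment, a = coords_a[1], b = coords_b[1],
                   mc.cores = n_cores)

# Vector arguments: one call per (a, b) pair, shared among the workers.
many <- mcmapply(toy_enrichment, a = coords_a, b = coords_b,
                 mc.cores = n_cores)

print(length(single))
print(many)
```

Even with vector arguments, forking only pays off when each call is expensive compared with the cost of starting workers and collecting results.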

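library(RedRibbon)
library(data.table)

# Number of features in the synthetic dataset
n <- 1000000
half <- n / 2

# Two lists of values: a is centered around zero, b is a random shuffle of a
a <- (1:n) - half
b <- sample(a)

df <- data.table(id = 1:n, a = a, b = b)

# Build the RedRibbon object, then search the quadrants with the
# evolutionary algorithm, skipping the permutation-based adjusted P-value
rr <- RedRibbon(df, enrichment_mode = "hyper-two-tailed")
quad <- quadrants(rr, algorithm = "ea", permutation = FALSE, whole = FALSE)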

Anthony.