BRANCHlab / metasnf

Scalable subtyping with similarity network fusion
https://branchlab.github.io/metasnf/

Just a little note about parallelization! #11

Closed: apdlbalb closed this issue 3 months ago

apdlbalb commented 3 months ago

Hello! I just wanted to mention that parallelization, as it is currently implemented in metasnf, doesn't seem to be compatible with the HPC architecture on SciNet. Future users may consider writing their R script with the doParallel package as an alternative!
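
For anyone who ends up here, the generic doParallel pattern looks something like this (a sketch only, not metasnf-specific; the worker count is just an example):

library(doParallel)  # also attaches foreach and parallel

cl <- makeCluster(4)        # four worker processes; match this to your allocation
registerDoParallel(cl)      # point %dopar% at this cluster
res <- foreach(i = 1:10, .combine = c) %dopar% sqrt(i)
stopCluster(cl)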

pvelayudhan commented 3 months ago

Uh-oh! Did you happen to get a specific error message of some kind? metasnf uses the "future" packages for parallelization, e.g. https://future.apply.futureverse.org/. If it is really the case that the future packages flat out do not work on SciNet, I will reach out to them to see what's going on.

But it may also be that future itself works fine and there's something wrong with the way I've implemented it. I'm particularly suspicious about calls to future::availableCores(); that function is used to assess how many processes are available, but I'm not certain that its output format is identical on every system. I'll also try to get things working on the Cedar cluster.
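
If you get a chance, a quick diagnostic like this run inside one of your SciNet jobs would help narrow things down (which = "all" is a parallelly option that reports one core estimate per detection method; I'm assuming it behaves the same on your build):

library(future)

availableCores()              # the single value a plan would typically use
availableCores(which = "all") # named vector, one estimate per detection method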

pvelayudhan commented 3 months ago

Giving it a go on the Cedar cluster (not Niagara or another SciNet system specifically, but also managed by DRAC), I've found that the parallel processing tutorial here: https://branchlab.github.io/metasnf/articles/parallel_processing.html seems to work just fine. This was after loading the StdEnv/2023 and r/4.3.1 modules and installing only devtools, metasnf, future, future.apply, and whatever dependencies they come with.

I then raised the number of rows to 100 and encountered this issue:

libgomp: Thread creation failed: Resource temporarily unavailable

I tried requesting a bash session with some specific resource requirements:

srun -c 32 -N 1 --mem 32G --pty --x11 /bin/bash

With those resources, things worked fine again. I'll admit it runs pretty slowly on this toy data, but I'm not sure if that slowness is what you're referring to as the issue.

I'm not entirely sure what is going on, but at the very least it does look like the future package works on something SciNet-adjacent.
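
One guess about the libgomp error: if each R worker process also spawns its own OpenMP threads, the total thread count can multiply past the job's limits. That's an assumption on my part rather than something I've confirmed about metasnf, but pinning OpenMP to one thread per worker is cheap to try before launching the parallel run:

Sys.setenv(OMP_NUM_THREADS = "1")  # cap OpenMP at one thread per R process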

Could you check to see if this code works for you? It's from the Future documentation:

library(future.apply)
plan(multisession) ## Run in parallel on local computer

library(datasets)
library(stats)
y <- future_lapply(mtcars, FUN = mean, trim = 0.10)

apdlbalb commented 3 months ago

It looks like the code from the Future documentation works OK! I'll send you the job reports from my previous attempts, but I now think the problem is related to how future.apply and doParallel differ in handling memory across processes:

/var/spool/slurm/slurmd/job12398101/slurm_script: line 13: 120875 Bus error               (core dumped) R --no-save < diagnosis_v2.R
slurmstepd: error: Detected 869 oom_kill events in StepId=12398101.batch. Some of the step tasks have been OOM Killed.

scontrol show job 12398101
JobId=12398101 JobName=job.sh
   UserId=abalbon(3116337) GroupId=lungboy(6033254) MCS_label=N/A
   Priority=1602049 Nice=0 Account=def-lungboy QOS=normal
   JobState=COMPLETING Reason=OutOfMemory Dependency=(null)

pvelayudhan commented 3 months ago

How much RAM are you requesting during your submission? Could you try ramping up the amount of requested memory? And have you gotten the same SNF operation to succeed with doParallel but fail with future.apply?

Parallelization can use memory comparable to that of the sequential command multiplied by the number of available processes, so raising the requested memory might help. Restricting processes = 2 (see the sketch below) would also be a way to check whether the implementation itself is fine and more RAM simply needs to be requested to compensate for the number of processes in use.
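
For example, following the interface from the parallel processing vignette linked above (treat the argument names as assumptions if the API has shifted since):

solutions_matrix <- batch_snf(
    data_list = data_list,
    settings_matrix = settings_matrix,
    processes = 2  # two workers instead of "max": roughly 2x sequential memory
)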

apdlbalb commented 3 months ago

My mistake! I took another look at the job reports, and it looks like parallelization did not work with doParallel either. I have been running metasnf as a job with sbatch (not srun) on SciNet, which assigns N x 202GB of RAM, where N is the number of nodes. I guess I've just hit capacity for running metasnf on our HPC (5628 samples x 34 variables).
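
For scale, here's my rough back-of-envelope, assuming SNF works with dense n x n double-precision affinity matrices:

n <- 5628
n^2 * 8 / 1024^3  # ~0.24 GiB per dense affinity matrix, before any copies

Each run holds several such matrices (one per data type plus the fused network and intermediates), and every parallel worker keeps its own set, so many concurrent processes could plausibly exhaust a 202GB node.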

Sorry about that. Thank you very much for your time!

pvelayudhan commented 3 months ago

Phew, that is good to hear. No worries and thanks so much for flagging this!