dattalab / moseq2-pca

Code for computing PCs from extracted data to submit for modeling

Running PCA on o2 cluster #2

Closed wingillis closed 6 years ago

wingillis commented 6 years ago

The command I ran: moseq2-pca train-pca -i _aggregate_results/ --cluster-type slurm --missing-data --missing-data-iters 15 -q short -n 20

The error during execution of the program:

Traceback (most recent call last):
  File "/home/wg41/miniconda2/envs/mo2/lib/python3.6/site-packages/tornado/ioloop.py", line 1208, in _run
    return self.callback()
  File "/home/wg41/miniconda2/envs/mo2/lib/python3.6/site-packages/distributed/client.py", line 850, in _heartbeat
    self.scheduler_comm.send({'op': 'heartbeat'})
  File "/home/wg41/miniconda2/envs/mo2/lib/python3.6/site-packages/distributed/batched.py", line 106, in send
    raise CommClosedError
distributed.comm.core.CommClosedError

Have you ever run PCA on O2 before? Or just GCE? The program output has been sitting at 2% complete for the last 10 minutes. I'm guessing this is not normal, considering the speedups normally seen.
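(For context: CommClosedError is raised when the Dask client's periodic heartbeat tries to write to a scheduler connection that has already been torn down. A minimal stdlib sketch of the same failure mode, with a plain TCP socket standing in for the Dask comm:)

```python
import socket
import time

# Server side: accept one connection, then drop it immediately,
# like a scheduler that has gone away.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

cli = socket.socket()
cli.connect(srv.getsockname())
conn, _ = srv.accept()
conn.close()                # remote end closes the connection

cli.send(b"heartbeat")      # often still "succeeds": data is buffered locally
time.sleep(0.2)             # give the kernel time to receive the RST
try:
    cli.send(b"heartbeat")  # the next write on the dead connection fails,
    failed = False          # analogous to the heartbeat raising above
except OSError:             # BrokenPipeError / ConnectionResetError
    failed = True

cli.close()
srv.close()
```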

jmarkow commented 6 years ago

Yeah, I haven't gotten this to run efficiently on O2 (local obviously works fine, i.e. single node). Everything is much faster on GCE; I recommend using that until we get to the bottom of why communication is unreliable on O2.

jmarkow commented 6 years ago

As a test, maybe request nodes connected by InfiniBand. I think you can request this on O2, but I don't remember the commands.

wingillis commented 6 years ago

Great, I’ve already begun the transfer. I also added a note to the docs for other people.



jmarkow commented 6 years ago

Sounds good. We could also present the issue to the Dask folks, they're good at debugging issues quickly...though this might be specific to O2. I've gotten this to work sporadically on O2 and maybe it's related to the network latency between the scheduler and workers (raise with O2 folks?).

I haven't heard from anyone else trying this on Slurm yet.
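If latency is the suspect, one quick way to measure it independently of Dask is a plain TCP round-trip check from an allocated compute node to the scheduler node (a hypothetical helper, not part of moseq2-pca):

```python
import socket
import time

def tcp_connect_rtt(host, port, attempts=3):
    """Average TCP connect round-trip time to host:port, in seconds."""
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        # Each connect performs a full TCP handshake, so its duration
        # approximates one network round trip to the target host.
        with socket.create_connection((host, port), timeout=5):
            samples.append(time.perf_counter() - start)
    return sum(samples) / len(samples)
```

Point it at the scheduler's host and port (Dask's default scheduler port is 8786); consistently multi-millisecond round trips between nodes would support the latency theory.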

jmarkow commented 6 years ago

I don't see anything on O2's wiki about specifying InfiniBand as a constraint, though Orchestra let you do this:

https://qa.rc.hms.harvard.edu/questions/11/does-the-cluster-have-infiniband

You could try adding this to sbatch:

--constraint=IB

which seems to be the standard way of specifying the network interface, AFAIK:

https://hpcc.usc.edu/support/documentation/slurm/
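For example, a batch script requesting InfiniBand-connected nodes might look like the following (a sketch only; the feature name "IB" is an assumption, and feature names vary by cluster, so check `sinfo -o "%N %f"` for what O2 actually exposes):

```shell
#!/bin/bash
#SBATCH --partition=short          # partition used in the original command
#SBATCH --constraint=IB            # request nodes tagged with the IB feature
#SBATCH --ntasks=20
#SBATCH --time=01:00:00

# Re-run the training step on the allocated nodes.
moseq2-pca train-pca -i _aggregate_results/ --cluster-type slurm \
    --missing-data --missing-data-iters 15 -q short -n 20
```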

jmarkow commented 6 years ago

I also assume this is working now with the last batch of improvements. Feel free to reopen.