Conte-Ecology / conteStreamTemperature

Package for cleaning and analyzing daily stream temperature data
MIT License

Running Model Not Working: Parallel Issues? #31

Closed djhocking closed 9 years ago

djhocking commented 9 years ago

I tried to execute the run_model.R code from the northeast analysis repo, but R just hangs indefinitely without doing anything. I've tried on osensei, felek, a UNH computer, and my laptop. On the UNH computer, memory and CPU are never allocated to the socket workers. I tried just running CL <- makeCluster and even that hangs. The code works in serial without the parallel processing, but that will take forever.

On my laptop, the memory is at least allocated and the three R instances are using all the CPU, so maybe it's working. But there's still a problem: I only did 10 burn-in and 10 sampling iterations, and it's been running for 5 hours.

I don't know how to monitor felek or osensei to diagnose the problems; they just run forever without producing anything, so there's clearly an issue. However, the output shows that JAGS and coda get loaded, so it should be past the makeCluster stage. I have no idea why this hangs forever on the servers.
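One way to see whether the workers ever start is to redirect their output to the master console. This is a minimal diagnostic sketch (not from the repo) using the base `parallel` package; the `outfile = ""` argument makes worker messages visible instead of discarding them:

```r
library(parallel)

# Diagnostic sketch: start a small socket cluster with worker output
# sent to the master R console so a silent hang is easier to localize.
cl <- makeCluster(3, outfile = "")

# If makeCluster returns at all, confirm the workers respond.
print(clusterEvalQ(cl, Sys.getpid()))

stopCluster(cl)
```

If this hangs before printing the worker PIDs, the problem is in cluster setup (networking, permissions, firewall) rather than in the JAGS code.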

I wish I were better about using and testing with git, because I don't know what to even revert to. I have no idea what code has changed since this last worked (nothing in the JAGS wrapper or parallel code, as far as I know).

djhocking commented 9 years ago

Strange. osensei hung over the past weekend, producing nothing. I tested on my MacBook, felek, and the UNH Windows 7 machine. It took a long time but worked on all except the UNH machine, which still hangs at the makeCluster call. Strange that it didn't have a problem previously, since that's the machine where I did the full regional model runs prior to setting up osensei.

- MacBook: 6.4 hours for 20 iterations on each of three cores (10 burn-in, 10 sample iterations)
- felek: 3.6 hours for 110 iterations on each of three cores (10 burn-in, 100 samples)
- osensei: 12.9 hours for 11,000 iterations on each of three cores (10,000 burn-in, 1,000 samples)
- osensei: 1.3 hours for 11 iterations on each of three cores (1 burn-in, 10 samples)
- UNH Windows: NA - hangs when running makeCluster

Clearly this isn't linear in the number of iterations. The initialization must be the slowest step.
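The two osensei runs support this. A back-of-envelope fit of time = overhead + rate × iterations (my own calculation, not from the repo) puts the fixed startup cost near 1.3 hours:

```r
# Fit time = overhead + rate * iterations to the two osensei runs:
# 12.9 h for 11,000 iterations and 1.3 h for 11 iterations.
rate     <- (12.9 - 1.3) / (11000 - 11)   # hours per iteration
overhead <- 1.3 - rate * 11               # fixed startup cost in hours
round(c(overhead = overhead, rate = rate), 4)
```

On these numbers the per-iteration cost is only about 0.001 hours, so nearly all of the short run's 1.3 hours is initialization.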

djhocking commented 9 years ago

Since I will now be running this exclusively on osensei for the foreseeable future, I can ignore the makeCluster problem. There are lots of things that can go wrong when setting up the socket workers, so it's difficult to diagnose quickly, especially since I don't know how sockets work.

Since Unix machines don't need sockets, which require much more memory, AND we're going to add a lot more data to expand south to Virginia (already over 4 million nodes on the 70% calibration data for New England), it might be worth using makeForkCluster or makeCluster(..., type = "FORK"). There is a nice introduction to parallel processing at http://gforge.se/2015/02/how-to-go-parallel-in-r-basics-tips/. This will work on Linux machines and Macs and will use considerably less memory. It could be implemented within a switch call to keep the code as portable and reproducible as possible. Something like:

switch(Sys.info()[['sysname']],
       Windows = { print("I'm a Windows PC.") },
       Linux   = { print("I'm a penguin.") },
       Darwin  = { print("I'm a Mac.") })

But with the OS-specific cluster setup code in place of (or in addition to) the print messages.
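That idea could be sketched as follows, assuming the base `parallel` package (the core count is illustrative; makeForkCluster only exists on Unix-alikes, so Windows falls back to sockets):

```r
library(parallel)

n_cores <- 3

# Forked workers share the master's memory pages, so they are much
# cheaper than socket workers; Windows has no fork(), hence PSOCK there.
cl <- switch(Sys.info()[['sysname']],
             Windows = makePSOCKcluster(n_cores),
             Linux   = makeForkCluster(n_cores),
             Darwin  = makeForkCluster(n_cores))

# ... run the JAGS chains on cl here ...

stopCluster(cl)
```

This keeps a single script runnable on all three platforms while getting the memory savings of forking where it is available.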

djhocking commented 9 years ago

It turns out there was a write-permissions issue on the Windows machine dating from the last update. It's still worth using makeForkCluster for efficiency on the Unix-based machines.

walkerjeffd commented 9 years ago

Sounds like this could very well be a permissions issue on any of the systems. You might try setting r/w/x permissions on all files and subdirectories of your working directory, like this on linux/mac:

chmod -R u+rwx working_directory/

The -R is for recursive. But it's possible the process is also trying to read/write other files elsewhere on the system, which could be tough to track down.

djhocking commented 9 years ago

Yeah, it worked when I went into the properties for R and RStudio and changed the permissions for the user. I may have to try chmod on the working directory for some packrat and devtools issues I'm still having. Thanks for the tip.