Broken Jobs on Distributed Example Using Starcluster

ianmont commented 9 years ago

Hi, I am trying to get the distributed example working on an AWS cluster that I started with StarCluster. I'm using the StarCluster Ubuntu 12.04 AMI and have installed numpy, mongodb, pymongo and spearmint on both nodes. I am using SGE, which seems to work fine for the simple jobs I've tried on it. The simple spearmint example runs correctly on both nodes. I also changed the line of code mentioned by peterjsadowski from -j to -j y in the qsub call in SGE.py. I created the log file and db directory and ran mongod --fork --logpath log --dbpath db on the master node.

In the distributed example I changed "SLURM" to "SGE". When I run the example on the master node, all jobs that run on the master node seem to execute fine; however, the jobs that are supposed to run on the other node give an error like: EXC: <class 'drmaa.errors.InvalidJobException'> Could not find job for process id 335 Broken job 85 detected.

Also, all the output files for the failed jobs are blank, while the rest of the output looks fine.

I tried running the example on the other node instead, and the problem is just reversed: a similar error on the jobs meant for the master node, with the rest working fine.

Any idea what is going on here? Thanks
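One thing that may be worth ruling out in a setup like the one above, where mongod runs only on the master node, is whether every node can reach the database over the network. This is only a guess at a possible cause, not a confirmed diagnosis. The sketch below is a generic TCP reachability check using the Python standard library; it is not part of Spearmint, and the function name can_connect is made up for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, OSError):
        # Covers refused connections, timeouts and name-resolution failures.
        return False
    sock.close()
    return True
```

Running something like can_connect("master", 27017) on each node would show whether the mongod instance is reachable from there. Here "master" is a placeholder host name (an assumption), and 27017 is mongod's default port, which applies only if mongod was started without --port.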

JasperSnoek commented 9 years ago

Hi @ianmont, have you checked the output files produced by the jobs on the other cluster machines? It seems like the jobs are crashing, which could happen for a number of reasons (possibly an issue in the code). The output files are in /output/

Jasper
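As a rough aid for the check suggested above, here is a minimal sketch that lists which files in a given output directory are empty, so the blank outputs can be matched against the job ids reported as broken. It uses only the standard library and is not part of Spearmint; the function name find_empty_files is hypothetical, and the directory to scan has to be supplied by the caller.

```python
import os


def find_empty_files(directory):
    """Return the sorted paths of regular files in `directory` that are zero bytes long."""
    empty = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # Skip subdirectories; only report plain files with no content.
        if os.path.isfile(path) and os.path.getsize(path) == 0:
            empty.append(path)
    return empty
```

Calling find_empty_files on the experiment's output directory returns the blank files; a non-empty file for a failed job would be the first place to look for a traceback.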


Tokukawa commented 9 years ago

I have the same problem. How was it fixed?

HIPS / Spearmint

Broken Jobs on Distributed Example Using Starcluster #28