HIPS / Spearmint

Spearmint Bayesian optimization codebase

Spearmint with starcluster #48

Open Tokukawa opened 8 years ago

Tokukawa commented 8 years ago

I am trying to use Spearmint on an Amazon cluster, but I am facing a problem that I can't fix. I am using the example branin.py that I found in the distributed directory. I changed SLURM to SGE in the configuration file and -j to -j y. I compiled Spearmint and launched the example. I am using a test cluster with two nodes.

qstat -f

queuename       qtype  resv/used/tot.  load_avg  arch       states
all.q@master    BIP    0/0/2           0.01      linux-x64
all.q@node001   BIP    0/0/2           0.01      linux-x64

 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
195 0.55500 branin-dis root Eqw 11/30/2015 20:42:10 1
197 0.55500 branin-dis root Eqw 11/30/2015 20:42:11 1
198 0.55500 branin-dis root Eqw 11/30/2015 20:42:19 1

As you can see, all jobs on node001 terminate in an error. When I look in a little more detail, I see

qacct -j 198

qname        all.q
hostname     node001
group        root
owner        root
project      NONE
department   defaultdepartment
jobname      branin-distributed-example-00000005
jobnumber    198
taskid       undefined
account      sge
priority     0
qsub_time    Mon Nov 30 20:42:19 2015
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       26 : opening input/output file
exit_status  0
... ...
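To pick the failure reason out of an accounting record like this one without reading it by eye, something along these lines might help. It is only a minimal sketch: it assumes the record is available as text with one key/value pair per line, the helper name parse_qacct is hypothetical, and the sample text is just a subset of the fields shown in this post.

```python
def parse_qacct(text):
    """Parse `key value` lines of a qacct-style record into a dict.

    Hypothetical helper: assumes one field per line, with the key
    separated from the value by whitespace.
    """
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            record[parts[0]] = parts[1].strip()
    return record


# A subset of the fields reported for job 198 in this post.
SAMPLE = """\
qname        all.q
hostname     node001
jobnumber    198
failed       26 : opening input/output file
exit_status  0
"""

record = parse_qacct(SAMPLE)
# The leading number of the `failed` field is the failure code.
failure_code = int(record["failed"].split()[0])
print(record["hostname"], failure_code, record["failed"])
```

A nonzero failure code together with an exit_status of 0 suggests the job failed before the script itself ran, which matches the "opening input/output file" message.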

The input/output error is due to the fact that Spearmint is trying to read and write in the node001 filesystem. So I tried to change SGE.py, according to the documentation, to

return 'qsub -S /bin/bash -e master:%s -o master:%s -j y -N %s' % (output_file, output_file, job_name)

where I specify the node on which to read and write. The code still does not work. Any idea?
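For reference, the modified command string can be reproduced in isolation with a small sketch like the one below. The function name build_qsub_command, the host parameter, and the example path are hypothetical and only mirror the line quoted above; the real function in SGE.py may be structured differently.

```python
def build_qsub_command(output_file, job_name, host=None):
    """Build the qsub command string, optionally prefixing a host.

    Hypothetical stand-in for the modified line in SGE.py: when `host`
    is given, the stdout/stderr path is written as host:path.
    """
    target = '%s:%s' % (host, output_file) if host else output_file
    return 'qsub -S /bin/bash -e %s -o %s -j y -N %s' % (target, target, job_name)


# The output path here is made up for illustration.
print(build_qsub_command('/path/to/output/00000005.out',
                         'branin-distributed-example-00000005',
                         host='master'))
```

Printing the string this way makes it easy to check exactly what is being passed to qsub before submitting a job.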