I am trying to use spearmint on an Amazon cluster, but I have run into a problem that I can't fix.
I am using the example branin.py that I found in the distributed directory.
I changed SLURM to SGE in the configuration file and -j to -j y. I compiled spearmint and launched the example. I am using a test cluster with two nodes:
qstat -f
queuename        qtype  resv/used/tot.  load_avg  arch       states
all.q@master     BIP    0/0/2           0.01      linux-x64
all.q@node001    BIP    0/0/2           0.01      linux-x64
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
195  0.55500  branin-dis  root  Eqw  11/30/2015 20:42:10  1
197  0.55500  branin-dis  root  Eqw  11/30/2015 20:42:11  1
198  0.55500  branin-dis  root  Eqw  11/30/2015 20:42:19  1
As you can see, all jobs on node001 end in an error state (Eqw).
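The failed jobs can also be picked out of the listing programmatically by filtering on the state column. This is only a sketch: the helper name `jobs_in_error` is hypothetical, and it assumes the column order shown in the listing (job ID, priority, name, user, state, ...), treating any state that contains `E` as an error.

```python
def jobs_in_error(qstat_text):
    """Return (job_id, state) pairs for job lines whose state contains 'E'."""
    failed = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Job lines start with a numeric job ID; queue lines and banners do not.
        if len(fields) >= 5 and fields[0].isdigit():
            job_id, state = fields[0], fields[4]
            if "E" in state:
                failed.append((job_id, state))
    return failed


# Job lines copied from the listing above.
sample = """\
all.q@node001 BIP 0/0/2 0.01 linux-x64
195 0.55500 branin-dis root Eqw 11/30/2015 20:42:10 1
197 0.55500 branin-dis root Eqw 11/30/2015 20:42:11 1
198 0.55500 branin-dis root Eqw 11/30/2015 20:42:19 1
"""
print(jobs_in_error(sample))
```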
When I look at them in a little more detail, I see:
qname        all.q
hostname     node001
group        root
owner        root
project      NONE
department   defaultdepartment
jobname      branin-distributed-example-00000005
jobnumber    198
taskid       undefined
account      sge
priority     0
qsub_time    Mon Nov 30 20:42:19 2015
start_time   -/-
end_time     -/-
granted_pe   NONE
slots        1
failed       26 : opening input/output file
exit_status  0
... ...
The input/output error occurs because spearmint is trying to read and write on the node001 filesystem. So I tried to change SGE.py, following the documentation, to:
return 'qsub -S /bin/bash -e master:%s -o master:%s -j y -N %s' % (output_file, output_file, job_name)
where I specify the node on which to read and write. The code still does not work. Any ideas?
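To make it easier to inspect exactly what gets submitted, the command string can be reproduced outside spearmint. This is a minimal sketch: `build_qsub_command` and its `host` parameter are hypothetical names, not part of spearmint, and the output path in the example is made up. The flags are the ones from the line above. One assumption worth verifying against the qsub man page: as far as I understand it, a `hostname:` prefix on `-o`/`-e` selects the path used when the job runs on that host, rather than redirecting output to that host's filesystem.

```python
def build_qsub_command(output_file, job_name, host=None):
    """Assemble the qsub command string; `host` is an optional path prefix."""
    # With host=None the paths are passed as-is; with host="master" the
    # result matches the modified line above (host:path for -e and -o).
    path = output_file if host is None else "%s:%s" % (host, output_file)
    return "qsub -S /bin/bash -e %s -o %s -j y -N %s" % (path, path, job_name)


# Hypothetical output path, used only for illustration.
cmd = build_qsub_command(
    "/tmp/out.txt", "branin-distributed-example-00000005", host="master"
)
print(cmd)
```

Printing the string and running it by hand from the master node is one way to see whether the submission itself fails or only the output redirection does.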