aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.
https://github.com/aws/aws-parallelcluster
Apache License 2.0
832 stars 312 forks

could ram_free be added to the qconf parameters? #308

Closed jprobichaud closed 4 years ago

jprobichaud commented 6 years ago

I couldn't find that information in the documentation; if this is a case of RTFM, let me know!

For many memory-intensive scripts, it is useful to be able to specify the amount of free RAM that must be available on an SGE compute node before it accepts a job. I managed to sudo su sgeadmin, set up the shell variables, and issue qconf -mc to add the line ram_free ram_free MEMORY <= YES YES 1G 0, and now I will have to issue

qconf -me $compute_node_name

to set the available memory there.
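The interactive qconf -mc step above could probably be scripted. The following is a minimal sketch, not a tested ParallelCluster recipe: it assumes SGE_ROOT is already set, that the caller is an SGE manager such as sgeadmin, and that this SGE build supports `qconf -sc` (dump the complex list) and `qconf -Mc <file>` (load it back from a file). The helper name `add_ram_free_line` is made up for illustration.

```shell
#!/bin/sh
# Sketch: add the ram_free complex without the interactive editor of `qconf -mc`.
# Assumptions: SGE_ROOT is set and the caller is an SGE manager (e.g. sgeadmin).

# Complex definition taken from the post above:
# name shortcut type relop requestable consumable default urgency
RAM_FREE_LINE='ram_free ram_free MEMORY <= YES YES 1G 0'

# Hypothetical helper: append the ram_free definition to a complex dump
# unless a ram_free entry is already present.
add_ram_free_line() {
    complex_file=$1
    if ! grep -q '^ram_free[[:space:]]' "$complex_file"; then
        printf '%s\n' "$RAM_FREE_LINE" >> "$complex_file"
    fi
}

# Dump the current complex list, patch it, and load it back.
# Skipped entirely when qconf is not on PATH.
if command -v qconf >/dev/null 2>&1; then
    tmp=$(mktemp)
    qconf -sc > "$tmp"
    add_ram_free_line "$tmp"
    qconf -Mc "$tmp"
    rm -f "$tmp"
fi
```

Once the complex exists and each execution host advertises a value for it, a job would request memory with something like `qsub -l ram_free=4G job.sh`.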

This isn't super easy to do at this point, especially with constantly changing hosts. Could we have a real solution for this?

FWIW: I got these instructions from this post in the kaldi forums

rajachan commented 6 years ago

@jprobichaud, that's a fair request. There isn't an easy way to do this with CfnCluster today as we just have a default installation of SGE which doesn't track compute node memory. For the time being, you could look into adding the ram_free configuration to a post_install script that would kick in every time a compute node is fired up.

I'll leave this ticket open for tracking this as a feature request.

jprobichaud commented 6 years ago

BTW, I tried to add the following commands inside the post_install.sh script, but unfortunately the hosts haven't yet been added as execution hosts when the post-install script is executed.

ram_free=$(grep MemTotal /proc/meminfo | awk '{print $2}' | perl -nle '$a=$_; $g = int($a /1024/1024); print $g,"G";')
su -c "SGE_ROOT=/opt/sge /opt/sge/bin/lx-amd64/qconf -mattr exechost complex_values ram_free=$ram_free,exclusive=true $(hostname)" sgeadmin

This generates an error that looks like:

denied: host "ip-172-31-17-49.us-west-2.compute.internal" is neither submit nor admin host
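One possible workaround is to retry the update until the scheduler accepts it. The sketch below rests on an unverified assumption: that the compute node is eventually registered with the SGE master some time after post_install runs, so that the same qconf call later succeeds. The function names `ram_free_value`, `retry`, and `set_ram_free` are hypothetical; the qconf command line and the /opt/sge paths are copied from the attempt above.

```shell
#!/bin/sh
# Sketch: retry the exechost update until the SGE master accepts this host.
# Assumption: the node gets registered with the master at some later point.

# Print total memory in whole gigabytes with a G suffix (truncated, not rounded).
# $1 is a meminfo-style file; defaults to /proc/meminfo.
ram_free_value() {
    awk '/^MemTotal:/ { printf "%dG\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Run a command up to $1 times, sleeping $2 seconds after each failure.
# Returns 0 as soon as the command succeeds, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}

# Hypothetical wrapper around the qconf call from the post above.
set_ram_free() {
    su -c "SGE_ROOT=/opt/sge /opt/sge/bin/lx-amd64/qconf -mattr exechost complex_values ram_free=$(ram_free_value),exclusive=true $(hostname)" sgeadmin
}

# Example call from post_install.sh: up to 60 attempts, 10 seconds apart.
# retry 60 10 set_ram_free
```

Because post_install may block node setup while it runs, the retry call would likely need to be started in the background (for example with `nohup ... &`) so that the node can finish joining the cluster.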

nstornetta commented 4 years ago

Because we have announced that we will be deprecating support for SGE in the near future (see https://github.com/aws/aws-parallelcluster/wiki/Deprecation-of-SGE-and-Torque-in-ParallelCluster), we will not be making additional enhancements specific to SGE.

I am going to close this issue. If you would like to request a similar enhancement for one of our other supported schedulers (Slurm or AWS Batch), please feel free to create a new issue.