USGS-OWI / condor-examples

Examples of using HTCondor to scale scientific processing to a computer cluster

2015-09-07 Managing multiple linux worker nodes #3

Open lawinslow opened 9 years ago


Background

One of the challenges I've struggled with in many Condor pools is keeping multiple computers consistent. At many mid-sized institutions, you might have 5 or 10 or 50 individual HTCondor worker nodes. This is often too few to justify an enterprise configuration management system, but too many to administer easily one machine at a time.

I've implemented plenty of crappy solutions. The last one was a shell script that iterated over a host list and SSH-ed into each machine to run various commands. I knew there must be a better way.

Install and define

After digging into it a bit (and having about 20 new nodes to configure), I found pssh (parallel-ssh). This is exactly what I was looking for. You define a list of hosts in a text file, and it runs the same command on all of them in parallel.
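As a rough illustration, the hosts file is plain text with one entry per line in the form [user@]host[:port]; blank lines and lines starting with # are ignored. The node names below are made-up placeholders, not the real pool:

```shell
# Write a hosts file for parallel-ssh: one [user@]host[:port] entry per line.
# The node names are hypothetical placeholders.
cat > hosts.txt <<'EOF'
# HTCondor worker nodes
node01.example.org
node02.example.org
root@node03.example.org:22
EOF

# Count the entries, skipping comments and blank lines (prints 3 here)
grep -cv -e '^#' -e '^$' hosts.txt
```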

On Ubuntu, installing it is as easy as apt-get install pssh, which then shows up on your path as parallel-ssh. (On CentOS, use yum install pssh.)

Example Commands

Commands look like this.

parallel-ssh -h hosts.txt -l root -i -A -t 120 yum install -y netcdf-devel
parallel-ssh -h hosts.txt -l root -i -A -t 120 uptime

and so on. I'm using -h to point at the hosts file, -l to define the user, -i to print each host's output inline, -A to have it prompt for the password, and -t to raise the timeout to 120 seconds (2 minutes).
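Since the same flags repeat on every call, they could be wrapped in a small shell function. This is only a sketch, assuming the same hosts.txt and root login as in the commands above; pool_run and DRY_RUN are hypothetical names, not part of pssh:

```shell
# pool_run: hypothetical wrapper around the flags used above (not part of pssh).
#   -h hosts.txt  file listing the worker nodes
#   -l root       remote user
#   -i            print each host's output inline
#   -A            prompt for the password
#   -t 120        per-host timeout in seconds
pool_run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only show the command that would be run.
        echo "parallel-ssh -h hosts.txt -l root -i -A -t 120 $*"
    else
        parallel-ssh -h hosts.txt -l root -i -A -t 120 "$@"
    fi
}

# Dry run: prints the command instead of contacting any host
DRY_RUN=1 pool_run uptime
```

With DRY_RUN unset, pool_run uptime would run the real command against every host in hosts.txt.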

Note

I had some trouble with ssh at first. To get past the "accept this new SSH server identity" prompt, I used -O (which passes an option through to ssh) and the info I found here. There is further info in this blog post and on the original project itself.
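The exact option passed with -O isn't recorded here. One common choice for this prompt is StrictHostKeyChecking=no, shown below purely as an assumption; it turns off host-key verification, so it only makes sense on a network you trust. The sketch just prints the resulting command rather than running it:

```shell
# Assumption: StrictHostKeyChecking=no is a guess at the option used; the post does not say.
# -O hands one option through to ssh, written in ssh_config syntax.
SSH_OPT="StrictHostKeyChecking=no"
PSSH_CMD="parallel-ssh -h hosts.txt -l root -i -A -t 120 -O $SSH_OPT uptime"

# Dry run: print the command; paste the printed line to run it for real
echo "$PSSH_CMD"
```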

-Luke