HPCNow / slurm_simulator

Dockerfiles to create a slurm simulator
GNU General Public License v3.0

Cannot use a vanilla, out-of-the-box simulator instance #1

Open marcodelapierre opened 1 year ago

marcodelapierre commented 1 year ago

Hi team, @jordiblasco,

thanks for this utility, it looks very promising!

I was reading through your blog post, to try and spawn a vanilla instance of the simulator on a Linux VM I use. I followed the prompts in the blog:

$ docker run --rm --detach \
           --name "${USER}_simulator" \
           -h "slurm-simulator" \
           --security-opt seccomp:unconfined \
           --privileged -e container=docker \
           -v /run -v /sys/fs/cgroup:/sys/fs/cgroup \
           --cgroupns=host \
           hpcnow/slurm_simulator:20.11.9 /usr/sbin/init

$ docker exec -ti "${USER}_simulator" /bin/bash

# sinfo

and tested it with three different Slurm versions: 20.11.9, 22.05.2, and 23.02.4.

The last command in the snippet above, sinfo, always gives me an error:

# versions 20.11.9 and 22.05.2
slurm_load_partitions: Unable to contact slurm controller (connect failure)

# version 23.02.4
sinfo: error: Ignoring BackupController since SlurmctldHost is set.
sinfo: error: get_addr_info: getaddrinfo() failed: Name or service not known
sinfo: error: slurm_set_addr: Unable to resolve "slurm01"
sinfo: error: Unable to establish control machine address
slurm_load_partitions: No error
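
The 23.02.4 output reports that the name "slurm01" cannot be resolved, while the container was started with the hostname "slurm-simulator". A minimal sketch for checking name resolution from inside the container follows. The helper name `check_host` is hypothetical, and the sketch assumes `getent` is available in the image.

```shell
#!/usr/bin/env bash
# Sketch: report whether a hostname resolves through the system resolver (NSS).
# check_host is a hypothetical helper, not part of Slurm or the simulator image.
check_host() {
    local name="$1"
    if getent hosts "$name" >/dev/null 2>&1; then
        echo "$name: resolves"
    else
        echo "$name: does not resolve"
        return 1
    fi
}
```

Running `check_host slurm01` and `check_host "$(hostname)"` inside the container would show which of the two names the resolver knows about.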

What am I doing wrong? Could you provide some concise guidance to get a minimal working setup?

Thank you in advance, Marco

marcodelapierre commented 1 year ago

Note I was doing my tests based on your blog post at https://hpckp.org/articles/how-to-use-the-slurm-simulator-as-a-development-and-testing-environment/

marcodelapierre commented 1 year ago

Today I have found a couple of commands to run after the very first login into the running container with the Slurm simulator (they make sense -- daemon services need to be started):

systemctl start slurmctld
systemctl start slurmd
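
As a slightly more defensive variant of the two commands above, the following sketch starts a unit only when systemd does not already report it as active. The function name `ensure_service` is hypothetical; the unit names are the ones used above.

```shell
#!/usr/bin/env bash
# Sketch: start a systemd unit only if it is not already active.
# ensure_service is a hypothetical helper name.
ensure_service() {
    local unit="$1"
    if systemctl is-active --quiet "$unit"; then
        echo "$unit: already active"
    else
        echo "$unit: not active, starting"
        systemctl start "$unit"
    fi
}
```

Inside the container this could be called as `ensure_service slurmctld` followed by `ensure_service slurmd`, and then `sinfo` to confirm that the controller answers.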

This page gave me the hint: https://drtailor.medium.com/how-to-setup-slurm-on-ubuntu-20-04-for-single-node-work-scheduling-6cc909574365

It would probably be good if this could be double-checked and added to your original blog post, to enable out-of-the-box tests.

jordiblasco commented 1 year ago

Hi @marcodelapierre ,

Thank you for bringing this to my attention. The images are designed to start the required services via systemd. If they are not starting, it could be for one of several reasons.

Can you describe the working environment (OS distro and version, Docker version, and whether Docker comes from the official repo or from the Linux distribution)? Have you followed the instructions provided in the official Docker documentation? https://docs.docker.com/engine/install/
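
The details requested above can be collected on the Docker host with a short script such as the sketch below. The function name `report_env` is hypothetical. The cgroup line is included only as an assumption that it may be relevant, since the run command mounts /sys/fs/cgroup; the `CgroupVersion` field requires a reasonably recent Docker release.

```shell
#!/usr/bin/env bash
# Sketch: print OS, kernel, and Docker details useful for a bug report.
# report_env is a hypothetical helper name.
report_env() {
    # OS distribution and version, read in a subshell to avoid leaking variables.
    if [ -r /etc/os-release ]; then
        ( . /etc/os-release && echo "OS: ${PRETTY_NAME:-unknown}" )
    else
        echo "OS: unknown (no /etc/os-release)"
    fi
    echo "Kernel: $(uname -r)"
    if command -v docker >/dev/null 2>&1; then
        echo "Docker: $(docker --version)"
        # Assumption: cgroup version and driver might matter for systemd in containers.
        echo "Cgroup: $(docker info --format 'v{{.CgroupVersion}}, driver {{.CgroupDriver}}' 2>/dev/null)"
    else
        echo "Docker: not found in PATH"
    fi
}
```

Calling `report_env` on the host prints one line per item, which can be pasted into the issue. Whether Docker came from the official repository still has to be checked separately, for example with the distribution's package manager.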

BTW, the simulator dockerfiles used for creating the images are in our private Git repository.

Cheers,

Jordi

marcodelapierre commented 1 year ago

Thanks for getting back on this, Jordi. It is always a pleasure to chat with you (we met a couple of times in Perth, at an HPC/AI conference and at your Kubernetes training at Pawsey in 2020).

I am running these tests on an Ubuntu 22.04 virtual machine on our on-prem OpenStack infrastructure. The Docker version is 24.0.2, build cb74dfc. I am not sure how it was installed (it is part of our pre-canned image), but I can check with the team.

What do you reckon? Thank you