artpol84 / slxc

SLURM in Linux Containers
BSD 3-Clause "New" or "Revised" License

Question #3

Open thistleknot opened 3 years ago

thistleknot commented 3 years ago

Is it possible to use this across multiple nodes (rather than just one node)?

I have 3 nodes: one with 8 cores and two with 4 cores.

What I'd like to do is set up slurmctld so that it talks to LXC containers running slurmd across 4 containers (limited to 4 cores each) on these 3 nodes, for a total of 16 cores.

I've read of a way to manually create a cgroups directory, but when I run jobs the node goes down. I'm not sure what you are doing to resolve the cgroups issue.

My OS of choice is Oracle Linux (a RHEL compatible flavor)

artpol84 commented 3 years ago

Thanks for your interest, @thistleknot. This project was created as a development tool for Slurm, which is what I was primarily using it for, so the emphasis was on a single node. What is your goal here? Do you want to work with containers? If containers are not a must, then Slurm has a multi-host feature (I don't remember the name precisely). I have an example slurm.conf that configures it: https://github.com/artpol84/poc/blob/master/slurm/multihost_conf/slurm.conf. Note the line

NodeName=cn[1-16] NodeAddr=localhost Port=32221-32236 CPUs=4 

which I believe enables a single node to host 16 slurmd processes. You can have multiple such lines, and that will allow you to achieve what you are looking for.
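
For your layout (one 8-core and two 4-core machines) a rough sketch could look like the lines below; the host names are placeholders, and each slurmd instance listens on its own port:

NodeName=cn[1-2] NodeAddr=host1 Port=32221-32222 CPUs=4
NodeName=cn3 NodeAddr=host2 Port=32223 CPUs=4
NodeName=cn4 NodeAddr=host3 Port=32224 CPUs=4
PartitionName=main Nodes=cn[1-4] Default=YES State=UP

Each virtual node is then started on the corresponding host with slurmd -N cn<X> so that it picks up its own line from the config.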

Unfortunately, we are not currently developing this project (though we are still using it), so adding a multi-node configuration won't be on the to-do list.

thistleknot commented 3 years ago

I thought about using LXC clustering to achieve this.

artpol84 commented 3 years ago

If you happen to extend this project to multiple nodes, I'll be happy to integrate that here. If I were doing this now (I started it in 2014), I'd probably have opted for Docker, as it is much more stable and portable. We actually have a half-baked Docker solution here: https://github.com/artpol84/slurm-pmix-test

But I'm not sure it's efficient enough as I'm not an expert with Docker.

thistleknot commented 3 years ago

You asked earlier what my intention was. I'm leaning towards containers because, if I can get it to work with them, I can then host from any node while isolating resources down to a granular level.

The nice thing about LXC clustering is that, from whatever node you run lxc ls -a, it acts as if everything is on a single machine, which means the containers can be hosted across nodes. I haven't really used it yet, but I thought about moving my LXC storage pool to a distributed volume (like GlusterFS) and then running the containers from that, so I'm not choking my head node with data throughput. Anyway, that's neither here nor there, but the plan was to use the clustering solution: I could have my containers across nodes, but using your software it would still look like everything was just on the head node.

artpol84 commented 3 years ago

Out of curiosity, have you managed to get SLXC working for you? It's a bit tricky to set up initially, but it works pretty stably afterwards.

artpol84 commented 3 years ago

I mean in a single-node installation.

thistleknot commented 3 years ago

Soon. I plan on getting it up this week.

thistleknot commented 3 years ago

I was hoping to use the RPM-provided installation of slurmctld, which installs slurmctld in /usr/sbin,

but your guide refers to $SLURM_PATH/var and $SLURM_PATH/etc,

so it seems you are suggesting I use a compiled installation with an /opt prefix defined instead?

Normally, if I install slurm-slurmctld, the configuration ends up under /etc/slurm/.

thistleknot commented 3 years ago

https://www.thegeekdiary.com/how-to-install-an-rpm-package-into-a-different-directory-in-centos-rhel-fedora/ rpm -ivh --prefix=/opt rsync-2.5.7-5.3E.i386.rpm
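
For the Slurm packages that would look something like the lines below (package names are just examples); note that --prefix only works if the package was built relocatable, which rpm -qpi reports in its Relocations field:

rpm -qpi slurm-slurmctld-*.rpm | grep -i relocations
rpm -ivh --prefix=/opt/slurm slurm-*.rpm slurm-slurmctld-*.rpm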

thistleknot commented 3 years ago

I'm using the snap version of LXD, which makes things a bit difficult:

/var/lib/snapd/snap/lxd/

For one, I don't have a dnsmasq.conf to edit.
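
As far as I can tell, with the snap LXD the dnsmasq settings go through the managed bridge instead of a config file, something like the following (the bridge name and dnsmasq option are just examples):

lxc network set lxdbr0 raw.dnsmasq "dhcp-option=option:domain-search,cluster.local"
lxc network show lxdbr0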

thistleknot commented 3 years ago

dang it ./root/slxc/~/munge-0.5.11-3.el7.x86_64.rpm

To do this I'd have to download and build the latest versions of munge and slurm.

thistleknot commented 3 years ago

I need better instructions. When I attempt to compile munge and slurm with --prefix, I can build munge, but that doesn't install munge-devel, and when I go to compile slurm it wants munge-devel.

But I can't build/install munge-devel without rpm-build, and when I attempt an rpm-build with ./configure --prefix=/opt/munge.xxx it fails to build with:


/usr/bin/ld: unmunge-xsignal.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
make[3]: *** [unmunge] Error 1
make[3]: *** Waiting for unfinished jobs....
/usr/bin/ld: remunge-remunge.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: remunge-xgetgr.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: remunge-xgetpw.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: remunge-xsignal.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
make[3]: *** [remunge] Error 1
/usr/bin/ld: munge-munge.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: munge-read.o: relocation R_X86_64_32 against `.rodata.str1.1' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: munge-xgetgr.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: munge-xgetpw.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: munge-xsignal.o: relocation R_X86_64_32 against `.rodata.str1.8' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status
make[3]: *** [munge] Error 1
make[3]: Leaving directory `/root/rpmbuild/BUILD/munge-0.5.14/src/munge'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/root/rpmbuild/BUILD/munge-0.5.14/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/root/rpmbuild/BUILD/munge-0.5.14'
make: *** [all] Error 2
error: Bad exit status from /var/tmp/rpm-tmp.lvcsRB (%build)

RPM build errors:

artpol84 commented 3 years ago

I was hoping to use rpm provided installations of slurmctld which installs slurmctld in /usr/sbin but your guide details $SLURM_PATH/var $SLURM_PATH/etc so it seems you are suggesting I use a compiled installation with an opt path defined instead? because normally if I install slurm-slurmctld the etc falls under /etc/slurm/

Again, the point of this project was to allow me and my team to develop for Slurm. In this situation you only want to build Slurm from sources.

The munge version I was able to use successfully is 0.5.11; see the main project readme:

Install Munge in MUNGE_PATH (under someuser). NOTE! that munge-0.5.11 
has problems with user-defined prefix installation 
(see https://code.google.com/p/munge/issues/detail?id=34 for the details). 
In the mentioned issue report you may find the patch that temporarily fixes this problem.
Or you can use more recent versions that have this problem fixed.

Note that the above version was patched, but the link seems to be invalid now and I don't remember what the issue was back then. I did report it to the munge developer, though, and I believe it was fixed.

I was building from sources: not from the src.rpm, but from the tarball obtained from the
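
For reference, a minimal from-source installation along the lines the readme assumes would look roughly like this (versions and paths are placeholders):

export MUNGE_PATH=/opt/munge
export SLURM_PATH=/opt/slurm

cd munge-<version>
./configure --prefix=$MUNGE_PATH --sysconfdir=$MUNGE_PATH/etc --localstatedir=$MUNGE_PATH/var
make && make install

cd ../slurm-<version>
./configure --prefix=$SLURM_PATH --sysconfdir=$SLURM_PATH/etc --with-munge=$MUNGE_PATH
make && make install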

jelmd commented 1 year ago

Wondering how resource control works in these containers. At least the cgroup stuff does not seem to work:

[2023-09-29T14:23:54.990] launch task StepId=60.0 request from UID:101 GID:10 HOST:10.3.0.64 PORT:44418
[2023-09-29T14:23:54.990] task/affinity: lllp_distribution: JobId=60 implicit auto binding: cores, dist 8192
[2023-09-29T14:23:54.990] task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic 
[2023-09-29T14:23:54.990] task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [60]: mask_cpu, 0x0000100001
[2023-09-29T14:23:55.136] [60.0] error: common_file_write_uints: write value '2495' to '/sys/fs/cgroup/cgroup.procs' failed: Device or resource busy
[2023-09-29T14:23:55.137] [60.0] error: Unable to move pid 2495 to init root cgroup /sys/fs/cgroup
[2023-09-29T14:23:55.137] [60.0] done with job

Any ideas how to fix this?

artpol84 commented 1 year ago

Hi, @jelmd

Thank you for your interest in the project. From the log you posted, it seems like the Slurm instance running inside the container (meaning already inside a cgroup) is trying to use cgroups to restrict the job's processes. Is that correct?

If so, this requires nesting of cgroups, which appears to be supported based on the description below:

 Limiting the number of descendant cgroups
       Each cgroup in the v2 hierarchy contains the following files,
       which can be used to view and set limits on the number of
       descendant cgroups under that cgroup:

       cgroup.max.depth (since Linux 4.14)
              This file defines a limit on the depth of nesting of
              descendant cgroups.  A value of 0 in this file means that
              no descendant cgroups can be created.  An attempt to
              create a descendant whose nesting level exceeds the limit
              fails (mkdir(2) fails with the error EAGAIN).

              Writing the string "max" to this file means that no limit
              is imposed.  The default value in this file is "max" .

       cgroup.max.descendants (since Linux 4.14)
              This file defines a limit on the number of live descendant
              cgroups that this cgroup may have.  An attempt to create
              more descendants than allowed by the limit fails (mkdir(2)
              fails with the error EAGAIN).

              Writing the string "max" to this file means that no limit
              is imposed.  The default value in this file is "max".
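
To check whether these limits are actually in the way, one could inspect the corresponding files inside the container (assuming cgroup v2 mounted at /sys/fs/cgroup), something like:

cat /sys/fs/cgroup/cgroup.max.depth
cat /sys/fs/cgroup/cgroup.max.descendants

A value of "max" means no limit; a numeric value can be raised by writing to the file (e.g. echo max > /sys/fs/cgroup/cgroup.max.depth).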

The EAGAIN error is likely to translate to the "Device or resource busy" that you see in your log.

We've not seen this issue because, for the Slurm development purposes SLXC was created for, we were not touching cgroups and the cgroup plugin was disabled.
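
For reference, disabling the cgroup plugins in slurm.conf inside the container would look roughly like this (the exact plugin choice depends on what level of tracking you still want):

ProctrackType=proctrack/linuxproc
TaskPlugin=task/affinity
JobAcctGatherType=jobacct_gather/linux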

P.S. If you don't mind sharing your experience, what is your use case for slxc?

jelmd commented 1 year ago

Hi @artpol84,

Thanks for your answer. I did some more experiments and found out that it seems to work more or less if the right settings are made. But the documentation is so shallow and confusing (and sometimes IMHO even wrong)...

I guess the mentioned errors are really cleanup bugs in v23.2.5: I found out that they get triggered when the job finishes, and that the related job cgroups (like /sys/fs/cgroup/system.slice/${nodename}_slurmstepd.scope/job_82) never get cleaned up; they stay in the system forever (I guess because of the failed move).
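
A crude manual cleanup (only sensible when no jobs are running on the node) would probably be to remove the leftover cgroup directories depth-first, e.g. something like:

find /sys/fs/cgroup/system.slice/*_slurmstepd.scope -mindepth 1 -depth -type d -exec rmdir {} + 2>/dev/null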

Use case: actually I do everything using LXCs (no Docker nonsense). Especially for our DL users we create projects and dedicate 1+ LXCs to each project with the appropriate number of GPUs (as needed / as available on the bare metal). This works really well and users are happy; however, sometimes students do not use their LXCs 24/7, and others would like to have some more GPUs available from time to time. So the idea came up to give Slurm a try. Unfortunately isolation is an issue (and the current level of comfort as well), so I'm trying to dig a little bit deeper ...