Azure / cyclecloud-pbspro

Example Azure CycleCloud PBSpro cluster type
MIT License

Modifying stack soft limit #17

Open Klaas- opened 4 years ago

Klaas- commented 4 years ago

Hi, I've noticed that CycleCloud recently changed its behavior for stack size limits. It now adds this:

$ cat /etc/security/limits.conf |grep stack
#        - stack - max stack size (KB)
*               hard    stack           unlimited
*               soft    stack           unlimited

However, I am not sure where this comes from. I can't find it in this repo, and as far as I could tell it is not from the CentOS HPC image (https://github.com/openlogic/AzureBuildCentOS).

In any case, if someone else stumbles over this: Abaqus, at least, does not accept unlimited as a soft limit.

Greetings Klaas

anhoward commented 4 years ago

Hi Klaas, I just double-checked and I can't find anywhere in our code that's making that change. I also checked on a vanilla Slurm cluster deployed with CycleCloud 7.9.3 and don't see the stack size change that you're seeing. When we do modify limits, we put those changes in /etc/security/limits.d/cyclecloud.conf, but the only modifications there are increasing the number of open files. Nothing to do with stack size. Could another package you're installing either via a cluster-init project or via a custom image be adding that?

Thanks, -Andy
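For anyone trying to locate such an entry, a minimal sketch of a search over the PAM limits files might look like the following. The function name `find_stack_limits` is hypothetical, and the default paths are just the two locations mentioned in this thread; pass other files or directories as arguments if your image keeps limits elsewhere.

```shell
#!/bin/bash
# Sketch: list every uncommented "stack" entry in the PAM limits files.
# find_stack_limits is a hypothetical helper name, not part of CycleCloud.
find_stack_limits() {
    # Default to the standard locations when no paths are given.
    if [ "$#" -eq 0 ]; then
        set -- /etc/security/limits.conf /etc/security/limits.d
    fi
    # -r recurses into directories, -H prints the file name, -n the line number.
    # The pattern skips lines starting with '#' and matches "stack" as its own column.
    grep -rHnE '^[^#]*[[:space:]]stack[[:space:]]' "$@" 2>/dev/null
}

find_stack_limits || echo "no uncommented stack entries found"
```

Each match is printed as `file:line:entry`, which shows whether the setting lives in limits.conf itself or in a drop-in file under limits.d.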

anhoward commented 4 years ago

Hi Klaas, I just realized after my last comment that this is the PBSpro repo, not Slurm (which is where I've spent most of my time lately). Sure enough, I can reproduce this with a fresh CycleCloud PBSpro cluster. I'll look through our recipes more closely, but I'm not aware of any changes we made to limits recently.

When you say the behavior "recently changed", do you know what version you upgraded from? If you were previously using a version with an older PBSpro installation, it's possible that their packages changed to increase the stack limit. The other possibility is that one of the dependency packages was updated to make this change.

One thing you could do as a workaround would be to set the stack limit explicitly in your job script. Just doing ulimit -s <int> will set the stack size lower than the hard limit. That may get your Abaqus jobs working again.
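As a rough sketch of that workaround, the lines below could go near the top of a job script. The helper name `cap_soft_stack` and the `STACK_KB` default of 8192 KB are assumptions for illustration, not values Abaqus or CycleCloud prescribe. Note that in bash, `ulimit -s <int>` without `-S` or `-H` sets both the soft and the hard limit, so the sketch uses `-S` to touch only the soft limit.

```shell
#!/bin/bash
# Sketch: cap the soft stack limit only when it is currently "unlimited".
# STACK_KB is a hypothetical placeholder; pick the value your application needs.
STACK_KB="${STACK_KB:-8192}"

cap_soft_stack() {
    # -S restricts both the query and the change to the soft limit,
    # leaving the hard limit as configured in limits.conf.
    if [ "$(ulimit -S -s)" = "unlimited" ]; then
        ulimit -S -s "$1"
    fi
}

cap_soft_stack "$STACK_KB"
echo "soft stack limit: $(ulimit -S -s)"
# ... launch the solver after this point ...
```

Because the change applies only to the current shell and its children, it affects the job itself and nothing else on the node.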

Klaas- commented 4 years ago

@anhoward I have not updated the CycleCloud version during the last ~2 months, which is why I think this comes from some content that is downloaded on the fly.

My last cyclecloud update:

Name        : cyclecloud
Version     : 7.9.2
Install Date: Thu 23 Jan 2020 10:25:08 AM UTC

I know how to work around the problem; the issue is more that this change seems to be a silent one. I am fairly sure my master install worked after the 7.9.2 update and stopped working a couple of days ago, when I tried out the HB120v2 machines. This first led me to believe the issue was related to the machine type, until I figured out that Abaqus is so stupid it can't deal with unlimited stack size soft limits....

In general, I would be interested in where the modification is coming from. I could not find it in the installation here or in the OS image, which would be my first candidates to look at. Are your 'common' Chef modules also located on GitHub?

Greetings Klaas