OSC / ood_core

Open OnDemand core library
https://osc.github.io/ood_core/
MIT License

LSF bjobs for all users hangs #745

Open vallerul opened 2 years ago

vallerul commented 2 years ago

The Active Jobs app hangs when listing jobs for all users if LSF is running thousands of jobs and keeps job history for days instead of hours. CLEAN_PERIOD in the LSF configuration controls how much data bjobs retrieves. CLEAN_PERIOD is usually a day, but when it was increased to 3 days, the app hung indefinitely. The issue appears to be the bjobs arguments in lib/ood_core/job/adapters/lsf/batch.rb:

    def get_jobs_for_user(user)
      args = %W( -u #{user} -a -w -W )
      parse_bjobs_output(call("bjobs", *args))
    end

bjobs -u all -a -w -W is very resource intensive when thousands of jobs are scheduled, and can take almost forever to return.

I had to make the following change (remove -a) to make it respond:

    def get_jobs_for_user(user)
      args = %W( -u #{user} -w -W )
      parse_bjobs_output(call("bjobs", *args))
    end

It would be good to keep the above configurable, instead of making the change in code.


johrstrom commented 2 years ago

Thanks. Looks like it's here https://github.com/OSC/ood_core/blob/01de647f2ee3ac3b218afa89581126166c7a829d/lib/ood_core/job/adapters/lsf/batch.rb#L40

and here https://github.com/OSC/ood_core/blob/01de647f2ee3ac3b218afa89581126166c7a829d/lib/ood_core/job/adapters/lsf/batch.rb#L49

Does bjobs respond to the environment variable CLEAN_PERIOD?

vallerul commented 2 years ago

As far as I remember, it cannot be used as an environment variable. CLEAN_PERIOD is part of the lsb.params configuration file, and is usually set as part of scheduler policies.

https://www.ibm.com/support/pages/how-increase-default-retention-job-information-lsf-memory

johrstrom commented 2 years ago

hmmm ok. yea it seems like we could default to false (not using the flag) and folks can enable it if they choose.

I can't recall what torque did, but Slurm doesn't keep job info around for very long.
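
For illustration only, the opt-in behavior discussed in this thread might look roughly like the sketch below. It is not the ood_core implementation: the helper name `bjobs_args` and the `include_finished` option are hypothetical, chosen here to show `-a` being left out unless explicitly enabled.

```ruby
# Hypothetical sketch, not the actual ood_core code.
# Builds the argument list passed to bjobs for a given user (or "all").
#
# include_finished is an assumed option name. It defaults to false, which
# matches the suggestion above to leave -a off unless a site enables it.
def bjobs_args(user, include_finished: false)
  args = ["-u", user.to_s]
  # -a also reports recently finished jobs, which is the expensive part
  # when LSF retains job history for a long CLEAN_PERIOD.
  args << "-a" if include_finished
  args + ["-w", "-W"]
end

# Default: same arguments as the workaround in this issue (no -a).
puts bjobs_args("all").join(" ")
# Opt-in: same arguments as the current adapter code.
puts bjobs_args("all", include_finished: true).join(" ")
```

In a real change the option would presumably come from the cluster configuration and be passed through to the LSF adapter. How that is wired up is left open here.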