Open vallerul opened 2 years ago
Thanks. Looks like it's here https://github.com/OSC/ood_core/blob/01de647f2ee3ac3b218afa89581126166c7a829d/lib/ood_core/job/adapters/lsf/batch.rb#L40
Does bjobs respond to the environment variable CLEAN_PERIOD?
As far as I remember, it cannot be used as an environment variable. CLEAN_PERIOD is a parameter in the lsb.params configuration file, and is usually set as part of scheduler policies.
https://www.ibm.com/support/pages/how-increase-default-retention-job-information-lsf-memory.
hmmm ok. yea it seems like we could default to false (not using the flag) and folks can enable it if they choose.
I can't recall what torque did, but Slurm doesn't keep job info around for very long.
The Active Jobs app, when showing jobs for all users, hangs when LSF runs thousands of jobs and the job history in LSF is kept for days instead of hours. CLEAN_PERIOD in the LSF configuration controls how long finished jobs are retained, and therefore how much data bjobs -a retrieves. CLEAN_PERIOD is usually a day, but when it was increased to 3 days, the app hung indefinitely. The issue appears to be caused by the bjobs arguments in lib/ood_core/job/adapters/lsf/batch.rb:
def get_jobs_for_user(user)
  args = %W( -u #{user} -a -w -W )
  parse_bjobs_output(call("bjobs", *args))
end
bjobs -u all -a -w -W is very resource intensive when thousands of jobs are scheduled, and can take almost forever to return. I had to make the following change (removing -a) to make it respond:
def get_jobs_for_user(user)
  args = %W( -u #{user} -w -W )
  parse_bjobs_output(call("bjobs", *args))
end
It would be good to keep the above configurable, instead of making the change in code.
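For illustration, here is a rough sketch of how the -a flag could be made optional. It assumes a hypothetical include_finished option and a stand-in BatchSketch class; neither is an existing ood_core name, and the real adapter may wire its configuration differently. It only builds the argument list and does not call bjobs.

```ruby
# Hypothetical stand-in for the LSF adapter's batch class; this is NOT the
# actual ood_core code, only a sketch of the idea.
class BatchSketch
  # include_finished is an assumed option name. When true, -a is passed to
  # bjobs so recently finished jobs are listed too. It defaults to false so
  # that large sites are not hit by the expensive query unless they opt in.
  def initialize(include_finished: false)
    @include_finished = include_finished
  end

  # Build the bjobs argument list for the given user (or "all").
  def bjobs_args(user)
    args = ["-u", user.to_s]
    args << "-a" if @include_finished
    args + ["-w", "-W"]
  end
end

p BatchSketch.new.bjobs_args("all")
# => ["-u", "all", "-w", "-W"]
p BatchSketch.new(include_finished: true).bjobs_args("all")
# => ["-u", "all", "-a", "-w", "-W"]
```

With something like this, get_jobs_for_user could pass the generated list to call("bjobs", *args), and a site would enable the flag only if its CLEAN_PERIOD is short enough for bjobs -a to return quickly.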