Nextomics / NextDenovo

Fast and accurate de novo assembler for long reads
GNU General Public License v3.0

LSF memory request reports a bug and the fixing func #63

Closed haoliu1213 closed 4 years ago

haoliu1213 commented 4 years ago

Hi, when I use -R rusage[mem={vf}] -M {vf} to request memory reservation and memory control in LSF, the software fails. After several tests, I found that lsf-drmaa cannot accept a memory request string with a unit (e.g. GB, MB); it only takes a number. So I added a function to fix this in task_control.py (Run class, init function, from line 152). The NextDenovo version is 2.1-beta.0.

def lsf_mem(mem):
    import re
    # LSF default unit, defined by the LSF system; ours is 'MB', change it as you like.
    LSF_UNIT_FOR_LIMITS = "MB"

    # Powers of 1024 relative to KB for each supported unit prefix.
    power = {'K': 0, 'M': 1, 'G': 2, 'T': 3}

    mem = str(mem)
    match = re.search(r'(\d+)\s*([KMGT])', mem, re.I)
    if not match:
        # No unit suffix: the value is already a plain number, pass it through.
        return mem

    value = int(match.group(1))
    shift = power[match.group(2).upper()] - power[LSF_UNIT_FOR_LIMITS[0].upper()]
    if shift >= 0:
        # Request is in the same or a larger unit than the LSF unit: scale up.
        mem = value * 1024 ** shift
    else:
        # Request is in a smaller unit: round up, never below 1.
        mem = max(1, -(-value // 1024 ** -shift))

    return str(mem)

 self.vf = str(vf) if vf else self.cpu + 'G'
 if self.job_type == 'lsf':
     self.vf = lsf_mem(self.vf)

The code is rough; maybe you have a better way to fix it. I hope it helps.

moold commented 4 years ago

Thank you so much. Actually, this bug has been fixed in version v2.2-beta.0 by removing the {vf} option when running on an LSF system, to avoid unexpected errors. Users can use the {cpu} option to control the total number of subtasks running on a compute node. Thank you again.

haoliu1213 commented 4 years ago

I see, but there is no way to request a memory reservation if {vf} is removed. It is easy to run out of memory in the sort stage, because other big-memory jobs may run on the same node.

moold commented 4 years ago

Yes, because I do not have an LSF system and cannot debug it. So you mean the LSF system uses -R rusage[mem={vf}] -M {vf} to control CPU and memory for a job? Could it include some typos?

haoliu1213 commented 4 years ago

-n {cpu} controls the CPU count. -R rusage[mem={vf}] means the job will use {vf} of memory, so the system will allocate a node with more than {vf} available to this job. -M {vf} means the system will kill the job if its memory usage exceeds {vf}. After fixing it, I can run the whole pipeline successfully without the -dbuf option; otherwise some nodes get stuck because they run out of memory in the correction stage.
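
As a minimal sketch of how the three options described above fit together, the snippet below builds the option string from a CPU count and a memory value. The helper name build_lsf_options is hypothetical and is not part of NextDenovo; it assumes the memory value has already been converted to a plain number in the cluster's LSF_UNIT_FOR_LIMITS, since lsf-drmaa reportedly rejects values with a unit suffix.

```python
def build_lsf_options(cpu, mem):
    """Build the scheduler option string for one job (hypothetical helper).

    cpu: number of slots to request.
    mem: memory as a plain number, already expressed in the cluster's
         LSF_UNIT_FOR_LIMITS (no 'G' or 'M' suffix).
    """
    cpu, mem = int(cpu), int(mem)
    if cpu < 1 or mem < 1:
        raise ValueError("cpu and mem must be positive integers")
    # -n: slot count, -R rusage[mem=...]: reservation, -M: hard memory limit.
    return "-n {cpu} -R rusage[mem={vf}] -M {vf}".format(cpu=cpu, vf=mem)


print(build_lsf_options(4, 8192))  # -n 4 -R rusage[mem=8192] -M 8192
```

Using the same value for the reservation and the limit mirrors the command in this thread; a site could choose a limit slightly above the reservation instead.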

moold commented 4 years ago

ok, thanks.

moold commented 4 years ago

Hi, could you help me write some code to get the value of LSF_UNIT_FOR_LIMITS automatically? I cannot find an LSF system to test on.

haoliu1213 commented 4 years ago

OK, I will give it a try.

haoliu1213 commented 4 years ago

kit.py

##add func##
def lsf_mem(mem):
    import re, os

    # LSF limit unit, defined by the LSF system; fall back to 'MB' if it cannot be read.
    LSF_UNIT_FOR_LIMITS = "MB"
    LSF_CONF_BASENAME = "lsf.conf"
    LSF_ENVDIR = os.getenv('LSF_ENVDIR')
    LSF_CONF_FILEPATH = os.path.join(LSF_ENVDIR, LSF_CONF_BASENAME) if LSF_ENVDIR else ""
    if os.path.isfile(LSF_CONF_FILEPATH):
        with open(LSF_CONF_FILEPATH, 'r') as f:
            # Anchor at the line start so commented-out settings are ignored.
            unit = re.search(r'^[ \t]*LSF_UNIT_FOR_LIMITS[ \t]*=[ \t]*(\S+)', f.read(), re.M)
        name = unit.group(1).strip('"\'') if unit else ""
        if name and name[0].upper() in 'KMGT':
            LSF_UNIT_FOR_LIMITS = name.upper()

    # Powers of 1024 relative to KB for each supported unit prefix.
    power = {'K': 0, 'M': 1, 'G': 2, 'T': 3}

    mem = str(mem)
    match = re.search(r'(\d+)\s*([KMGT])', mem, re.I)
    if not match:
        # No unit suffix: the value is already a plain number, pass it through.
        return mem

    value = int(match.group(1))
    shift = power[match.group(2).upper()] - power[LSF_UNIT_FOR_LIMITS[0]]
    if shift >= 0:
        # Request is in the same or a larger unit than the LSF unit: scale up.
        mem = value * 1024 ** shift
    else:
        # Request is in a smaller unit: round up, never below 1.
        mem = max(1, -(-value // 1024 ** -shift))

    return str(mem)

task_control.py

 self.vf = str(vf) if vf else self.cpu + 'G'
 ##add code##
 if self.job_type == 'lsf':
     self.vf = lsf_mem(self.vf)

moold commented 4 years ago

Ok, thank you. I will add it in the next version, but I may make some changes.
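
Since an LSF system is not always available for testing, the unit lookup can be exercised against sample configuration text instead of a real lsf.conf. The following is only a sketch under that assumption: parse_lsf_unit is a hypothetical helper name, the sample text is made up for illustration, and the "MB" fallback simply follows the choice made in this thread.

```python
import re


def parse_lsf_unit(conf_text, default="MB"):
    """Return the LSF_UNIT_FOR_LIMITS value found in lsf.conf text, or default."""
    # Anchor at the line start so commented-out settings are skipped.
    match = re.search(
        r'^[ \t]*LSF_UNIT_FOR_LIMITS[ \t]*=[ \t]*"?(\w+)"?', conf_text, re.M
    )
    return match.group(1).upper() if match else default


# Made-up sample content, not taken from a real cluster.
sample = """
# LSF_UNIT_FOR_LIMITS=KB
LSF_LOGDIR=/var/log/lsf
LSF_UNIT_FOR_LIMITS=GB
"""

print(parse_lsf_unit(sample))  # GB
print(parse_lsf_unit(""))      # MB
```

Keeping the parsing separate from the file access makes it possible to check the regular expression without an LSF installation.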