Snakemake-Profiles / lsf

Snakemake profile for running jobs on an LSF cluster
MIT License

resources: mem_mb= specification doesn't change memory requirement for LSF job submission #41

Open ipstone opened 3 years ago

ipstone commented 3 years ago

Hello,

I am not sure whether there's something wrong in my Snakefile setup: when I use

    resources:
        mem_mb=32, time_min = 300

the time limit is correctly passed to the LSF (bsub) jobs, but the memory request still uses the memory limit set in the snakemake profile.

Is there some way to change the memory requirement for a run without changing the snakemake profile? If there's a quick way to edit the profile to make the change, that would be a valuable solution for me as well (right now, using cookiecutter to create a new profile works for me, but directly editing the profile could be quicker/more direct).
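For reference, resources are declared per rule, and mem_mb is interpreted in megabytes, so mem_mb=32 requests only 32 MB. A rule asking for 32 GB would look roughly like this (the rule and file names are hypothetical):

    rule example:
        input: "data.txt"
        output: "result.txt"
        threads: 4
        resources:
            mem_mb=32000,    # megabytes: 32000 MB = 32 GB
            time_min=300
        shell: "process {input} > {output}"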

Thanks a lot

Isaac

ipstone commented 3 years ago

Another quick, related newbie question: is the memory requirement per job or per thread? Say I ask for 8 GB as the memory limit: will I get 32 GB in total for 4 threads, or will each of the 4 threads only get 2 GB? Thanks

leoisl commented 3 years ago

hello, the memory limit is per job. So if you ask for 8 GB and you are running 4 threads, it is fine as long as the process altogether consumes at most 8 GB; otherwise it will be killed.

The mem_mb you pass ending up as the profile default is a weird issue. I am guessing your profile is configured to use GB for LSF_UNIT_FOR_LIMITS, so we calculate your job's memory as 0.032 GB and then round it up to 1 GB, i.e.:

https://github.com/Snakemake-Profiles/lsf/blob/34c3c4c462d3a2070643a00033815f30bfd105e0/%7B%7Bcookiecutter.profile_name%7D%7D/lsf_submit.py#L87-L88
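In other words, the conversion being described works roughly like this (a sketch only; the real implementation is at the link above, and the function name and exact divisor here are illustrative):

    import math

    def mem_in_lsf_units(mem_mb: int, lsf_unit: str) -> int:
        # Convert Snakemake's mem_mb to the cluster's LSF_UNIT_FOR_LIMITS,
        # rounding up so the request never drops below what was asked for.
        divisors = {"MB": 1, "GB": 1000, "TB": 1000 ** 2}
        # e.g. mem_mb=32 with GB units: 32 / 1000 = 0.032 -> ceil -> 1 GB
        return max(1, math.ceil(mem_mb / divisors[lsf_unit]))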

Could you post the config.yaml of your profile to help us debug it?

ipstone commented 3 years ago

Thanks @leoisl

Here is my config.yaml; please take a look:

latency-wait: "5"
jobscript: "lsf_jobscript.sh"
use-conda: "True"
use-singularity: "False"
printshellcmds: "True"
restart-times: "0"
jobs: "500"
cluster: "lsf_submit.py"
cluster-status: "lsf_status.py"
max-jobs-per-second: "10"
max-status-checks-per-second: "10"

Here is the CookieCutter.py content:

class CookieCutter:
    """
    Cookie Cutter wrapper
    """

    @staticmethod
    def get_default_threads() -> int:
        return int("8")

    @staticmethod
    def get_default_mem_mb() -> int:
        return int("16384")

    @staticmethod
    def get_log_dir() -> str:
        return "logs/cluster"

    @staticmethod
    def get_default_queue() -> str:
        return ""

    @staticmethod
    def get_lsf_unit_for_limits() -> str:
        return "GB"

    @staticmethod
    def get_unknwn_behaviour() -> str:
        return "wait"

    @staticmethod
    def get_zombi_behaviour() -> str:
        return "ignore"

    @staticmethod
    def get_latency_wait() -> float:
        return float("5")

Additionally, I noticed some discrepancies between the LSF job description obtained through bjobs -l and the information in the snakemake log file. For example, bjobs -l gives:


Job <1155966>, Job Name <gen_sigs.genetics_type=genetics_exon>, User <ipstone>, Project 
                     <default>, Application <default>, Status <RUN>, Queue <cpu
                     queue>, Job Priority <12>, Command </cluster/data/lab/pro
                     jects/ipstone/genetics_project/.snakemake/tmp.cunieacm/snakejob.
                     gen_sigs.6.sh>, Share group charged </ipstone>, Esub <memlimi
                     t>
Wed May 19 18:01:37: Submitted from host <lx01>, CWD </cluster/data/lab/projec
                     ts/ipstone/genetics_project>, Output File <logs/cluster/gen_sigs
                     /genetics_type=genetics_exon/jobid6_c5692471-3cd3-4544-8e62-97db
                     5dd49fbd.out>, Error File <logs/cluster/gen_sigs/genetics_typ
                     e=genetics_exon/jobid6_c5692471-3cd3-4544-8e62-97db5dd49fbd.e
                     rr>, 4 Task(s), Requested Resources <select[mem>17] rusage
                     [mem=17] span[hosts=1]>;
Wed May 19 18:01:38: Started 4 Task(s) on Host(s) <4*lt06>, Allocated 4 Slot(s)
                      on Host(s) <4*lt06>, Execution Home </home/ipstone>, Executi
                     on CWD </cluster/data/lab/projects/ipstone/genetics_project>;
Wed May 19 20:23:33: Resource usage collected.
                     The CPU time used is 33106 seconds.
                     MEM: 11 Gbytes;  SWAP: 0 Gbytes;  NTHREAD: 84
                     PGID: 51801;  PIDs: 51801 51802 51804 51805 51810 51811 
                     51845 71958 71959 71960 71961 71962 71963 71967 71969 
                     71970 71974 71975 71979 71983 71987 71988 71998 71999 
                     72000 72001 72002 72006 72013 72014 72015 72016 72018 
                     72022 72023 72024 72028 72029 72030 72031 72035 72036 
                     72037 72041 72042 72043 72047 72048 72049 72053 72061 
                     72068 72075 72085 72098 72109 72119 72132 72139 72143 
                     72151 72161 72209 72214 72218 72219 72226 72229 72234 
                     72235 72236 72237 72245 72252 72265 72273 72279 72283 
                     72290 

 RUNLIMIT                
 1440.0 min

 MEMLIMIT
     17 G 

 MEMORY USAGE:
 MAX MEM: 11 Gbytes;  AVG MEM: 10 Gbytes

whereas the snakemake log file for the same job (from an earlier attempt that exited because a wrong input file name was given) shows:

Resource usage summary:

    CPU time :                                   5.23 sec.
    Max Memory :                                 -
    Average Memory :                             -
    Total Requested Memory :                     68.00 GB
    Delta Memory :                               -
    Max Swap :                                   -
    Max Processes :                              6
    Max Threads :                                14
    Run time :                                   23 sec.
    Turnaround time :                            23 sec.

It seems that, in the snakemake log file, the jobs submitted by snakemake multiplied my memory request (16 GB x 4), but the bjobs -l command shows a per-job memory limit of ~16 GB (as stated in your answer). My guess is that there might be a misreporting in the logs.
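One way to read the numbers, assuming the profile converts MB to GB with a decimal divisor and a ceiling, and assuming this cluster reserves rusage per task rather than per job (neither is confirmed here):

    import math

    mem_mb_default = 16384                          # profile default (get_default_mem_mb)
    per_task_gb = math.ceil(mem_mb_default / 1000)  # 17 -> matches "MEMLIMIT 17 G" above
    tasks = 4                                       # "4 Task(s)" in the bjobs -l output
    total_gb = per_task_gb * tasks                  # 68 -> matches "Total Requested Memory: 68.00 GB"

Under that reading, the 68 GB would not be misreporting but the per-task reservation summed over the 4 tasks.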

Thanks a lot for the quick reply!

Lastly, here is our LSF version:

IBM Spectrum LSF Standard 10.1.0.10, Jun 23 2020

pedrofale commented 3 years ago

I was having the same issue, so I just changed lsf_submit.py so that it does not set the -M option on this line: https://github.com/Snakemake-Profiles/lsf/blob/34c3c4c462d3a2070643a00033815f30bfd105e0/%7B%7Bcookiecutter.profile_name%7D%7D/lsf_submit.py#L90 I'm not sure what the point of having it there is.
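A sketch of the kind of change being described, assuming the submission arguments are assembled roughly like this (the function and parameter names are hypothetical; only bsub's -R and -M flags are real):

    def memory_args(mem_gb: int, set_hard_limit: bool = False) -> str:
        # Always reserve memory via -R rusage; the hard memory
        # limit -M is the part being dropped here.
        args = f"-R 'select[mem>{mem_gb}] rusage[mem={mem_gb}] span[hosts=1]'"
        if set_hard_limit:
            args += f" -M {mem_gb}"
        return args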

mfansler commented 2 years ago

> hello, the memory limit is per job. So if you ask for 8 GB and you are running 4 threads, it is fine as long as the process altogether consumes at most 8 GB; otherwise it will be killed.

@leoisl Please be aware that this is not universally true. The LSF system I use definitely interprets the argument as per-thread.

leoisl commented 2 years ago

https://github.com/Snakemake-Profiles/lsf/issues/50 and https://github.com/Snakemake-Profiles/lsf/issues/51 should fix this