ENCODE-DCC / caper

Cromwell/WDL wrapper for Python
MIT License

How to set up an LSF-specific conf file for Caper #93

Open ls233 opened 3 years ago

ls233 commented 3 years ago

Hi Jin,

I'm looking for the right value of the platform parameter to pass when initializing Caper on my HPC (Mount Sinai), which uses the LSF scheduler. I'm referring to section 2.3 of this manual: https://github.com/MoTrPAC/motrpac-atac-seq-pipeline.

Thanks, -- German Nudelman, Ph.D. Sr. Bioinformatics Developer/Analyst Icahn School of Medicine at Mount Sinai

leepc12 commented 3 years ago

That link doesn't work. Does your LSF cluster have a wiki page?

ls233 commented 3 years ago

Well, what I'm basically asking is: what platform should I specify for caper init [PLATFORM]?

I don't think my HPC has a wiki, but there is some description here: https://labs.icahn.mssm.edu/minervalab/lsf-queues/.

leepc12 commented 3 years ago

Caper doesn't currently support LSF. If I can get some detailed info about bsub and the corresponding job-monitoring commands, I can add it to Caper later.

leepc12 commented 3 years ago

You may need to run Caper with the local backend, which means that Caper will not bsub tasks; it will run all tasks in the current shell.

Log in to a compute node and then run

caper run ATAC_WDL -i INPUT_JSON --singularity --max-concurrent-tasks 2

Use screen or nohup to keep the session alive, or bsub the Caper command line itself with very large resource requests.

If you want to save resources on the compute node, serialize all tasks by using --max-concurrent-tasks 1.
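
For example, a minimal sketch (the queue name, resource requests, and log file names below are placeholders for your site, and LSF memory units depend on your cluster's LSF_UNIT_FOR_LIMITS setting):

# Option 1: keep Caper alive on a compute node with nohup
nohup caper run ATAC_WDL -i INPUT_JSON --singularity --max-concurrent-tasks 2 \
    > caper.log 2>&1 &

# Option 2: bsub the Caper command line itself with large resource requests,
# since every task will run inside this single job
bsub -q general -n 16 -M 64000 -W 48:00 -o caper.%J.log \
    'caper run ATAC_WDL -i INPUT_JSON --singularity --max-concurrent-tasks 2'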

ls233 commented 3 years ago

Thanks Jin for the suggestion.

For practical reasons, deploying a pipeline such as the ENCODE ATAC-seq pipeline without the ability to submit jobs is of limited utility, unfortunately. This is especially relevant nowadays, when datasets may contain hundreds of samples. Whenever you have the resources, I'd be happy to work with you to add LSF support to Caper, if needed.

Could you please advise on a good starting point for this?

Best, German

leepc12 commented 3 years ago

Sorry for the late reply; currently we don't have a plan to add an LSF backend. If you are familiar with Python, you can start by modifying the PBS backend of Caper:

https://github.com/ENCODE-DCC/caper/blob/master/caper/cromwell_backend.py#L710

You need to modify the bash command lines under the keys submit, kill, check-alive, and job-id-regex; for example, replace qsub with bsub.

        'submit': dedent(
            """\
            if [ -z \\"$SINGULARITY_BINDPATH\\" ]; then export SINGULARITY_BINDPATH=${singularity_bindpath}; fi; \\
            if [ -z \\"$SINGULARITY_CACHEDIR\\" ]; then export SINGULARITY_CACHEDIR=${singularity_cachedir}; fi;
            echo "${if !defined(singularity) then '/bin/bash ' + script
                    else
                      'singularity exec --cleanenv ' +
                      '--home ' + cwd + ' ' +
                      (if defined(gpu) then '--nv ' else '') +
                      singularity + ' /bin/bash ' + script}" | \\
            qsub \\
                -N ${job_name} \\
                -o ${out} \\
                -e ${err} \\
                ${true="-lnodes=1:ppn=" false="" defined(cpu)}${cpu}${true=":mem=" false="" defined(memory_mb)}${memory_mb}${true="mb" false="" defined(memory_mb)} \\
                ${'-lwalltime=' + time + ':0:0'} \\
                ${'-lngpus=' + gpu} \\
                ${'-q ' + pbs_queue} \\
                ${pbs_extra_param} \\
                -V
        """
        ),
        'exit-code-timeout-seconds': 180,
        'kill': 'qdel ${job_id}',
        'check-alive': 'qstat ${job_id}',
        'job-id-regex': '(\\d+)',

HenryCWong commented 3 years ago

I'll probably be working on porting this to LSF soon. In the meantime, this might help; it's old, but the basic commands typically don't change that much: https://modelingguru.nasa.gov/docs/DOC-1040

HenryCWong commented 3 years ago

It should look something like this:

class CromwellBackendLSF(CromwellBackendLocal):
    TEMPLATE_BACKEND = {
        'config': {
            'default-runtime-attributes': {'time': 24},
            'script-epilogue': 'sleep 5',
            'runtime-attributes': dedent(
                """\
                String? docker
                String? docker_user
                Int cpu = 1
                Int? gpu
                Int? time
                Int? memory_mb
                String? lsf_queue
                String? lsf_extra_param
                String? singularity
                String? singularity_bindpath
                String? singularity_cachedir
            """
            ),
            'submit': dedent(
                """\
                if [ -z \\"$SINGULARITY_BINDPATH\\" ]; then export SINGULARITY_BINDPATH=${singularity_bindpath}; fi; \\
                if [ -z \\"$SINGULARITY_CACHEDIR\\" ]; then export SINGULARITY_CACHEDIR=${singularity_cachedir}; fi;
                echo "${if !defined(singularity) then '/bin/bash ' + script
                        else
                          'singularity exec --cleanenv ' +
                          '--home ' + cwd + ' ' +
                          singularity + ' /bin/bash ' + script}" | \\
                bsub \\
                    -J ${job_name} \\
                    -o ${out} \\
                    -e ${err} \\
                    ${true="-n=" false="" defined(cpu)}${cpu} \\
                    ${true="-R 'rusage[mem=" false="" defined(memory_mb)}${memory_mb} ${true="mb]'" false="" defined(memory_mb)} \\
                    ${'-W=' + time + ':0'} \\
                    ${'-q ' + lsf_queue} \\
                    ${lsf_extra_param} \\
                    -V
            """
            ),
            'exit-code-timeout-seconds': 180,
            'kill': 'bkill ${job_id}',
            'check-alive': 'bjobs ${job_id}',
            'job-id-regex': '(\\d+)',
        }
    }

    def __init__(
        self,
        local_out_dir,
        max_concurrent_tasks=CromwellBackendBase.DEFAULT_CONCURRENT_JOB_LIMIT,
        soft_glob_output=False,
        local_hash_strat=CromwellBackendLocal.DEFAULT_LOCAL_HASH_STRAT,
        lsf_queue=None,
        lsf_extra_param=None,
    ):
        super().__init__(
            local_out_dir=local_out_dir,
            backend_name=BACKEND_LSF,
            max_concurrent_tasks=max_concurrent_tasks,
            soft_glob_output=soft_glob_output,
            local_hash_strat=local_hash_strat,
        )
        self.merge_backend(CromwellBackendLSF.TEMPLATE_BACKEND)
        self.backend_config.pop('submit-docker')

        if lsf_queue:
            self.default_runtime_attributes['lsf_queue'] = lsf_queue
        if lsf_extra_param:
            self.default_runtime_attributes['lsf_extra_param'] = lsf_extra_param

Note: I have not tested this.

I got rid of the GPU option because GPU use depends on the LSF implementation.

However, @leepc12, how is "job-id-regex" grabbed in PBS? I'm not completely familiar with how job IDs are grabbed from PBS, so any insight on this would be much appreciated. It shouldn't be too difficult to construct a regex.

HenryCWong commented 3 years ago

Never mind, I went through the Cromwell docs and found this:


LSF {
  actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
  config {
    submit = "bsub -J ${job_name} -cwd ${cwd} -o ${out} -e ${err} /usr/bin/env bash ${script}"
    kill = "bkill ${job_id}"
    check-alive = "bjobs ${job_id}"
    job-id-regex = "Job <(\\d+)>.*"
  }
}

So everything is fairly similar.
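
To sanity-check that regex outside of Cromwell (the job ID below is made up):

# On submission, bsub prints a line like the one echoed here;
# Cromwell applies job-id-regex to that stdout to capture the job ID.
echo 'Job <12345> is submitted to default queue <normal>.' \
    | sed -n 's/.*Job <\([0-9][0-9]*\)>.*/\1/p'   # prints 12345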

HenryCWong commented 3 years ago

I am modifying the PBS backend at https://github.com/ENCODE-DCC/caper/blob/master/caper/cromwell_backend.py, but when I run Caper I still get an error saying it is using qsub. Any help would be appreciated.

HenryCWong commented 3 years ago

For future reference, this is the LSF backend file that worked for me. Your cluster may not have a -G or -q flag, so adjust as needed. I also had to set up specific paths (the PATH="" and LSF_DOCKER_VOLUMES="") for my LSF call; if your compute cluster or system is different, you'll want to take those out too. When you run Caper, just add --backend-file name_of_backendfile.conf. Thanks to @leepc12 for helping me set this up.


backend {
  providers {
    pbs {
      config {
        submit = """if [ -z \"$SINGULARITY_BINDPATH\" ]; then export SINGULARITY_BINDPATH=${singularity_bindpath}; fi; \
if [ -z \"$SINGULARITY_CACHEDIR\" ]; then export SINGULARITY_CACHEDIR=${singularity_cachedir}; fi;

echo "${if !defined(singularity) then '/bin/bash ' + script
        else
          'singularity exec --cleanenv ' +
          '--home ' + cwd + ' ' +
          (if defined(gpu) then '--nv ' else '') +
          singularity + ' /bin/bash ' + script}" | \

PATH="/opt/juicer/CPU/common:/opt/hic-pipeline/hic_pipeline:$PATH" LSF_DOCKER_VOLUMES="/storage1/fs1/dspencer/Active:/storage1/fs1/dspencer/Active" \
bsub \
    -J ${job_name} \
    -o ${out} \
    -e ${err} \
    ${true="-n " false="" defined(cpu)}${cpu} \
    ${true="-M" false="" defined(memory_mb)}${memory_mb}${true="MB" false="" defined(memory_mb)} \ \
    ${'-W' + time + ':0:0'} \
    ${'-q ' + pbs_queue} \
    -G compute-group \
    ${pbs_extra_param} \
"""
        kill = "bkill ${job_id}"
        check-alive = "bjobs ${job_id}"
        job-id-regex = "(\\d+)"
      }
    }
  }
}
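
Then run Caper against it; a minimal sketch (the WDL, input JSON, and backend file names are placeholders):

caper run atac.wdl -i input.json --singularity \
    --backend-file lsf_backend.conf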
ernstki commented 2 years ago

Hi everyone. I'm charged with standing up the ENCODE ATAC-seq pipeline in our environment, which is LSF, and I'm willing to carry the baton across the finish line with a GitHub pull request to see LSF supported out of the box for all users of Caper.

@HenryCWong, you have done most of the legwork already. If I can test your changes locally, and everything works for the two of us, at our two different sites, is there a way I can walk you through how to do a PR on GitHub, or... are you clear on how to do that? Do you have the time?

It would be a shame for you not to get credit, if it gets merged into the codebase.

leepc12 commented 2 years ago

@ernstki: Please let me make a dev PR for you; you can pull it (you may need to git pull the test branch and add the git directory to PYTHONPATH so that the pip-installed one is ignored) and test it on your clusters.
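
For example, something like this (the branch name dev-lsf below is a placeholder):

git clone https://github.com/ENCODE-DCC/caper.git
cd caper
git checkout dev-lsf                  # placeholder branch name
export PYTHONPATH="$PWD:$PYTHONPATH"  # shadow the pip-installed caper
python -c 'import caper; print(caper.__file__)'  # should print a path inside this checkout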

All I need is a working custom backend file (--backend-file) that works on most LSF clusters. Then you will be able to use caper init lsf and just define the required parameters in the conf file (~/.caper/default.conf).

If that works for the two of you, @ernstki and @HenryCWong, then I can merge it to master.

HenryCWong commented 2 years ago

Hi, sorry for the late response, y'all. So do you still want me to make the PR, since @leepc12 is making a dev PR?

I can get you the custom LSF backend file tomorrow. The one above should work, but I also haven't been in here in two months, so I'll double-check things.

ernstki commented 2 years ago

@HenryCWong If you're willing to just fork Caper and commit what you have above, I'm willing to cherry-pick that commit from your fork and do any remaining work to get it in a state that meets @leepc12's requirements.

This way you'll get credit in Caper's Git commit history for the work you've done, and you will be Internet Famous. ;) If that kind of fame has no great appeal for you, I can just copy-paste what you have above instead; I'll credit you in the relevant commit message, and you can forget about the forking and all that.

I think we can discuss whether @leepc12 wants to put custom backends in a contrib subdirectory and other details like that in the PR.

HenryCWong commented 2 years ago

Thanks for the info. I forked the repo and made a commit here: https://github.com/HenryCWong/caper.

lauragails commented 2 years ago

I opened my password manager to log in to specifically thank you for doing this.

(I am another bioinformatician at Mount Sinai, on the same computing environment, who needed this fix.)

HenryCWong commented 2 years ago

It seems IBM has been customizing specific LSF features for customers, so if it doesn't work for you and you need to run Caper with a custom backend, I can try to help out.

lauragails commented 2 years ago

Thank you so much!

ernstki commented 2 years ago

It looks like #148 (release 2.0.0) implements LSF support, so thanks @leepc12!

Not sure if that's based on what @HenryCWong shared here or not, but it looks like this issue could be closed if v2.0.0 meets @ls233's requirements.

The project I needed this for is not yet near the stage where it's ready to submit jobs to a cluster anyway, so I wouldn't have been able to work on this for several weeks at least.