NCAR / ncar-python-tutorial

Numerical & Scientific Computing with Python Tutorial
https://ncar.github.io/ncar-python-tutorial
Creative Commons Attribution 4.0 International

update jobqueue.yaml #14

matt-long closed this issue 5 years ago

matt-long commented 5 years ago

The dashboard has changed names.

Is the SLURM configuration actually a good one?

matt-long commented 5 years ago

I think the SLURM config is reasonable; however, I would like to support good defaults for Hobart.

Hobart has 48 cores per node: http://www.cgd.ucar.edu/systems/documentation-toc/02.11.03_-_HPC_Cluster.html

`resource_spec` needs to be specified in `PBSCluster`. I was able to get it working with the following.

import dask_jobqueue

# One full Hobart node: 48 cores, one single-threaded worker process per core
cluster = dask_jobqueue.PBSCluster(cores=48,
                                   processes=48,
                                   walltime='08:00:00',
                                   memory='96GB', queue='medium',
                                   resource_spec='nodes=1:ppn=48',
                                   job_extra=['-r n'])  # mark the PBS job non-rerunnable
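For reference, once the cluster exists the usual next step is to scale it and attach a client; a minimal sketch (the worker count here is arbitrary):

from dask.distributed import Client

cluster.scale(48)          # request workers; 48 fills one Hobart node
client = Client(cluster)   # send subsequent dask work to this cluster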

We need to have `./config/jobqueue-cheyenne.yaml` and `./config/jobqueue-hobart.yaml`,

and accept a `machine` argument in `copy_config`.
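For `jobqueue-hobart.yaml`, something along these lines would just translate the `PBSCluster` call above into config form (a sketch; the values are a starting point, not vetted defaults):

jobqueue:
  pbs:
    name: dask-worker

    # Dask worker options (one full Hobart node)
    cores: 48                   # Total number of cores per job
    memory: '96GB'              # Total amount of memory per job
    processes: 48               # Number of Python processes per job

    # PBS resource manager options
    queue: medium
    walltime: '08:00:00'
    resource-spec: 'nodes=1:ppn=48'
    job-extra: ['-r n']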

andersy005 commented 5 years ago

Is it safe to close this issue since it was fixed in https://github.com/NCAR/ncar-jobqueue/pull/12?

jhamman commented 5 years ago

Just popping in to encourage you all to suggest edits to `/glade/u/apps/config/dask/dask.yaml` via the CISL help desk.

This file currently looks like:

distributed:
  scheduler:
    bandwidth: 1000000000     # 1 GB/s estimated worker-worker bandwidth
  worker:
    memory:
      target: 0.90  # Avoid spilling to disk
      spill: False  # Avoid spilling to disk
      pause: 0.80  # fraction at which we pause worker threads
      terminate: 0.95  # fraction at which we terminate the worker
  comm:
    compression: null

jobqueue:
  pbs:
    name: dask-worker

    # Dask worker options
    cores: 1                    # Total number of cores per job
    memory: '3 GB'              # Total amount of memory per job
    processes: 1                # Number of Python processes per job

    interface: ib0              # Network interface to use like eth0 or ib0

    # PBS resource manager options
    queue: share
    walltime: '00:30:00'
    resource-spec: select=1

  slurm:
    name: dask-worker
    # Dask worker options
    cores: 1                    # Total number of cores per job
    memory: '25 GB'             # Total amount of memory per job
    processes: 1                # Number of Python processes per job

    interface: ib0              # Network interface to use like eth0 or ib0

    # SLURM resource manager options
    walltime: '00:30:00'
    job-extra: ['-C skylake']

but if you feel like there are more reasonable default values, we can suggest edits.
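In the meantime, anything in that site file can be overridden per user at runtime with `dask.config.set` before a cluster is created; a minimal sketch (the values shown are placeholders, not recommendations):

import dask

# User-level runtime settings take precedence over the site-wide YAML defaults
dask.config.set({'jobqueue.pbs.cores': 36,
                 'jobqueue.pbs.memory': '109GB',
                 'jobqueue.pbs.processes': 36})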