NCAR / CUPiD

CUPiD is a “one stop shop” that enables and integrates timeseries file generation, data standardization, diagnostics, and metrics from all CESM components.
https://ncar.github.io/CUPiD/
Apache License 2.0

/glade/scratch #81

Open dabail10 opened 6 months ago

dabail10 commented 6 months ago

Describe the bug The /glade/scratch partition is not available and at least one of the notebooks points there.

To Reproduce cupid-run config.yml

Expected behavior The following message:

PermissionError: [Errno 13] Permission denied: '/glade/scratch'

ploomber.exceptions.TaskBuildError: Error when executing task 'ocean_surface'. Partially executed notebook available at /glade/u/home/dbailey/CUPiD/examples/coupled_model/computed_notebooks/quick-run/ocean_surface.ipynb
ploomber.exceptions.TaskBuildError: Error building task "ocean_surface"
===================================================== Summary (1 task) =====================================================
NotebookRunner: ocean_surface -> File('computed_notebook...cean_surface.ipynb')
===================================================== DAG build failed =====================================================

Additional context There are a number of paths hard-coded to /glade/scratch in mom6-tools.

mnlevy1981 commented 6 months ago

I think the issue is that mom6-tools uses ncar-jobqueue, and the default configuration for that package points to /glade/scratch/. Do you have a ~/.config/dask/ncar-jobqueue.yaml file on glade? If so, there's probably a block like

casper-dav:
  pbs:
    #    project: XXXXXXXX
    name: dask-worker-casper-dav
    cores: 1 # Total number of cores per job
    memory: '10GB' # Total amount of memory per job
    processes: 1 # Number of Python processes per job
    interface: ext
    walltime: '01:00:00'
    resource-spec: select=1:ncpus=1:mem=25GB
    queue: casper
    log-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/logs'
    local-directory: '/glade/derecho/scratch/${USER}/dask/casper-dav/local-dir'
    job-extra: []
    env-extra: []
    death-timeout: 60

Here I've already updated log-directory and local-directory to use /glade/derecho/scratch, but your version may specify /glade/scratch instead. Another place to look is ~/.dask/jobqueue.yaml, where the block is

jobqueue:
  pbs:
    cores: 1
    interface: ext
    job-extra: []
    local-directory: /glade/derecho/scratch/mlevy
    log-directory: /glade/derecho/scratch/mlevy
    memory: 10GiB
    name: dask-worker
    processes: 1
    queue: regular
    resource-spec: select=1:ncpus=1:mem=10GB
    walltime: 01:00:00

and again, I've updated log-directory and local-directory.
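For anyone else hitting this, here is a minimal sketch of a check for the two files mentioned above. It does a plain text search for /glade/scratch rather than parsing the YAML, so it needs nothing beyond the standard library. The function names (`find_stale_lines`, `scan_config_files`) are made up for illustration and are not part of CUPiD, mom6-tools, or ncar-jobqueue.

```python
from pathlib import Path

# The retired partition named in this issue. Note that the updated
# "/glade/derecho/scratch" path does not contain this string.
STALE_PREFIX = "/glade/scratch"

# The two locations mentioned in this thread; either file may be absent.
CANDIDATE_FILES = [
    Path.home() / ".config" / "dask" / "ncar-jobqueue.yaml",
    Path.home() / ".dask" / "jobqueue.yaml",
]


def find_stale_lines(text, stale_prefix=STALE_PREFIX):
    """Return (line_number, stripped_line) pairs that mention stale_prefix."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if stale_prefix in line
    ]


def scan_config_files(paths=CANDIDATE_FILES):
    """Map each config file that exists to the stale lines found in it."""
    report = {}
    for path in paths:
        if path.is_file():
            report[path] = find_stale_lines(path.read_text())
    return report


if __name__ == "__main__":
    # Read-only: prints file, line number, and the offending line.
    for path, hits in scan_config_files().items():
        for number, line in hits:
            print(f"{path}:{number}: {line}")
```

Any line it reports (typically log-directory or local-directory) is a candidate for switching to /glade/derecho/scratch by hand.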

dabail10 commented 6 months ago

Got it. Should I just wipe out that whole directory? When did it get created?

mnlevy1981 commented 6 months ago

I would just modify those two files (or whichever of them exist) to make sure the paths are correct.

mnlevy1981 commented 6 months ago

(while you're at it, make sure interface is ext instead of ib0)

dabail10 commented 6 months ago

There is no setting for derecho in these files and there is still a hobart setting. How does it get created? We should wipe this directory out and make sure everyone gets a fresh version.

mnlevy1981 commented 6 months ago

I'm not sure how it gets created, hence my reluctance to remove it :) I noticed the lack of derecho settings, but CUPiD runs fine on derecho so I don't think it's an issue. Instead of outright deleting it, can you rename it and see if it's recreated (or if CUPiD runs without it)?
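The rename-instead-of-delete suggestion can be sketched as below. This is only an illustration: `set_aside` is a hypothetical helper, not something provided by CUPiD or dask, and the ".bak" suffix is an arbitrary choice. It refuses to overwrite an earlier backup so nothing is lost if it is run twice.

```python
from pathlib import Path


def set_aside(directory, suffix=".bak"):
    """Rename `directory` to `directory + suffix` and return the new path.

    Returns None if the directory does not exist. Raises FileExistsError
    instead of overwriting a backup made by an earlier call.
    """
    directory = Path(directory)
    if not directory.is_dir():
        return None
    backup = directory.with_name(directory.name + suffix)
    if backup.exists():
        raise FileExistsError(f"{backup} already exists")
    directory.rename(backup)
    return backup


# Example call (not executed here), using the directory from this thread:
#   set_aside(Path.home() / ".config" / "dask")
```

After setting the directory aside, rerunning cupid-run shows whether the directory is regenerated, and the old copy is still available for comparison.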

dabail10 commented 6 months ago

Interesting. I deleted the ~/.config/dask directory and it got recreated when I reran cupid-run. More precisely, I also wiped out the computed notebooks, and then the directory was recreated. The ncar-jobqueue.yaml file is out of date. This must be coming from a CISL file somewhere.