
Parsl - a Python parallel scripting library
http://parsl-project.org
Apache License 2.0

Issue requesting resources on non-exclusive nodes. #1246

Open MatthewBM opened 4 years ago

MatthewBM commented 4 years ago

Hello,

I am trying to request nodes with the following config:


from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.addresses import address_by_query
from parsl.providers import SlurmProvider
from parsl.channels import SSHChannel
from parsl.launchers import SrunLauncher

config_comet_haswell = Config(
        executors=[
            HighThroughputExecutor(
                cores_per_worker=16,
                mem_per_worker=360,
                label='Comet_HTEX_largenode',
                address=address_by_query(),
                worker_logdir_root= '/home/mmadany/parsl_jobs',
                working_dir = '/oasis/scratch/comet/mmadany/temp_project/parsl',
                worker_port_range=(63105, 63114),
                #worker_ports='63105',
                provider=SlurmProvider(
                    'large-shared',
                    channel=SSHChannel(
                        hostname='comet.sdsc.xsede.org',
                        username='mmadany',                                  # <--- Update here
                        script_dir='/home/mmadany/parsl_scripts',   # <--- Update here
                    ),

                    launcher=SrunLauncher(),
                    # string to prepend to #SBATCH blocks in the submit
                    # script to the scheduler
                    scheduler_options='#SBATCH -A DDP140 --mem=361G',
                    # Command to be run before starting a worker, such as:
                    # 'module load Anaconda; source activate parsl_env'.
                    worker_init='echo "hello comet"; source /home/mmadany/.bashrc; conda activate parsl; nc -z -v -w5 pestilence.crbs.ucsd.edu 63105',    # <--- Update here to how you setup your env
                    walltime='47:15:00',
                    init_blocks=1,
                    max_blocks=32,
                    nodes_per_block=1,
                    exclusive=False
                ),
            )
        ]
    )

However, I always get the message 'slurm: attempt to provision nodes by provider has failed'. I've noticed that the submission script uses the SBATCH option 'ntasks-per-node=1', which raises two problems:

1. I would like this config to request exactly 16 CPUs (as both a minimum and a maximum), since I'm focusing on memory allocation per worker; however, my SlurmProvider does not have an option to set cores_per_node:


parsl.version.VERSION
Out[7]: '0.8.0'

SlurmProvider(cores_per_node=1)
Traceback (most recent call last):

  File "<ipython-input-8-934fb85d86a2>", line 1, in <module>
    SlurmProvider(cores_per_node=1)

TypeError: __init__() got an unexpected keyword argument 'cores_per_node'
  2. I'm pretty sure that for non-exclusive jobs the SBATCH option should be --ntasks= rather than --ntasks-per-node=, correct?

benclifford commented 4 years ago

For point 1, cores_per_node arrived in commit 636ea5986fe5d211ab3216b16fce6c22f8bf5554 a few days ago and hasn't been in any release yet. To get it, you could install the latest Parsl from master, using something like pip install git+https://github.com/parsl/parsl.
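
For reference, a rough sketch of how that keyword could be used once installed from master (untested here; the partition, account, and memory values are simply copied from the config earlier in this thread):

    from parsl.providers import SlurmProvider
    from parsl.launchers import SrunLauncher

    # Sketch only: cores_per_node (new in master, post-0.8.0) asks Slurm for
    # 16 cores on a non-exclusive node; account/partition copied from above.
    provider = SlurmProvider(
        'large-shared',
        cores_per_node=16,
        launcher=SrunLauncher(),
        scheduler_options='#SBATCH -A DDP140 --mem=361G',
        nodes_per_block=1,
        exclusive=False,
    )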

ZhuozhaoLi commented 4 years ago

I think this is related to #660. Perhaps a quick way to deal with that is to put any other Slurm options, like #SBATCH --ntasks-per-node=X, in scheduler_options to override the default one.
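
As a rough sketch of that workaround (untested; the account and memory values are the ones from the config above), the extra directive would simply be appended to the scheduler_options string:

    # Sketch of the workaround suggested above: add the extra directive to
    # scheduler_options so it overrides the generated --ntasks-per-node=1 line,
    # then pass this string to SlurmProvider(scheduler_options=...).
    scheduler_options = (
        '#SBATCH -A DDP140 --mem=361G\n'
        '#SBATCH --ntasks-per-node=16'
    )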

annawoodard commented 4 years ago

@ZhuozhaoLi I think this has been addressed in the latest commits and would recommend trying the latest first with pip install git+https://github.com/parsl/parsl as @benclifford suggested. @MatthewBM Let us know if that doesn't work for you as it should be addressed on our end.

MatthewBM commented 4 years ago

Hello @benclifford @ZhuozhaoLi @annawoodard Thank you for your response.

Yes, I can request the nodes successfully now, but unfortunately, since updating to the most recent commit, I'm getting BadRegistration errors no matter what Config I use, including the one at the top of this thread. I've attached the log from the runinfo directory on the local computer.

parsl.zip

yadudoc commented 4 years ago

Can you check whether the Parsl version on your system matches the one on Comet? This looks like a version mismatch.
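
One quick way to check is to print the version in the worker environment on both machines and compare, e.g.:

    # Run this in the same conda env on the local machine and on Comet;
    # the two printed versions must match for managers to register.
    import parsl
    print(parsl.version.VERSION)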

MatthewBM commented 4 years ago

Ok thanks @yadudoc

I can get the node to start now, but the job is cancelled soon after and no workers connect, using the following config:


    config_comet_compute = Config(
        executors=[
            HighThroughputExecutor(
                cores_per_worker=12,
                mem_per_worker=62,
                label='Comet_HTEX_fullcompute',
                address=address_by_query(),
                worker_logdir_root= '/home/mmadany/parsl_jobs',
                working_dir = '/oasis/scratch/comet/mmadany/temp_project/parsl',
                worker_port_range=(63105, 63114),
                #worker_ports='63105',
                provider=SlurmProvider(
                    'shared',
                    channel=SSHChannel(
                        hostname='comet.sdsc.xsede.org',
                        username='mmadany',                                  # <--- Update here
                        script_dir='/home/mmadany/parsl_scripts',   # <--- Update here
                    ),
                    cores_per_node=12,
                    launcher=SrunLauncher(),
                    # string to prepend to #SBATCH blocks in the submit
                    # script to the scheduler
                    scheduler_options='#SBATCH -A ddp140 --mem=63G',
                    # Command to be run before starting a worker, such as:
                    # 'module load Anaconda; source activate parsl_env'.
                    worker_init='echo "hello comet"; source /home/mmadany/.bashrc; conda activate parsl; nc -z -v -w5 pestilence.crbs.ucsd.edu 63105',    # <--- Update here to how you setup your env
                    walltime='12:00:00',
                    init_blocks=1,
                    max_blocks=1,
                    nodes_per_block=1,
                    exclusive=False
                ),
                #controller=Controller(public_ip=address_by_query(),port="63105"),
            )
        ]
    )

MatthewBM commented 4 years ago

From that run: parsl (2).zip

ZhuozhaoLi commented 4 years ago

Can you also paste the interchange.log? It should be in the runinfo/XXX/Comet_HTEX_fullcompute directory.

MatthewBM commented 4 years ago

I'm only seeing a manager.log in that directory: manager.log

ZhuozhaoLi commented 4 years ago

interchange.log should be on your local machine. manager.log should be on comet.

MatthewBM commented 4 years ago

Here it is: interchange.log

ZhuozhaoLi commented 4 years ago

Looking at the log, it seems the interchange never received any message from the compute nodes on Comet.

Are you sure the address 67.58.56.44:63107 is reachable from the compute nodes?
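
One way to test that, as a rough sketch to run from an interactive shell on a Comet compute node (the address and port below are the ones reported in the interchange.log above):

    # Quick reachability probe: try to open a TCP connection from a Comet
    # compute node back to the interchange address/port from interchange.log.
    import socket

    addr, port = '67.58.56.44', 63107
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(5)
    print('reachable' if s.connect_ex((addr, port)) == 0 else 'not reachable')
    s.close()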

MatthewBM commented 4 years ago

I'm not sure; it might be that an older run is still active. I do call parsl.clear() as well as restarting my Python kernel before loading a new config, but my jobs on Comet are usually still running, so I'm not sure I'm cleaning up correctly before trying again.
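
For reference, a minimal sketch of a fuller teardown before loading a new config, assuming a DataFlowKernel is currently loaded (as I understand it, cleanup() also asks the provider to cancel outstanding blocks, which should stop the jobs left running on Comet):

    # Sketch: shut down the loaded DataFlowKernel (which scales in / cancels
    # outstanding blocks) before clearing the config and loading a new one.
    import parsl

    parsl.dfk().cleanup()
    parsl.clear()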

MatthewBM commented 4 years ago

I went back to the Parsl 0.8.0 release and didn't have any issues. I'll try using non-exclusive nodes again when 0.9.0 is released and re-open this issue if I find it's still a problem there.

MatthewBM commented 4 years ago

Hi @benclifford @yadudoc @annawoodard, I tried this again; here's my config in Parsl 0.9.0:

config_comet_gpu = Config(
        executors=[
            HighThroughputExecutor(
                storage_access=default_staging + [GlobusStaging(
                    endpoint_uuid=sdsc_uuid,
                    endpoint_path="/",
                    local_path="/")],
                cores_per_worker=7,
                label='Comet_HTEX_fullcompute',
                address=address_by_query(),
                worker_logdir_root= '/home/mmadany/parsl_jobs',
                working_dir = '/oasis/scratch/comet/mmadany/temp_project/parsl',
                provider=SlurmProvider(
                    'gpu-shared',
                    channel=SSHChannel(
                        hostname='comet.sdsc.xsede.org',
                        username='mmadany',                                  # <--- Update here
                        script_dir='/home/mmadany/parsl_scripts',   # <--- Update here
                    ),
                    launcher=SrunLauncher(),
                    # string to prepend to #SBATCH blocks in the submit
                    # script to the scheduler
                    scheduler_options='#SBATCH -A ddp140 --gres=gpu:p100:1 --ntasks-per-node=7',
                    # Command to be run before starting a worker, such as:
                    # 'module load Anaconda; source activate parsl_env'.
                    worker_init='echo "hello comet"; source /home/mmadany/.bashrc; conda activate parsl; nc -z -v -w5 pestilence.crbs.ucsd.edu; parsl-globus-auth; globus endpoint activate %s' % pest_uuid,    # <--- Update here to how you setup your env
                    walltime='47:00:00',
                    init_blocks=1,
                    max_blocks=1,
                    nodes_per_block=1,
                    exclusive=False
                ),
                #controller=Controller(public_ip=address_by_query(),port="63105"),
            )
        ]
    )

My error is that I'm running out of memory on the one GPU I have assigned. After reading over the logs, it looks like Parsl assigned 4 workers to that node, perhaps because the node has 28 CPUs even though only 7 are available per node/block given that config and partition. The four workers then all ran on the one GPU and overloaded its memory.
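
(For context, a hedged sketch of one possible mitigation, not verified on Comet: HighThroughputExecutor's max_workers parameter caps how many workers start per node, so setting it to 1 on a gpu-shared allocation keeps a single worker on the one allocated GPU. The other fields from the config above are omitted here for brevity.)

    from parsl.executors import HighThroughputExecutor
    from parsl.providers import SlurmProvider

    # Sketch only: cap the executor at one worker per node so a single process
    # uses the single allocated GPU, regardless of how many cores the node has.
    htex = HighThroughputExecutor(
        label='Comet_HTEX_fullcompute',
        cores_per_worker=7,
        max_workers=1,
        provider=SlurmProvider('gpu-shared', nodes_per_block=1, exclusive=False),
    )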