Open MatthewBM opened 4 years ago
For point 1, cores_per_node arrived in commit 636ea5986fe5d211ab3216b16fce6c22f8bf5554 a few days ago and hasn't been in any release yet. To get it, you could install the latest parsl from master, using something like pip install git+https://github.com/parsl/parsl.
I think this is related to #660. Perhaps a quick way to deal with that is to put any other Slurm options, like #SBATCH --ntasks-per-node=X, in the scheduler_options to override the default one.
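Something along these lines (a minimal sketch; the partition name and task count are placeholders, not values from this thread):

from parsl.providers import SlurmProvider

# Sketch: extra #SBATCH directives given in scheduler_options are added to
# the generated submit script, which is the workaround suggested here for
# controlling ntasks-per-node.
provider = SlurmProvider(
    'shared',                                          # placeholder partition
    scheduler_options='#SBATCH --ntasks-per-node=16',  # placeholder value
    nodes_per_block=1,
)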
@ZhuozhaoLi I think this has been addressed in the latest commits and would recommend trying the latest first with pip install git+https://github.com/parsl/parsl
as @benclifford suggested. @MatthewBM Let us know if that doesn't work for you as it should be addressed on our end.
Hello @benclifford @ZhuozhaoLi @annawoodard Thank you for your response.
Yes I can request the nodes successfully now but unfortunately since I've updated to the most recent commit, I'm getting BadRegistration errors no matter what Config I use, including the one I have at the top of this thread. I've attached the log from the runinfo on the local computer.
Can you check whether the version numbers match on your system and on Comet? It looks like a version mismatch.
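A quick way to compare them (a small sketch; run it both on the local machine and on Comet, inside the environment that worker_init activates, and compare the two outputs):

import sys
import parsl

# Print the interpreter path and the Parsl version so the two environments
# can be compared side by side.
print(sys.executable, parsl.__version__)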
Ok thanks @yadudoc
I'm getting the node to run, but it cancels soon after and no workers connect, with the following config:
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider
from parsl.channels import SSHChannel
from parsl.launchers import SrunLauncher
from parsl.addresses import address_by_query

config_comet_compute = Config(
    executors=[
        HighThroughputExecutor(
            cores_per_worker=12,
            mem_per_worker=62,
            label='Comet_HTEX_fullcompute',
            address=address_by_query(),
            worker_logdir_root='/home/mmadany/parsl_jobs',
            working_dir='/oasis/scratch/comet/mmadany/temp_project/parsl',
            worker_port_range=(63105, 63114),
            # worker_ports='63105',
            provider=SlurmProvider(
                'shared',
                channel=SSHChannel(
                    hostname='comet.sdsc.xsede.org',
                    username='mmadany',  # <--- Update here
                    script_dir='/home/mmadany/parsl_scripts',  # <--- Update here
                ),
                cores_per_node=12,
                launcher=SrunLauncher(),
                # string to prepend to #SBATCH blocks in the submit
                # script to the scheduler
                scheduler_options='#SBATCH -A ddp140 --mem=63G',
                # Command to be run before starting a worker, such as:
                # 'module load Anaconda; source activate parsl_env'.
                worker_init='echo "hello comet"; source /home/mmadany/.bashrc; conda activate parsl; nc -z -v -w5 pestilence.crbs.ucsd.edu 63105',  # <--- Update here to how you set up your env
                walltime='12:00:00',
                init_blocks=1,
                max_blocks=1,
                nodes_per_block=1,
                exclusive=False,
            ),
            # controller=Controller(public_ip=address_by_query(), port="63105"),
        )
    ]
)
From that run: parsl (2).zip
Can you also paste the interchange.log? It should be in the runinfo/XXX/Comet_HTEX_fullcompute directory.
I'm only seeing a manager.log in that directory: manager.log
interchange.log should be on your local machine. manager.log should be on Comet.
Here it is! interchange.log
Looking at the log, it seems the interchange never received any message from the compute nodes on comet.
Are you sure the address 67.58.56.44:63107 is connectable from the compute nodes?
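One rough way to check (a sketch only; run it from a Comet compute node, for example in an interactive srun session, using the address and port reported in your interchange.log):

import socket

# Try to open a TCP connection from the compute node back to the interchange
# address reported in interchange.log for this run.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
try:
    sock.connect(('67.58.56.44', 63107))
    print('reachable')
except OSError as exc:
    print('not reachable:', exc)
finally:
    sock.close()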
I'm not sure; an older run might still be active. I do parsl.clear() and restart my Python kernel before loading a new config, but my jobs on Comet are usually still running, so I'm not sure I'm clearing correctly before trying again.
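Here is roughly what I run between attempts (a sketch only, assuming the previous DataFlowKernel is still loaded in the session; I'm not certain cleanup() cancels the Slurm blocks in this version):

import parsl

# Shut down the still-loaded DataFlowKernel before clearing it; depending on
# the Parsl version this should also scale in / cancel the provider's blocks.
dfk = parsl.dfk()
dfk.cleanup()
parsl.clear()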
I went back to the Parsl 8.0 release and didn't have any issues. I'll try using exclusive nodes again when 9.0 is released and re-open this issue if I find it's still a problem there.
Hi @benclifford @yadudoc @annawoodard, I tried this again; here's my config in parsl 9.0:
config_comet_gpu = Config(
    executors=[
        HighThroughputExecutor(
            storage_access=default_staging + [GlobusStaging(
                endpoint_uuid=sdsc_uuid,
                endpoint_path="/",
                local_path="/",
            )],
            cores_per_worker=7,
            label='Comet_HTEX_fullcompute',
            address=address_by_query(),
            worker_logdir_root='/home/mmadany/parsl_jobs',
            working_dir='/oasis/scratch/comet/mmadany/temp_project/parsl',
            provider=SlurmProvider(
                'gpu-shared',
                channel=SSHChannel(
                    hostname='comet.sdsc.xsede.org',
                    username='mmadany',  # <--- Update here
                    script_dir='/home/mmadany/parsl_scripts',  # <--- Update here
                ),
                launcher=SrunLauncher(),
                # string to prepend to #SBATCH blocks in the submit
                # script to the scheduler
                scheduler_options='#SBATCH -A ddp140 --gres=gpu:p100:1 --ntasks-per-node=7',
                # Command to be run before starting a worker, such as:
                # 'module load Anaconda; source activate parsl_env'.
                worker_init='echo "hello comet"; source /home/mmadany/.bashrc; conda activate parsl; nc -z -v -w5 pestilence.crbs.ucsd.edu; parsl-globus-auth; globus endpoint activate %s' % pest_uuid,  # <--- Update here to how you set up your env
                walltime='47:00:00',
                init_blocks=1,
                max_blocks=1,
                nodes_per_block=1,
                exclusive=False,
            ),
            # controller=Controller(public_ip=address_by_query(), port="63105"),
        )
    ]
)
My error is that I'm running out of memory on the one GPU I have assigned. After reading over the logs, it looks like Parsl assigned 4 workers to that node, presumably because the node has 28 CPUs and cores_per_worker is 7, even though only 7 cores are available per node/block given that config and partition. The four workers then all ran on the one GPU and overloaded its memory.
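If I understand the worker sizing correctly, one way I could cap this would be something like the following (a sketch only; I'm assuming HighThroughputExecutor's max_workers option limits the number of workers per node, and I've trimmed the other fields from my config above):

from parsl.executors import HighThroughputExecutor

# Assumption: max_workers caps how many workers each node spawns, so a single
# worker owns the single GPU in a gpu-shared allocation instead of four
# workers (28 cores / cores_per_worker=7) sharing it.
executor = HighThroughputExecutor(
    label='Comet_HTEX_fullcompute',
    cores_per_worker=7,
    max_workers=1,
    # ... provider, address, and the other options from the config above ...
)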
Hello,
I am trying to request nodes with the following config:
However, I always get the message 'slurm: attempt to provision nodes by provider has failed'. I've noticed that the submission script uses the SBATCH option 'ntasks-per-node=1', which has two problems:
1: I would like it to request a minimum and maximum of 16 CPUs with that config, because I'm focusing on memory allocation per worker; however, my SlurmProvider does not have an option to set cpus_per_node (see the sketch below for what I'm aiming for).
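For illustration, this is roughly what I would like to be able to express (a sketch only; cpus_per_node is not an option on SlurmProvider, the cores_per_node option mentioned in the first reply isn't in a release yet, and the values here are just what I'm aiming for):

from parsl.providers import SlurmProvider
from parsl.launchers import SrunLauncher

# Sketch of the intent: pin each block to exactly 16 CPUs so that memory per
# worker, not core count, drives the sizing. cores_per_node is the
# not-yet-released option from the first reply.
provider = SlurmProvider(
    'shared',
    cores_per_node=16,
    scheduler_options='#SBATCH --mem=63G',
    launcher=SrunLauncher(),
    nodes_per_block=1,
    init_blocks=1,
    max_blocks=1,
    exclusive=False,
)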