The NASA Ames Stereo Pipeline is a suite of automated geodesy & stereogrammetry tools designed for processing planetary imagery captured from orbiting and landed robotic explorers on other planets.
Apache License 2.0
Question: Proper parallel_stereo parameterization for single node and multiple nodes #375
I am struggling to get a parallel_stereo call working on a new cluster. Specifically, I am seeing ssh login errors related to the number of jobs and ssh connections being opened. Before I open a specific issue, I thought it best to ask for a bit more information about how parallel_stereo uses its parameters. Each node has 40 physical cores (80 with hyper-threading).
My SBATCH commands look as follows:
#SBATCH --cpus-per-task=80
#SBATCH --ntasks=1
#SBATCH --nodes=2
To generate the nodes list to be passed to parallel_stereo, I am using:
scontrol show hostnames $SLURM_JOB_NODELIST > nodelist.lis
which correctly produces the required list of nodes.
I am running steps 1-2 and 5 in serial. Below is my call. Given the CPU layout of the cluster, what does parallel_stereo expect for processes, threads-multiprocess, and threads-singleprocess? If I set these to the number of physical cores (or hyper-threaded cores), I get ssh errors, even when the ssh config is set to avoid overloading ssh.
Thanks @oleg-alexandrov or @rbeyer for any insight. I can post specific error messages if a more concrete 'here is the exact error for this setup' is helpful.
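To make the question concrete, here is a minimal sketch of the arithmetic I have in mind. It assumes that processes is a per-node count and that processes times threads-multiprocess should not exceed the physical cores on a node; that is my assumption, not something I have confirmed. The helper procs_per_node and the value of 8 threads are hypothetical and only illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split a node's physical cores into processes x threads.
# Assumption (unconfirmed): processes * threads-multiprocess <= physical cores per node.
procs_per_node() {
  local cores=$1 threads=$2
  echo $(( cores / threads ))   # integer division; any remainder cores stay idle
}

PHYSICAL_CORES=40   # from the post: 40 physical cores per node
THREADS_MULTI=8     # illustrative choice, not a recommendation
PROCESSES=$(procs_per_node "$PHYSICAL_CORES" "$THREADS_MULTI")

# Print the flags that would be appended to the parallel_stereo call.
echo "--processes $PROCESSES --threads-multiprocess $THREADS_MULTI --threads-singleprocess $PHYSICAL_CORES"
```

With these numbers each node would run 5 processes of 8 threads, which is far fewer simultaneous jobs than setting processes to 40 or 80. Whether that is the intended way to size these options is exactly what I am asking.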