NeoGeographyToolkit / StereoPipeline

The NASA Ames Stereo Pipeline is a suite of automated geodesy & stereogrammetry tools designed for processing planetary imagery captured from orbiting and landed robotic explorers on other planets.
Apache License 2.0

Question: Proper parallel_stereo parameterization for single node and multiple nodes #375

Closed: jlaura closed this issue 1 year ago

jlaura commented 1 year ago

I am struggling to get a working parallel_stereo call on a new cluster. Specifically, I am seeing ssh login errors related to the number of jobs and ssh connections being opened. Before I open a specific issue, I thought it best to request a bit more information about how parallel_stereo uses these parameters. Each node has 40 physical cores (80 hyper-threaded).

My SBATCH commands look as follows:

```
#SBATCH --cpus-per-task=80
#SBATCH --ntasks=1
#SBATCH --nodes=2
```

To generate the node list to pass to parallel_stereo I am using `scontrol show hostnames $SLURM_JOB_NODELIST > nodelist.lis`, which gives me the required list of nodes.
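For reference, this is roughly how that line sits inside my batch script (the `cat` is just a sanity check I added, not something required by parallel_stereo):

```bash
# Expand the compact SLURM node list (e.g. "node[01-02]") into one
# hostname per line, which is the format --nodes-list expects.
scontrol show hostnames "$SLURM_JOB_NODELIST" > nodelist.lis
cat nodelist.lis   # sanity check: one hostname per allocated node
```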

I am running steps 1-2 and 5 in serial. Below is my call. Given the CPU layout of the cluster, what is parallel_stereo expecting to see for --processes, --threads-multiprocess, and --threads-singleprocess? If I set these to the number of physical cores (or hyper-threaded cores), I get ssh errors, even when setting the ssh config to not overload ssh.

```bash
parallel_stereo --nodes-list=nodelist.lis --entry-point 3 --stop-point 5 $L $R \
    -s ${config} ${odir}/${prefix}_ba_map --bundle-adjust-prefix adjust/ba # \
    --processes 1 --threads-multiprocess 2 --threads-singleprocess 2
```
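For what it's worth, my current guess at a sane parameterization, assuming (and this is only my assumption, not something I have confirmed in the docs) that --processes times --threads-multiprocess should roughly match the physical cores on each node, would look like this:

```bash
# Hypothetical settings for 40 physical cores per node (sketch only, untested):
# a few heavyweight processes per node keeps the ssh connection count low,
# while processes * threads-multiprocess roughly equals the physical cores.
parallel_stereo --nodes-list=nodelist.lis --entry-point 3 --stop-point 5 $L $R \
    -s ${config} ${odir}/${prefix}_ba_map --bundle-adjust-prefix adjust/ba \
    --processes 4 --threads-multiprocess 10 --threads-singleprocess 40
```

Is that the intended way to balance these three options, or should they be set relative to the hyper-threaded core count instead?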

Thanks @oleg-alexandrov or @rbeyer for any insight. I can post specific error messages if a concrete "here is the exact error for this setup" would be helpful.

jlaura commented 1 year ago

Closing because I realized this repo has a Q&A discussion section!

oleg-alexandrov commented 1 year ago

Continued at https://github.com/NeoGeographyToolkit/StereoPipeline/discussions/379.