NeoGeographyToolkit / StereoPipeline

The NASA Ames Stereo Pipeline is a suite of automated geodesy & stereogrammetry tools designed for processing planetary imagery captured from orbiting and landed robotic explorers on other planets.
Apache License 2.0

Issue with Tcl install and ssh #362

Closed (adehecq closed this issue 2 years ago)

adehecq commented 2 years ago

Describe the bug: When running parallel_stereo with the option --nodes-list $PBS_NODEFILE, the program fails during correlation with the following two errors:

Error 1:

application-specific initialization failed: Can't find a usable init.tcl in the following directories: /home/oalexan1/miniconda3/envs/basepython/lib/tcl8.6 /usr/lib/tcl8.6 /lib/tcl8.6 /usr/library /library /tcl8.6.10/library /tcl8.6.10/library

Error 2:

ssh: symbol lookup error: ssh: undefined symbol: EVP_KDF_ctrl, version OPENSSL_1_1_1b

The program runs normally without the --nodes-list option.

To Reproduce: I can provide the input data if needed, but the full command is:

parallel_stereo --threads-multiprocess 8 --threads-singleprocess 16 --processes 2 \
  -t nadirpinhole --alignment-method affineepipolar --stereo-algorithm 2 \
  --corr-tile-size 5000 --corr-memory-limit-mb 32000 --corr-kernel 7 7 \
  --xcorr-threshold 0 --num-matches-from-disp-triplets 30000 \
  img1 img2 cam1 cam2 --stop-point 5 --nodes-list $PBS_NODEFILE

Expected behavior: No error should be raised and the program should continue until completion.


Your Environment (please complete the following information):

ls -l /usr/lib64/libcrypto.so* yields

lrwxrwxrwx 1 root root 19 Nov 12 09:56 /usr/lib64/libcrypto.so -> libcrypto.so.1.1.1k
lrwxrwxrwx 1 root root 19 Nov 12 09:56 /usr/lib64/libcrypto.so.1.1 -> libcrypto.so.1.1.1k
-rwxr-xr-x 1 root root 3079592 Nov 12 09:56 /usr/lib64/libcrypto.so.1.1.1k

Additional context: After some email exchanges with Oleg, several things have been tested:

Regarding error 1:

Regarding error 2, we tried:

oleg-alexandrov commented 2 years ago

Thank you for raising this. Yeah, issue 1 has been solved in the latest build for a while.

I put in a fix for issue 2. This was quite time-consuming to debug, and I also had to get a 10 GB ISO image to make a CentOS 8 VM on which to reproduce this.

The problem seems to be traceable to the fact that ASP's wrapper shell scripts set up LD_LIBRARY_PATH to ensure our libs are used with our tools. But then GNU parallel is called, which starts ssh, which does not like our libs. The fix is to temporarily hide our libs when running ssh, then get them back in the child processes launched by ssh.

Not too clever, but because the fix is done in parallel_stereo itself rather than in our wrapper scripts, it will work with ASP's conda distribution too.
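For readers who hit the same problem in their own scripts, here is a minimal sketch of the idea in Python. It is not the actual parallel_stereo code: the function names (`env_without_libs`, `wrap_remote_command`) and the example paths are made up for illustration. The launcher (ssh, or GNU parallel calling ssh) is started with LD_LIBRARY_PATH removed from its environment, and the command string that ends up running on the other side gets a prefix that puts the variable back.

```python
import os
import shlex


def env_without_libs(env, var="LD_LIBRARY_PATH"):
    """Return (clean_env, saved_value).

    clean_env is a copy of env with `var` removed, meant for the process
    that starts ssh. saved_value is the original setting, or None if the
    variable was not set. (Hypothetical helper, for illustration only.)
    """
    clean = dict(env)
    saved = clean.pop(var, None)
    return clean, saved


def wrap_remote_command(cmd, saved, var="LD_LIBRARY_PATH"):
    """Prefix cmd so that the child shell restores `var` before running it."""
    if not saved:
        return cmd
    return "export {}={}; {}".format(var, shlex.quote(saved), cmd)


if __name__ == "__main__":
    # Hide the bundled libraries from the launcher, restore them for the tool.
    clean_env, saved = env_without_libs(os.environ)
    remote_cmd = wrap_remote_command("stereo_corr --help", saved)
    print(remote_cmd)
    # One would then start the launcher with the cleaned environment, e.g.
    # subprocess.run(["ssh", host, remote_cmd], env=clean_env)
    # where `host` is a placeholder for a node name.
```

The saved value is passed inside the command string rather than through the environment, so it does not matter whether ssh forwards environment variables to the remote side.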

Now, this issue is bigger than ASP itself. Conda users will run into this even with other tools (https://github.com/conda/conda/issues/10241).

I also applied the fix to parallel_bundle_adjust, mapproject, and parallel_sfs.

The nightly build at https://github.com/NeoGeographyToolkit/StereoPipeline/releases will have this with build date 2022-04-08 or later.