In the new Docker Swarm-based infrastructure, the SSH connection breaks at random, which causes the search daemons to restart. Does anyone know how to narrow the problem down further?
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [1,2]<stdout>:[ 0] (in memo) prefix/VAR/BASE (pf=138766737, ipf=1.86)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [1,2]<stdout>:[ 1] (in memo) prefix/NUM/SUPSCRIPT (pf=6633044, ipf=4.90)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [1,2]<stdout>:[ 2] (on disk) prefix/VAR/BASE/HANGER (pf=9006427, ipf=4.60)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [1,2]<stdout>:[ 3] (in memo) prefix/NUM/SUPSCRIPT/HANGER (pf=138766395, ipf=1.86)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [1,2]<stdout>:[ 4] (on disk) prefix/VAR/BASE/HANGER/SIGN (pf=6569091, ipf=4.91)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [1,2]<stdout>:[ 5] (on disk) prefix/VAR/BASE/HANGER/TIMES (pf=8928046, ipf=4.60)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [1,2]<stdout>:[ 6] (on disk) prefix/NUM/SUPSCRIPT/HANGER/SIGN (pf=15976531, ipf=4.02)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [ 7] (on disk) prefix/VAR/BASE/HANGER/SIGN/ADD (pf=30797021, ipf=3.37)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [ 8] (on disk) prefix/VAR/BASE/HANGER/TIMES/SIGN (pf=3410477, ipf=5.57)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [ 9] (on disk) prefix/NUM/SUPSCRIPT/HANGER/SIGN/ADD (pf=11096355, ipf=4.39)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>: [ 10] (on disk) prefix/VAR/BASE/HANGER/TIMES/SIGN/ADD (pf=9205922, ipf=4.57)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,1]<stdout>:merge time cost: 2411 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,0]<stdout>:merge time cost: 3240 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,1]<stdout>:Query handle cost: 3390 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,1]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,3]<stdout>:merge time cost: 4611 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>:merge time cost: 13286 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,3]<stdout>:Query handle cost: 13932 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,3]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>:Query handle cost: 13937 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,0]<stdout>:Query handle cost: 13984 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,0]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | Connection to blue-shard3 closed by remote host.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | --------------------------------------------------------------------------
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | ORTE was unable to reliably start one or more daemons.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | This usually is caused by:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 |
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | * not finding the required libraries and/or binaries on
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | one or more nodes. Please check your PATH and LD_LIBRARY_PATH
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | settings, or configure OMPI with --enable-orterun-prefix-by-default
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 |
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | * lack of authority to execute on one or more specified nodes.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | Please verify your allocation and authorities.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 |
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | Please check with your sys admin to determine the correct location to use.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 |
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | * compilation of the orted with dynamic libraries when static are required
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | (e.g., on Cray). Please check your configure cmd line and consider using
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | one of the contrib/platform definitions for your system type.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 |
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | * an inability to create a connection back to mpirun due to a
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | lack of common network interfaces and/or no route found between
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | them. Please check network connectivity (including firewalls
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | and network routing requirements).
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | --------------------------------------------------------------------------
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,2]<stdout>:node[2] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,1]<stdout>:node[1] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,3]<stdout>:node[3] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,0]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,0]<stdout>:shutdown httpd...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,0]<stdout>:node[0] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | [1,0]<stderr>:Caught signal: 15
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6 | + set +x