approach0 / search-engine

A math-aware search engine.
http://approach0.xyz
MIT License

Docker Swarm MPI SSH connection unstable #32

Closed w32zhong closed 3 years ago

w32zhong commented 3 years ago

In the new infrastructure based on Docker Swarm, the SSH connection breaks at random, which causes the search daemons to restart. Does anyone know how to narrow the problem down further?

blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  0] (in memo) prefix/VAR/BASE (pf=138766737, ipf=1.86)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  1] (in memo) prefix/NUM/SUPSCRIPT (pf=6633044, ipf=4.90)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  2] (on disk) prefix/VAR/BASE/HANGER (pf=9006427, ipf=4.60)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  3] (in memo) prefix/NUM/SUPSCRIPT/HANGER (pf=138766395, ipf=1.86)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  4] (on disk) prefix/VAR/BASE/HANGER/SIGN (pf=6569091, ipf=4.91)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  5] (on disk) prefix/VAR/BASE/HANGER/TIMES (pf=8928046, ipf=4.60)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [1,2]<stdout>:[  6] (on disk) prefix/NUM/SUPSCRIPT/HANGER/SIGN (pf=15976531, ipf=4.02)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [  7] (on disk) prefix/VAR/BASE/HANGER/SIGN/ADD (pf=30797021, ipf=3.37)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [  8] (on disk) prefix/VAR/BASE/HANGER/TIMES/SIGN (pf=3410477, ipf=5.57)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [  9] (on disk) prefix/NUM/SUPSCRIPT/HANGER/SIGN/ADD (pf=11096355, ipf=4.39)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:        [ 10] (on disk) prefix/VAR/BASE/HANGER/TIMES/SIGN/ADD (pf=9205922, ipf=4.57)
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,1]<stdout>:merge time cost: 2411 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:merge time cost: 3240 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,1]<stdout>:Query handle cost: 3390 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,1]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,3]<stdout>:merge time cost: 4611 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:merge time cost: 13286 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,3]<stdout>:Query handle cost: 13932 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,3]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:Query handle cost: 13937 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:Query handle cost: 13984 msec.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | Connection to blue-shard3 closed by remote host.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | --------------------------------------------------------------------------
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | ORTE was unable to reliably start one or more daemons.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | This usually is caused by:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | * not finding the required libraries and/or binaries on
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   settings, or configure OMPI with --enable-orterun-prefix-by-default
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | * lack of authority to execute on one or more specified nodes.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   Please verify your allocation and authorities.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   Please check with your sys admin to determine the correct location to use.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | *  compilation of the orted with dynamic libraries when static are required
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   (e.g., on Cray). Please check your configure cmd line and consider using
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   one of the contrib/platform definitions for your system type.
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | 
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | * an inability to create a connection back to mpirun due to a
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   lack of common network interfaces and/or no route found between
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   them. Please check network connectivity (including firewalls
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    |   and network routing requirements).
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | --------------------------------------------------------------------------
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,2]<stdout>:node[2] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,1]<stdout>:node[1] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,3]<stdout>:node[3] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:shutdown httpd...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stdout>:node[0] closing index...
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | [1,0]<stderr>:Caught signal: 15
blue_mpirun.1.alj956wxat8g@calabash-admin-vJesW0sEF6    | + set +x
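
One way to narrow this down might be to summarize the service log: count which hosts drop the SSH connection and record the slowest query each MPI rank reported before the drop. The sketch below is only an illustration. It assumes the log lines look exactly like the excerpt above; the `summarize` function and the two regexes are hypothetical helpers, not part of approach0.

```python
import re
import sys
from collections import Counter

# Line shapes assumed from the log excerpt above (not from any documented format).
CLOSED_RE = re.compile(r"Connection to (\S+) closed by remote host\.")
COST_RE = re.compile(r"\[\d+,(\d+)\]<stdout>:Query handle cost: (\d+) msec\.")


def summarize(lines):
    """Count SSH drops per host and keep the slowest query time per MPI rank."""
    drops = Counter()
    slowest = {}
    for line in lines:
        closed = CLOSED_RE.search(line)
        if closed:
            drops[closed.group(1)] += 1
            continue
        cost = COST_RE.search(line)
        if cost:
            rank, msec = int(cost.group(1)), int(cost.group(2))
            slowest[rank] = max(msec, slowest.get(rank, 0))
    return drops, slowest


if __name__ == "__main__":
    # Reads the log from stdin, e.g. piped from the service log output.
    drops, slowest = summarize(sys.stdin)
    for host, count in drops.most_common():
        print(f"{host}: {count} dropped connection(s)")
    for rank in sorted(slowest):
        print(f"rank {rank}: slowest query {slowest[rank]} msec")
```

If the drops always hit the same shard, or always follow an unusually slow query, that would at least suggest where to look next. The script reads from stdin, so the service log can be piped into it.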
w32zhong commented 3 years ago

It seems this issue was also fixed by #33. Thank god.