ComputationalRadiationPhysics / isaac

In Situ Animation of Accelerated Computations :microscope:
http://ComputationalRadiationPhysics.github.io/isaac/
GNU Lesser General Public License v3.0

isaac server does not work on taurus #84

Open PrometheusPi opened 5 years ago

PrometheusPi commented 5 years ago

I compiled isaac following @FelixTUD's instructions. No errors occurred during the build, and the isaac server runs on the compute nodes. The server listens on all three ports (...58=tcp, ...59=web, ...60=sim) and they seem to work:

However, the simulation never shows up in the client - thus observation is not possible.

Doing the same with a server compiled on hypnos and port forwarding the sim port from PIConGPU to this server works flawlessly.

ax3l commented 5 years ago

> connections from all three ports are correctly opened

Where did you connect from? Locally? Opening a port on localhost/loopback is different from opening a remotely accessible port. You might want to use SSH port forwarding (the -L option) to your local machine to make the port reachable by a locally running browser client.

# example, if running on hypnos5 on localhost
ssh -p 22 -L 2459:hypnos5:2459 myUserName@uts.fz-rossendorf.de -N

# connect in your browser now to
# localhost:2459
# http://laser.plasma.ninja/isaac/interface.htm

We might have to hop twice here (login node, then head-node and on that one to its localhost):

# $HOME/.ssh/config
Host taurushop
    Hostname tauruslogin6.hrsk.tu-dresden.de
    ProxyCommand ssh sYourNumber@login2.zih.tu-dresden.de -W %h:22
    User sYourNumber
    # ForwardX11 yes
    # speeds up rsync tremendously
    #ControlMaster auto
    #ControlPath /home/richard/.ssh/master-%r@%h:%p
    #ControlPersist yes

then:

ssh -p 22 -L 2459:localhost:2459 taurushop -N

> the client (interface.htm) clearly identifies the web port as correct while others are identified as not correct - thus this appears to work too.

I don't understand. The HTML client only connects to the web-port of the server. It doesn't interact with the tcp port. The sim port is for the PIConGPU plugin to connect to: https://computationalradiationphysics.github.io/isaac/doc/server/index.html

> However, the simulation never shows up in the client - thus observation is not possible.

That's the actual issue. Please try port forwarding and check whether your browser shows any warnings or errors (also in the browser console). Also make sure that neither isaac-server nor the ssh forwarding command prints errors or warnings while you connect. If we see errors here, we might need to ask for an exception to allow ssh-based port tunneling.

It could also just be that isaac-server and the simulation did not find each other. Are you sure that the isaac server runs on the same headnode that the PIConGPU plugin option --isaac.url <hostname> points to? Don't use a generic hostname, since those use DNS load balancing. isaac server prints a message when a simulation connects on its sim-port. Did you get that one?
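A quick way to tell a loopback-only port from a remotely reachable one is to probe it from the machine where the browser or the simulation runs. The following is only a sketch: it assumes bash and the coreutils `timeout` command are available, and the function names `port_open` and `check_ports` are made up for this illustration, they are not part of ISAAC.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ISAAC).

# Return 0 if a TCP connection to host:port can be opened within 2 seconds.
# Uses bash's built-in /dev/tcp pseudo-device, so no extra tools are needed.
port_open() {
    local host=$1 port=$2
    timeout 2 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Print "host:port open" or "host:port closed" for every port given.
check_ports() {
    local host=$1
    shift
    local port
    for port in "$@"; do
        if port_open "$host" "$port"; then
            echo "${host}:${port} open"
        else
            echo "${host}:${port} closed"
        fi
    done
}
```

Calling `check_ports <hostname> <tcp-port> <web-port> <sim-port>` once on the node itself and once from the remote side shows whether a port is only reachable locally. An open port only proves reachability; it does not prove that the simulation has registered with the server.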

PrometheusPi commented 5 years ago

@ax3l I did port forward correctly (using multiple hops if necessary) from

Just to make absolutely sure that connections work, I spawned a firefox on taurus and accessed the server directly without any port forwarding. Same result (no simulation).

As written above, the connection worked and showed up in the server output correctly when connecting and when disconnecting.

The tcp port is used directly by neither the client nor the simulation, but it allows access to simulation metadata. It showed the correct simulation data. Thus simulation and server found each other.

Just to make sure the DNS name was resolved correctly, I tried the Ethernet IP, the hostname, and the hostname+zih-url. The result was always the same: all these connections showed up in the server log, but none showed the simulation.

The same port forwarding workflow works flawlessly when the server is running on hypnos.

ax3l commented 5 years ago

On which headnode of Taurus are you currently running? I will connect myself and verify, since this report is far from reproducible without any reference to the PIConGPU startup options, the scripts used, or the isaac server startup commands.

PrometheusPi commented 5 years ago

Currently I am not running the server on taurus anymore. Please do not run tests at the moment that might interfere with the running simulation, as this is a production run for next week's conference.

ax3l commented 5 years ago

Please document all startup commands and their output; otherwise remote people like @FelixTUD or @theZiz have zero chance of supporting you.

> Please do not run test

I don't want to start anything. I just want to check the connection to a running isaac server with a running PIConGPU connected to it, using my own scripts posted above. Do you mind repeating your test, e.g. on a single node, so one can take a look?

PrometheusPi commented 5 years ago

The setup without the final *.cfg can be found here.

Both @FelixTUD and @tdd11235813 were already informed about the current status of this issue via mail and in person. (This is just for documentation.)

I can start up the server for you - but I have no resources left to run a second PIConGPU simulation (even on a single node) to connect to the server for testing purposes.

ax3l commented 5 years ago

> Both @FelixTUD and @tdd11235813 were already informed about the current status of this issue via mail and in person. (This is just for documentation.)

Please say so next time, so people know that this can be ignored.

> but I have no resources left to run a second PIConGPU simulation (even on a single node)

Ah, too bad. A single node would be enough. Which headnode?

PrometheusPi commented 5 years ago

No headnode is used since they limit runtime. I used a compute node. Currently it is taurusi2084. Ports are 22458 through 22460.
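For reference, forwarding to a compute node instead of a headnode could look like the sketch below. It assumes the `taurushop` alias from the `$HOME/.ssh/config` entry earlier in this thread, that the compute node is reachable from that login node, and that 22459 is the web port (following the ...59=web pattern). The helper name `isaac_forward_cmd` is made up for this example; it only prints the command so it can be inspected before running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the ssh command that forwards one local port
# to the same port on a compute node, hopping through a jump host alias.
isaac_forward_cmd() {
    local jump=$1 node=$2 port=$3
    # -N: do not run a remote command, only forward the port
    # -L: local port forwarding, local_port:target_host:target_port
    echo "ssh -N -L ${port}:${node}:${port} ${jump}"
}

# Web port of the server on the compute node named above
isaac_forward_cmd taurushop taurusi2084 22459
```

This prints `ssh -N -L 22459:taurusi2084:22459 taurushop`. After running that command, the browser client would connect to localhost:22459.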

ax3l commented 5 years ago

Connection works. You can run a sample test-source on CPU (instead of PIConGPU):

https://github.com/ComputationalRadiationPhysics/isaac/tree/master/example

PrometheusPi commented 5 years ago

I built the example, but it crashes on start with a floating point exception (I have not yet updated the destination and the port):

Using name Example_936943
[tauruslogin6:5982 :0] Caught signal 8 (Floating point exception)
==== backtrace ====
 2 0x00000000000687bc mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3104/src/mxm/util/debug/debug.c:641
 3 0x0000000000068d0c mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3104/src/mxm/util/debug/debug.c:616
 4 0x0000000000035270 killpg()  ??:0
 5 0x0000000000428d75 main()  /home/s5960712/lib/isaac/example/example.cpp:311
 6 0x0000000000021c05 __libc_start_main()  ??:0
 7 0x0000000000426e39 _start()  ??:0
===================
Gleitkomma-Ausnahme  (German shell message: "Floating point exception")

Additionally, it looks like it uses CUDA. Might this cause the error, since I execute the test on a CPU-only node?

ax3l commented 5 years ago

Yes, just pass -DISAAC_CUDA=OFF -DISAAC_ALPAKA=ON -DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLE=ON to CMake.

PrometheusPi commented 5 years ago

With your extra flags I get the following error:

CMake Error at CMakeLists.txt:35 (ALPAKA_ADD_EXECUTABLE):
  Unknown CMake command "ALPAKA_ADD_EXECUTABLE".

(I tried this already via ccmake - got the same error.)

I am running on @FelixTUD's branch.

ax3l commented 5 years ago

Set -DALPAKA_ROOT= to the location of alpaka (I forgot to mention that). You may also need -DISAAC_DIR= pointing to the location of the ISAAC library (lib/), but the default should work.
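Put together, the CPU-only configure step for the example might look like the sketch below. The flags are the ones named in this thread; the alpaka path is a placeholder assumption, and the helper name `isaac_example_cmake_args` is made up for illustration. The script only prints the command rather than running CMake.

```shell
#!/usr/bin/env bash
# Placeholder path (assumption): adjust to where alpaka is checked out.
ALPAKA_ROOT="${ALPAKA_ROOT:-$HOME/src/alpaka}"

# Hypothetical helper: emit one CMake cache argument per line.
isaac_example_cmake_args() {
    printf '%s\n' \
        "-DISAAC_CUDA=OFF" \
        "-DISAAC_ALPAKA=ON" \
        "-DALPAKA_ACC_CPU_B_SEQ_T_SEQ_ENABLE=ON" \
        "-DALPAKA_ROOT=${ALPAKA_ROOT}"
}

# Show the full configure command, to be run from a build directory
# located directly inside the example source directory.
echo cmake $(isaac_example_cmake_args) ..
```

Keeping the arguments in one function makes it easy to add -DISAAC_DIR= later if the default lookup fails.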

PrometheusPi commented 5 years ago

This might be related to the libwebsocket issue #93, which was also seen on hemera. I will test this again.