RenderKit / ospray

An Open, Scalable, Portable, Ray Tracing Based Rendering Engine for High-Fidelity Visualization
http://ospray.org
Apache License 2.0

Running MPI examples #495

Closed: nyue closed this issue 2 years ago

nyue commented 2 years ago

Hi,

I have built the MPI examples and was wondering how to run them.

I have a cluster of 4 machines (pc1, pc2, pc3, pc4); the head node I am launching from is pc0.

Do I use mpirun, or are the examples designed to parse -host or -hostfile themselves?

Cheers

johguenther commented 2 years ago

Yes, via mpirun; for details see https://www.ospray.org/documentation.html#mpi-offload-rendering (and maybe also https://www.ospray.org/tutorials.html#mpi-distributed-tutorials).

nyue commented 2 years ago

I am still encountering some problems, so I am narrowing them down step by step.

I am using ospMPIDistribTutorialVolume as the example to test.

ospMPIDistribTutorialVolume works on the head node pc0 by itself (no mpirun).

pc0$ mpirun --host pc1,pc2,pc3,pc4 --mca btl_tcp_if_include 192.168.0.0/24 -x LD_LIBRARY_PATH /piconfs/systems/OSPray/head/bin/ospMPIDistribTutorialVolume --osp:load-modules=mpi --osp:device=mpiOffload

I get the following errors and want to confirm whether the volume example is expected to work.

Do I need to enable remote GL/EGL display?

OSPRay rank 1/4
OSPRay rank 0/4
OSPRay rank 3/4
OSPRay rank 2/4
terminate called after throwing an instance of 'std::runtime_error'
  what():  Failed to initialize GLFW!
[pc1:09272] *** Process received signal ***
[pc1:09272] Signal: Aborted (6)
[pc1:09272] Signal code:  (-6)
[pc1:09272] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 9272 on node pc1 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
Twinklebear commented 2 years ago

Hi @nyue, I think the issue here is that you're running rank 0 (which will try to open the window) on pc1, not pc0. Could you try:

mpirun --host pc0,pc1,pc2,pc3,pc4 --mca btl_tcp_if_include 192.168.0.0/24 -x LD_LIBRARY_PATH /piconfs/systems/OSPray/head/bin/ospMPIDistribTutorialVolume

Also, for the MPI distributed applications you don't need to pass --osp:load-modules=mpi --osp:device=mpiOffload, as they explicitly load the MPI module (https://github.com/ospray/ospray/blob/master/modules/mpi/tutorials/ospMPIDistribTutorialSpheres.cpp#L54) and use the mpiDistributed device for data-parallel rendering (https://github.com/ospray/ospray/blob/master/modules/mpi/tutorials/ospMPIDistribTutorialSpheres.cpp#L57).
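
For reference, the setup on those lines looks roughly like the following sketch, written against the standard OSPRay C API (the actual tutorial adds error checking, scene setup, and chooses its own MPI thread level):

#include <mpi.h>
#include <ospray/ospray.h>

int main(int argc, char **argv)
{
  // The distributed tutorials are MPI-aware and initialize MPI themselves
  // (the requested thread level here is an assumption of this sketch).
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  // Explicitly load the MPI module and select the mpiDistributed device,
  // so no --osp:load-modules / --osp:device flags are needed on the command line.
  ospLoadModule("mpi");
  OSPDevice mpiDevice = ospNewDevice("mpiDistributed");
  ospDeviceCommit(mpiDevice);
  ospSetCurrentDevice(mpiDevice);

  // ... create this rank's portion of the data and render ...

  ospShutdown();
  MPI_Finalize();
  return 0;
}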

The MPI distributed examples show distributed data rendering, where the data is too large to fit on one node and is instead spread over multiple nodes. The distributed applications are assumed to be MPI-aware and take care of, for example, not opening a window on the worker ranks (as the tutorials do).
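
As a sketch of that MPI-awareness (runWithWindow and runHeadless are hypothetical placeholders for the application's own code paths, not OSPRay functions):

#include <mpi.h>

// Hypothetical placeholders for the application's own code paths.
void runWithWindow(); // opens the GLFW window and drives the interactive loop
void runHeadless();   // participates in rendering without touching GLFW

void runApp()
{
  int mpiRank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &mpiRank);

  // Only rank 0 opens a window; the worker ranks stay headless, which avoids
  // the "Failed to initialize GLFW!" abort on nodes without a display.
  if (mpiRank == 0)
    runWithWindow();
  else
    runHeadless();
}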

The mpiOffload device is for the opposite case, where the data can fit on each node and we just want to scale up compute. MPI offload applications don't actually need to know anything about MPI: you can scale up an application written for local rendering just by swapping out some command line parameters passed to ospInit. To try out offload rendering you can run:

mpirun --host pc0,pc1,pc2,pc3,pc4 --mca btl_tcp_if_include 192.168.0.0/24 -x LD_LIBRARY_PATH /piconfs/systems/OSPray/head/bin/ospExamples --osp:load-modules=mpi --osp:device=mpiOffload

Offload works transparently with a local rendering application (like ospExamples) by swapping out the device used in ospInit and turning ranks 1+ into workers.
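
In other words, an offload-capable application can be as simple as this sketch (not the actual ospExamples code): ospInit() parses the --osp:* arguments, so the same binary runs locally or under mpirun with the offload flags.

#include <ospray/ospray.h>

int main(int argc, char **argv)
{
  // ospInit() consumes --osp:* command line arguments, e.g.
  //   --osp:load-modules=mpi --osp:device=mpiOffload
  // so the same binary renders locally or offloads to MPI workers.
  OSPError err = ospInit(&argc, (const char **)argv);
  if (err != OSP_NO_ERROR)
    return 1;

  // ... regular local-style scene setup and rendering, no MPI code needed ...

  ospShutdown();
  return 0;
}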

nyue commented 2 years ago

My MPI cluster nodes are ARM64 Jetson Nanos, and OIDN is not supported on them. Is there a way to tell OSPRay (e.g. ospExamples) not to look for the denoiser module, just so that I can test out the mpiOffload device and verify it works?

picocluster@pc0:~$ /piconfs/systems/OSPray/head/bin/ospExamples
OSPRay error: could not open module lib ospray_module_denoiser: /piconfs/systems/rkcommon/1.7.0/lib/libospray_module_denoiser.so: cannot open shared object file: No such file or directory
nyue commented 2 years ago

FYI, I tried the call to the volume rendering example again; it no longer errors out with the GLFW error.

However, the screen does not draw anything (it stays blank); I waited for about a minute and then killed the process.

FYI, I have built and run the NPB code from NAS, so I know the MPI cluster does work.

Twinklebear commented 2 years ago

Does the app exit after failing to find the denoiser module? It should be configured not to exit; it should just disable the denoiser option in the app GUI.
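
For reference, a sketch of that intended behavior (not the actual ospExamples code): ospLoadModule() reports failure through its return value, so a missing denoiser module can simply disable the option instead of aborting.

#include <ospray/ospray.h>

// Sketch only: returns true if the denoiser module could be loaded,
// so the GUI can grey out the denoiser option instead of exiting.
bool denoiserAvailable()
{
  return ospLoadModule("denoiser") == OSP_NO_ERROR;
}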

Actually, in testing the ospMPIDistribTutorialVolume app myself, it seems to get stuck when running on 2+ ranks. I'll take a look at what's going on there; it looks like a bug somewhere.

nyue commented 2 years ago

Yes, the application still runs even when it cannot find the denoiser, but I am not sure whether the return value affects how MPI interprets it.

nyue commented 2 years ago

I have had success with ospMPIDistribTutorialReplicated:

mpirun --host pc0,pc1,pc2,pc3,pc4 --mca btl_tcp_if_include 192.168.0.0/24 -x LD_LIBRARY_PATH /piconfs/systems/OSPray/head/bin/ospMPIDistribTutorialReplicated

I can see a reasonable performance increase.

At least I know OSPRay+MPI does work on my Jetson Nano cluster.

Twinklebear commented 2 years ago

I think ospExamples should still exit with error code 0 when it doesn't find the denoiser, so it should be OK with MPI. It's great to hear that ospMPIDistribTutorialReplicated works and scales!

I'll take a look at what's going on with the distributed rendering side of things, which is probably also related to #496

Twinklebear commented 2 years ago

This should be resolved in our 2.7.1 release: https://github.com/ospray/ospray/releases/tag/v2.7.1. Please let us know if you run into any issues.