ComputationalRadiationPhysics / haseongpu

HASEonGPU: High performance Amplified Spontaneous Emission on GPU
http://www.hzdr.de/crp

Parallel Architecture for the upcoming alpaka integration #95

Open erikzenker opened 9 years ago

erikzenker commented 9 years ago

Alpaka provides the possibility to describe algorithms (kernels) in an abstract form, such that these algorithms are executable on several hardware architectures, e.g. CPUs, multi-core CPUs, NVIDIA accelerators, or Xeon Phis.
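For illustration, a minimal sketch of what such an abstract kernel can look like (the functor name, parameters, and index query are my assumptions and depend on the alpaka version; this is not taken from the HASEonGPU sources):

```cpp
#include <alpaka/alpaka.hpp>
#include <cstddef>

// Hypothetical kernel functor: the accelerator is a template parameter,
// so the same body can be compiled for serial CPU, OpenMP, CUDA, ...
struct CalcSampleKernel {
    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc,
                                  float const* input,
                                  float* output,
                                  std::size_t n) const
    {
        // One element per thread; the exact index query differs
        // between alpaka versions.
        auto const i = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0];
        if (i < n)
            output[i] = 2.0f * input[i]; // placeholder for the real ASE math
    }
};
```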

The clear goal is to run HASEonGPU on hardware other than NVIDIA accelerators and, I think, also to run HASEonGPU on varying accelerators/devices at the same time. To achieve that, we need to think about how to distribute workload locally to varying devices and globally to compute nodes.

Strategy 1: Every device corresponds to a peer

This design would be more or less equal to the current design, where each peer manages one NVIDIA accelerator (except the master):

Each peer (see the sketch after the list)...

  1. Grabs a free device
  2. Requests a sample point
  3. Runs the kernel on this device
  4. Sends the result back
  5. Requests a new sample point
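A minimal sketch of that loop, assuming hypothetical stand-in types for the master communication and device handling (none of these names come from the HASEonGPU code):

```cpp
#include <optional>

// Illustrative stand-ins, not the actual HASEonGPU types.
struct Device {};
struct SamplePoint { int id; };
struct Result { int sampleId; double phiAse; };

struct Master {
    std::optional<SamplePoint> requestSamplePoint() { return std::nullopt; } // empty = no work left
    void sendResult(Result const&) {}
};

Device grabFreeDevice() { return {}; }
Result runKernelOn(Device&, SamplePoint const& sp) { return {sp.id, 0.0}; }

// Strategy 1: one peer per device; grab the device once, then loop.
void peerMainLoop(Master& master)
{
    Device dev = grabFreeDevice();                  // 1. grab a free device
    while (auto sp = master.requestSamplePoint()) { // 2./5. request a sample point
        Result r = runKernelOn(dev, *sp);           // 3. run the kernel on this device
        master.sendResult(r);                       // 4. send the result back
    }
}
```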

Cons:

Pros:

Strategy 2: One peer per node

In this design, a single peer could request sample points for all available devices on its node and use the alpaka async streams to start multiple kernels in parallel.

Each peer (see the sketch after the list):

  1. Grabs all devices it can get
  2. Requests as many sample points as there are devices
  3. Starts a kernel on each device
  4. Sends a result back when a device has finished
  5. Requests sample points for finished devices
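A rough sketch of this variant, using std::async as a stand-in for alpaka's asynchronous streams/queues (all types and names are illustrative assumptions, not the HASEonGPU API):

```cpp
#include <functional>
#include <future>
#include <optional>
#include <vector>

struct Device {};
struct SamplePoint { int id; };
struct Result { int sampleId; double phiAse; };

struct Master {
    std::optional<SamplePoint> requestSamplePoint() { return std::nullopt; }
    void sendResult(Result const&) {}
};

std::vector<Device> grabAllDevices() { return {Device{}, Device{}}; }
Result runKernelOn(Device&, SamplePoint const& sp) { return {sp.id, 0.0}; }

// Strategy 2: a single peer drives all devices of its node and keeps
// each device busy with its own in-flight kernel.
void nodePeerMainLoop(Master& master)
{
    std::vector<Device> devices = grabAllDevices();          // 1. grab all devices
    std::vector<std::future<Result>> inflight(devices.size());

    // 2./3. start one kernel per device
    for (std::size_t d = 0; d < devices.size(); ++d)
        if (auto sp = master.requestSamplePoint())
            inflight[d] = std::async(std::launch::async, runKernelOn,
                                     std::ref(devices[d]), *sp);

    // 4./5. hand back results and give finished devices new sample points;
    // a real implementation would wait for whichever device finishes first
    // instead of polling them round-robin.
    bool anyRunning = true;
    while (anyRunning) {
        anyRunning = false;
        for (std::size_t d = 0; d < devices.size(); ++d) {
            if (!inflight[d].valid())
                continue;
            master.sendResult(inflight[d].get());             // 4. send result back
            if (auto sp = master.requestSamplePoint())        // 5. request new work
                inflight[d] = std::async(std::launch::async, runKernelOn,
                                         std::ref(devices[d]), *sp);
            if (inflight[d].valid())
                anyRunning = true;
        }
    }
}
```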

Pros:

Discuss !

bussmann commented 9 years ago

2, since the Con isn't one.

erikzenker commented 9 years ago

Okay, it's a Con for my unconscious mind, which does not want to break up the current design. Update!

slizzered commented 9 years ago

In strategy 1, does the peer release the device after it returns the sample point? (My question is, why does it first look for the sample point, and only then grab a device).

I like the first strategy, since hierarchies are kept flat and simple, but I can see the benefits of auto-adjusting the number of devices per node by using only a single peer.

My idea about strategy 2: Use the one-peer-per-node approach, but spawn an additional thread for each accelerator and CPU that takes part in the computation. They can use the original thread for communication and create some form of hierarchy. That way, we can keep a clear separation of parallel computation and communication.
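A sketch of how that thread layout could look, with a simple shared result queue standing in for the real communication between compute threads and the communication thread (all names here are assumptions for illustration):

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Result { int sampleId; double phiAse; };

// Minimal thread-safe queue the compute threads use to hand results
// to the communication thread.
struct ResultQueue {
    std::queue<Result> q;
    std::mutex m;
    std::condition_variable cv;

    void push(Result r) {
        { std::lock_guard<std::mutex> lk(m); q.push(r); }
        cv.notify_one();
    }
    Result pop() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        Result r = q.front(); q.pop();
        return r;
    }
};

// Each accelerator/CPU gets its own compute thread ...
void computeThread(int deviceId, int nSamples, ResultQueue& out) {
    for (int s = 0; s < nSamples; ++s)
        out.push({deviceId * 1000 + s, 0.0}); // placeholder for the real kernel
}

// ... while the original thread only talks to the outside world.
void communicationThread(int nDevices, int nSamplesPerDevice, ResultQueue& in) {
    for (int i = 0; i < nDevices * nSamplesPerDevice; ++i) {
        Result r = in.pop();
        (void)r; // would be forwarded to the master peer here
    }
}

int main() {
    ResultQueue results;
    int const nDevices = 2, nSamples = 4;
    std::vector<std::thread> workers;
    for (int d = 0; d < nDevices; ++d)
        workers.emplace_back(computeThread, d, nSamples, std::ref(results));
    communicationThread(nDevices, nSamples, results);
    for (auto& t : workers) t.join();
}
```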

erikzenker commented 9 years ago

Okay, approach 1 also works when the device is grabbed first. It is also more efficient not to grab a device again and again. Update!

Your idea about strategy 2 looks like this ?

(hierarchy diagram)

Thus, there are two hierarchies of communication ?

slizzered commented 9 years ago

Yes, that is about what I thought of. The communication thread would be very lightweight and only act as an abstraction layer, so the compute threads don't have to change too much (basically, only replace main.cc, adapt calc_phi_ase_graybat.cc, and keep most of the underlying compute code).

I'm not sure about the mesh, but if we can put mesh-creation in a deeper layer (inside the compute-thread), the whole communication will also be separated from alpaka.

ax3l commented 9 years ago

I think strategy 2 is way more complicated to implement, and strategy 1 ("Every device corresponds to a peer") does not require building yet another scheduler that takes care of the devices in the rank.

slizzered commented 9 years ago

Yes, strategy 1 is very easy in comparison and so far we had a lot of success with the KISS principle behind it.

I see the most interesting use of strategy 2 when using very heterogeneous clusters where it is difficult to start the correct number of peers for each node.

erikzenker commented 9 years ago

I would prefer strategy 1, because it's simple. And I think it's not a big thing to go from strategy 1 to strategy 2 later.

ax3l commented 9 years ago

Totally agree; also, connecting various backends over the "same" abstract communication layer is already a nice task.

@slizzered

I see the most interesting use of strategy 2 when using very heterogeneous clusters where it is difficult to start the correct number of peers for each node.

I actually think that might still be possible in strategy 1; one just needs a communication layer that can asynchronously create communicators (MPI) / add new global "ranks" (ZeroMQ sockets). Strategy 2 will naturally grow from that (in case new ranks are not globally announced).
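For the MPI side, such asynchronous growth could in principle build on the dynamic process facilities; a very rough sketch of the accept/connect handshake (purely illustrative, not something HASEonGPU or GrayBat does today):

```cpp
#include <mpi.h>

// Server side (e.g. the master): open a port and accept a new peer at runtime.
void acceptNewPeer(MPI_Comm& intercomm)
{
    char portName[MPI_MAX_PORT_NAME];
    MPI_Open_port(MPI_INFO_NULL, portName);
    // portName would have to be published to the new peer out of band
    MPI_Comm_accept(portName, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
    MPI_Close_port(portName);
}

// Client side (a freshly started peer): connect to the published port.
void joinAsNewPeer(char const* portName, MPI_Comm& intercomm)
{
    MPI_Comm_connect(portName, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
}
```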

bussmann commented 9 years ago

Then let's do 1 and see how it works out. Concentrate on alpaka, not haseongpu redesigns.

slizzered commented 9 years ago

:+1: