ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Allow iterating over particles on the host #864

Closed psychocoderHPC closed 8 years ago

psychocoderHPC commented 9 years ago

Currently we change the memory representation of our linked frames to one big frame in order to iterate over particles on the host side. It would be nice if we could copy the mallocMC buffer to the host side and iterate over the particles with the same algorithms as in the CUDA kernel code.
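
For illustration, a rough sketch (not existing code) of how such a host-side iteration could look once the heap managed by mallocMC has been deep-copied. The `hostBox` interface used below (`numSupercells()`, `getFirstFramePtr()`, `getNextFramePtr()`, `frameSize()`) and the `makeHostBox` factory are placeholders for whatever a host-side `ParticlesBox` view would provide, not existing PMacc or mallocMC API:

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical sketch: deep-copy the mallocMC heap and walk the linked frames
// on the host with the same access pattern as in the CUDA kernels.
template<typename T_MakeHostBox>
void iterateParticlesOnHost(void const* deviceHeapBase,
                            std::size_t heapSizeBytes,
                            T_MakeHostBox makeHostBox)
{
    // 1) Deep copy of the full particle heap to the host.
    std::vector<char> hostHeap(heapSizeBytes);
    cudaMemcpy(hostHeap.data(), deviceHeapBase, heapSizeBytes, cudaMemcpyDeviceToHost);

    // 2) Wrap the host copy in a ParticlesBox-like view. The frame pointers
    //    inside the copy still hold device addresses, so the view has to
    //    translate them (e.g. re-base by the device/host heap offset).
    auto hostBox = makeHostBox(hostHeap.data());

    // 3) Walk the linked frames supercell by supercell, like the CUDA kernels
    //    do, but in plain host code.
    for (std::size_t sc = 0; sc < hostBox.numSupercells(); ++sc)
    {
        for (auto* frame = hostBox.getFirstFramePtr(sc); frame != nullptr;
             frame = hostBox.getNextFramePtr(*frame))
        {
            for (int i = 0; i < hostBox.frameSize(); ++i)
            {
                auto particle = (*frame)[i];  // same frame/particle access as on the device
                (void)particle;               // ... read or analyze attributes here ...
            }
        }
    }
}
```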

Options to implement this:

anshumang commented 9 years ago

Starting implementation of the 2nd option...

Is getEmptyFrame() where device memory is malloc'd using mallocMC?

If so, then are there any other kernels that malloc device memory besides -

psychocoderHPC commented 9 years ago

In the current dev branch there are also some functors that use getEmptyFrame():

```
% grep getEmptyFrame * -Rn
examples/SingleParticleTest/include/particles/ParticlesInitOneParticle.hpp:52:    frame = &(pb.getEmptyFrame());
examples/SingleParticleCurrent/include/particles/ParticlesInitOneParticle.hpp:52:    frame = &(pb.getEmptyFrame());
examples/SingleParticleRadiationWithLaser/include/particles/ParticlesInitOneParticle.hpp:52:    frame = &(pb.getEmptyFrame());
src/libPMacc/include/particles/ParticlesBase.kernel:142:            destFrames[threadIdx.x] = &(pb.getEmptyFrame());
src/libPMacc/include/particles/ParticlesBase.kernel:180:                        destFrames[threadIdx.x] = &(pb.getEmptyFrame());
src/libPMacc/include/particles/ParticlesBase.kernel:598:            frame = &(pb.getEmptyFrame());
src/libPMacc/include/particles/memory/boxes/ParticlesBox.hpp:74:    DINLINE FRAME &getEmptyFrame()
src/picongpu/include/plugins/kernel/CopySpeciesGlobal2Local.kernel:117:            DestFramePtr tmpFrame = &(destBox.getEmptyFrame());
src/picongpu/include/particles/ionization/ionization.hpp:288:                    electronFrame = &(electronBox.getEmptyFrame());
src/picongpu/include/particles/ionization/ionization.hpp:321:                    electronFrame = &(electronBox.getEmptyFrame());
src/picongpu/include/particles/Particles.kernel:78:            myFrame = &(myBox.getEmptyFrame());
src/picongpu/include/particles/Particles.kernel:96:                myFrame = &(myBox.getEmptyFrame());
src/picongpu/include/particles/ParticlesInit.kernel:121:        frame = &(pb.getEmptyFrame());
src/picongpu/include/particles/ParticlesInit.kernel:174:            frame = &(pb.getEmptyFrame());
src/picongpu/include/particles/manipulators/CreateParticlesFromParticleImpl.hpp:103:                    destFrame = &(destParBox.getEmptyFrame());
src/picongpu/include/particles/manipulators/CreateParticlesFromParticleImpl.hpp:142:                    destFrame = &(destParBox.getEmptyFrame())
```
anshumang commented 9 years ago

Follow-up question on the 2nd option described above by @psychocoderHPC: if we copy the species data to one big frame (struct of arrays) on the host, we lose the option to access the particles supercell-wise.

When do we want to access particles by supercell?

ax3l commented 9 years ago

cc'ing @slizzered :)

ax3l commented 9 years ago

@anshumang when we copy the full memory allocated by mallocMC, the following objects will be copied:

That would also allow us to access the particles supercell-wise on the host, which is beneficial for some algorithms, e.g., while they are being prototyped. That can be useful for cutoff-radius force calculations or binary (particle-particle) collisions, for analyzing subsets of the whole particle set (because one can select these by position without accessing all 100M+ particles of a GPU), and similar operations where locality is important.
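
To make the locality argument concrete, here is a small hypothetical example of a host-side analysis that only visits supercells inside a region of interest; `containsSupercell()`, `numParticlesInFrame()` and the rest of the `hostBox` interface are again placeholders for illustration, not existing PMacc API:

```cpp
#include <cstddef>

// Hypothetical host-side subset analysis: only supercells inside a region of
// interest are visited; the rest of the 100M+ particles are never touched.
template<typename T_HostParticlesBox, typename T_Region>
std::size_t countParticlesInRegion(T_HostParticlesBox hostBox, T_Region const& region)
{
    std::size_t count = 0u;
    for (std::size_t sc = 0; sc < hostBox.numSupercells(); ++sc)
    {
        if (!region.containsSupercell(sc))
            continue;  // skip whole supercells outside the region of interest
        for (auto* frame = hostBox.getFirstFramePtr(sc); frame != nullptr;
             frame = hostBox.getNextFramePtr(*frame))
            count += hostBox.numParticlesInFrame(*frame);
    }
    return count;
}
```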

anshumang commented 9 years ago

:+1: got it @psychocoderHPC. So the 1st option described in the feature request above, when implemented, will also give access to the meta information. Correct?

anshumang commented 9 years ago

How does the host memory requirement vary if more than one GPU is attached to a node?

ax3l commented 9 years ago

Sure it does: the host memory requirement is proportional to the number of GPUs on each node that are utilized during a simulation. Each GPU will copy its full particle memory (usually 75% of the GPU memory) to the host with the deep copy.

Currently, we save some memory on the host by copying one species at a time (usually one uses between 1 and 3 to 5 particle species per run).
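
A minimal sketch of that species-by-species staging idea; `copyParticleHeapToHost()` and the species range are assumptions for illustration, not existing PIConGPU calls:

```cpp
// Hypothetical species-wise staging: only one species' particle heap lives on
// the host at a time, so the peak host memory is max over species rather than
// the sum over species.
template<typename T_SpeciesRange, typename T_HostFunctor>
void forEachSpeciesOnHost(T_SpeciesRange& allSpecies, T_HostFunctor&& process)
{
    for (auto& species : allSpecies)
    {
        auto hostCopy = copyParticleHeapToHost(species);  // deep copy of one species only
        process(hostCopy);                                // analyze on the host
        // hostCopy goes out of scope here, freeing the host memory before the
        // next species is transferred.
    }
}
```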

anshumang commented 9 years ago

So on clusters with host memory more than 3-5x the device memory (relatively cheap), one can have as many as 3-5 GPUs/node, headroom which we'd lose with the deep copy... I guess 3-5 GPUs/node may not be a common case, but not leaving enough headroom even for 2 GPUs/node on clusters with a lower host-to-device memory ratio might be a problem. I'm curious to compare performance when scaling up (adding GPUs per node) vs. scaling out (adding more nodes)...

ax3l commented 9 years ago

well, when planning a GPU node for a cluster one should usually do the following calculation: "how much memory do I have summing up all my GPUs in this node -> X" and then add 2 to 4 times that memory to the host (which is an extremely small investment compared to the accelerators).

I mean: adding 16 GB ECC RAM (about $150) per $4k to $9k Tesla GPU is worth it, and it is consequently done on any large-scale HPC system I know of :)

As an example, the HZDR hypnos cluster has a K80 queue with 8 GPUs (4x K80) per node, which adds up, ignoring losses due to ECC, to ~100 GB (4x24 or 8x12 GB) of memory in the GPUs alone. In addition, each node is equipped with 256 GB of RAM on the host side and 4 CPU cores per GPU.

Another example is the ORNL Titan cluster, which pairs one K20 with 6 GB RAM with 32 GB on the host (a factor of 5 more).

Personally, I do not think that small-memory designs on the host side are the right way to go, and current designs, combined with the extremely low cost of host RAM, do not lead to a problem in that direction. The plans for multi-GPU-per-node clusters such as Summit and similar accelerator systems endorse the same concept.

anshumang commented 9 years ago

:+1: thanks for the detailed picture... so on hypnos, with 75% of device memory allocated to particle data (72 GB), the memory requirement on the host can get up to 360 GB (72x5) for a run with 5 particle species, is this correct?

ax3l commented 9 years ago

Ah, I see your question. No, the exemplary 72 GB per node are of course (dynamically) shared between all 5 particle species.

What @psychocoderHPC was talking about last time was a micro-optimization for transferring species-wise parts of the 72 GB to the host (roughly 5 chunks of ~14 GB each), so there is nothing to worry about in the above-mentioned setups.
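
A quick back-of-the-envelope check with the example numbers from this thread (illustrative figures, not measurements):

```cpp
// Illustrative arithmetic only, using the example numbers discussed above.
constexpr double gpuMemoryPerNodeGB = 4 * 24.0;                   // 4x K80 boards (8 GPUs), ~96 GB total
constexpr double particleDataGB     = 0.75 * gpuMemoryPerNodeGB;  // ~72 GB of particle data per node
constexpr int    numSpecies         = 5;

// The ~72 GB are shared by all species, so a species-wise transfer moves
// roughly particleDataGB / numSpecies at a time -- not numSpecies * particleDataGB (360 GB).
constexpr double chunkGB = particleDataGB / numSpecies;           // ~14.4 GB per species chunk
```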

anshumang commented 9 years ago

:+1: got it, thanks for the explanation, and apologies for the confusion :)

ax3l commented 9 years ago

no problem, apologies from our side for confusing you :)

ax3l commented 8 years ago

@psychocoderHPC this should be implemented by now, shouldn't it? :)

psychocoderHPC commented 8 years ago

Yes, it is implemented and used here.

IMO we can close it.

ax3l commented 8 years ago

sweet :)