Closed: psychocoderHPC closed this issue 8 years ago
Starting implementation of the 2nd option...
Is getEmptyFrame() where device memory is malloc'd using mallocMC?
If so, then are there any other kernels that malloc device memory besides -
In the current dev there were also some functors with getEmptyFrame().
% grep getEmptyFrame * -Rn
examples/SingleParticleTest/include/particles/ParticlesInitOneParticle.hpp:52: frame = &(pb.getEmptyFrame());
examples/SingleParticleCurrent/include/particles/ParticlesInitOneParticle.hpp:52: frame = &(pb.getEmptyFrame());
examples/SingleParticleRadiationWithLaser/include/particles/ParticlesInitOneParticle.hpp:52: frame = &(pb.getEmptyFrame());
src/libPMacc/include/particles/ParticlesBase.kernel:142: destFrames[threadIdx.x] = &(pb.getEmptyFrame());
src/libPMacc/include/particles/ParticlesBase.kernel:180: destFrames[threadIdx.x] = &(pb.getEmptyFrame());
src/libPMacc/include/particles/ParticlesBase.kernel:598: frame = &(pb.getEmptyFrame());
src/libPMacc/include/particles/memory/boxes/ParticlesBox.hpp:74: DINLINE FRAME &getEmptyFrame()
src/picongpu/include/plugins/kernel/CopySpeciesGlobal2Local.kernel:117: DestFramePtr tmpFrame = &(destBox.getEmptyFrame());
src/picongpu/include/particles/ionization/ionization.hpp:288: electronFrame = &(electronBox.getEmptyFrame());
src/picongpu/include/particles/ionization/ionization.hpp:321: electronFrame = &(electronBox.getEmptyFrame());
src/picongpu/include/particles/Particles.kernel:78: myFrame = &(myBox.getEmptyFrame());
src/picongpu/include/particles/Particles.kernel:96: myFrame = &(myBox.getEmptyFrame());
src/picongpu/include/particles/ParticlesInit.kernel:121: frame = &(pb.getEmptyFrame());
src/picongpu/include/particles/ParticlesInit.kernel:174: frame = &(pb.getEmptyFrame());
src/picongpu/include/particles/manipulators/CreateParticlesFromParticleImpl.hpp:103: destFrame = &(destParBox.getEmptyFrame());
src/picongpu/include/particles/manipulators/CreateParticlesFromParticleImpl.hpp:142: destFrame = &(destParBox.getEmptyFrame())
Follow-up question on the 2nd option described above by @psychocoderHPC: if we copy the species data to a big frame (struct of arrays) on the host, we lose the option to access the particles supercell-wise.
When do we want to access particles by supercell?
cc'ing @slizzered :)
@anshumang when we copy the full memory allocated by mallocMC, the following objects will be copied:
That would allow us to access the particles supercell-wise on the host as well, which is beneficial for some algorithms, e.g., while they are being prototyped. It can be useful for cutoff-radius force calculations or binary (particle-particle) collisions, for analyzing subsets of the whole particle set (one can select them by position without accessing all 100M+ particles of a GPU), and for similar operations where locality is important.
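As a rough illustration of what supercell-wise access buys us on the host (a simplified sketch with made-up names, not the actual PMacc ParticlesBox API): if the linked-frame structure survives the copy, the frames of a single supercell can be traversed directly, whereas a single flattened big frame drops that grouping.

```cpp
#include <array>

constexpr int frameSize = 256;

// Simplified stand-in for a PMacc-style frame: a fixed-size struct of arrays
// that holds up to frameSize particles and links to the next frame of the same supercell.
struct Frame
{
    std::array<float, frameSize> positionX; // one array per attribute (struct of arrays)
    std::array<float, frameSize> momentumX;
    int numParticles = 0;                   // filled slots in this frame
    Frame* next = nullptr;                  // next frame of the same supercell
};

// Iterate only over the particles of one supercell by following its frame list.
// With a deep copy of the heap this list structure is still available on the host;
// flattening everything into one big frame would discard it.
float sumMomentumX(const Frame* firstFrameOfSupercell)
{
    float sum = 0.f;
    for (const Frame* f = firstFrameOfSupercell; f != nullptr; f = f->next)
        for (int i = 0; i < f->numParticles; ++i)
            sum += f->momentumX[i];
    return sum;
}
```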
:+1: got it @psychocoderHPC. So the 1st option described in the feature request above, when implemented, will also give access to the meta information. Correct?
How does the host memory requirement vary if more than one GPU is attached to a node?
Sure it does; the host memory requirement is proportional to the number of GPUs per node that are utilized during a simulation. Each GPU will copy its full particle memory (usually 75% of the GPU memory) to the host with the deep copy.
Currently, we save some memory on the host by copying the species one after another (usually one uses between 1 and 5 particle species per run).
So clusters with a host-to-device memory ratio >3-5x (relatively cheap) can have as many as 3-5 GPUs/node, which we'd lose with the deep copy... I guess 3-5 GPUs/node may not be a common case... but not leaving enough headroom even for 2 GPUs/node on clusters with a lower host-to-device memory ratio might be a problem... I'm curious to compare performance when scaling up (adding GPUs per node) vs. scaling out (adding more nodes)...
Well, when planning a GPU node for a cluster one usually does the following calculation: "how much memory do all the GPUs in this node have in total? -> X", and then adds 2 to 4 times that amount of memory to the host (which is an extremely small investment compared to the accelerators).
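In numbers, a minimal sketch of that rule of thumb (the per-GPU memory and GPU count below are example assumptions, not fixed values):

```cpp
#include <cstdio>

int main()
{
    // Exemplary numbers only: a node with 4 GPUs of 24 GiB each.
    const double gibPerGpu   = 24.0;
    const int    gpusPerNode = 4;

    const double x = gibPerGpu * gpusPerNode;  // "X": total GPU memory in the node
    std::printf("total GPU memory X : %.0f GiB\n", x);
    std::printf("suggested host RAM : %.0f - %.0f GiB (2x to 4x of X)\n", 2.0 * x, 4.0 * x);
    return 0;
}
```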
I mean: adding 16 GB of ECC RAM (about $150) per $4k-$9k Tesla GPU is worth it and is consequently done on any large-scale HPC system I know of :)
As an example, the HZDR hypnos cluster has a K80 queue with 8 GPUs (4x K80) per node, which, ignoring losses due to ECC, adds up to ~100 GB (4x24 or 8x12 GB) of memory in the GPUs alone. In addition, each node is equipped with 256 GB of RAM on the host side and 4 CPU cores per GPU.
Another example is the ORNL Titan cluster, which pairs one K20 with 6 GB of RAM with as much as 32 GB on the host (a factor of ~5 more).
Personally, I do not think that small-memory designs on the host side are the right way to go, and current designs, combined with the extremely low cost of host RAM, do not lead to a problem in that direction. The plans for multi-GPU-per-node clusters such as Summit and similar accelerator clusters endorse the same concept.
:+1: thanks for the detailed picture... so on hypnos, with 75% of device memory allocated to particle data (72 GB), the memory requirement on the host can get up to 360 GB (72x5) for a run with 5 particle species, is this correct?
ah I see your question. no, the exemplary 72 GB per node are of course (dynamically) shared between all 5 particle species.
What @psychocoderHPC was talking about last time was a micro-optimization: how to transfer species-wise parts of the 72 GB to the host (roughly in 5 chunks of ~14 GB each), so there is nothing to worry about in the above-mentioned setups.
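Putting the numbers together (a small sketch; the 75% share and the 5 species are the exemplary values from above, not fixed constants):

```cpp
#include <cstdio>

int main()
{
    // Approximate per-node numbers from the hypnos K80 example above.
    const double gpuMemGiB     = 96.0;  // 4x K80 = 4x24 GiB
    const double particleShare = 0.75;  // fraction of GPU memory used for particle data
    const int    numSpecies    = 5;     // exemplary run with 5 particle species

    const double heapGiB    = gpuMemGiB * particleShare; // ~72 GiB, shared by all species
    const double perSpecies = heapGiB / numSpecies;      // ~14 GiB if shared evenly

    // The host never needs numSpecies * heapGiB (i.e. not 360 GiB): the heap itself
    // is shared, and copying species after species moves only one chunk at a time.
    std::printf("particle heap on the GPUs: %.0f GiB\n", heapGiB);
    std::printf("per-species chunk        : ~%.1f GiB\n", perSpecies);
    return 0;
}
```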
:+1: got it, thanks for the explanation and apologies for the confusion :)
no problem, apologies from our side for confusing you :)
@psychocoderHPC this should be implemented by now, shouldn't it? :)
Yes it is implemented and used here.
IMO we can close it.
sweet :)
Currently we change the memory representation of our linked frames to one big frame to iterate over particles on the host side. It would be nice if we could copy the mallocMC buffer to the host side and iterate over the particles with the same algorithms as in the CUDA kernel code.

Options to implement this:
- mallocMC (currently there is no way to copy the used memory to the host), see https://github.com/ComputationalRadiationPhysics/mallocMC/issues/28 (a rough sketch of such a deep copy is given below)
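For illustration, a minimal sketch of what such a deep copy could look like on the CUDA side, assuming mallocMC exposed the heap's base pointer and size (the helper name and signature below are made up, not existing mallocMC API):

```cpp
#include <cuda_runtime.h>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: copy a device-side heap, e.g. the memory pool managed by
// mallocMC, to the host in one piece. mallocMC itself would have to provide
// the pool's base pointer (devHeap) and size (heapBytes).
std::vector<char> copyHeapToHost(const void* devHeap, std::size_t heapBytes)
{
    std::vector<char> hostHeap(heapBytes);
    cudaError_t err = cudaMemcpy(hostHeap.data(), devHeap, heapBytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
    {
        std::fprintf(stderr, "heap copy failed: %s\n", cudaGetErrorString(err));
        hostHeap.clear();
    }
    // Caveat: pointers stored inside the heap (e.g. the links between frames) still
    // contain device addresses; to traverse frames on the host they must be
    // translated by the offset between hostHeap.data() and devHeap.
    return hostHeap;
}
```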