AMReX-Codes / amrex

AMReX: Software Framework for Block Structured AMR
https://amrex-codes.github.io/amrex

efficient parallelism with ParticleContainer #4027

Closed: S-Explorer closed this issue 5 days ago

S-Explorer commented 2 weeks ago

Dear developers: I have a question about efficient parallelism with ParticleContainer.

In our simulation, we have many solid spheres (10,000+), and we want to place ~700 particles on the surface of each sphere. These particles need to interact with the Eulerian background mesh.

Here is our current approach. First, every proc stores the information of all the solid spheres. Second, we loop over the solid spheres sequentially. For each sphere, we fill a ParticleContainer with its ~700 particles and distribute them to the appropriate procs with Redistribute() (the particles can end up on any proc). Next, these particles interact with the Eulerian mesh; only the procs that received particles take part in this interaction. Finally, we move on to the next sphere and overwrite the ~700 particles in the same ParticleContainer, so there is only one ParticleContainer object in the whole simulation.
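In code, the loop looks roughly like the sketch below (heavily simplified; Sphere, fill_surface_markers, and interact_with_mesh are placeholders for our own structs and routines, and the component counts are only illustrative):

```cpp
#include <vector>

#include <AMReX_MultiFab.H>
#include <AMReX_Particles.H>

struct Sphere { /* center, radius, ... (placeholder) */ };

// One container, reused for the ~700 surface markers of a single sphere.
using SurfacePC = amrex::ParticleContainer<4 /*extra Reals*/, 0, 0, 0>;

void fill_surface_markers (SurfacePC& pc, const Sphere& sp);      // overwrite the ~700 particles
void interact_with_mesh   (SurfacePC& pc, amrex::MultiFab& mf);   // Eulerian-particle coupling

void advance_spheres (SurfacePC& pc, amrex::MultiFab& euler_mf,
                      const std::vector<Sphere>& spheres)
{
    for (const auto& sp : spheres) {      // 10,000+ iterations, strictly sequential
        fill_surface_markers(pc, sp);     // reset the same ~700 particles for this sphere
        pc.Redistribute();                // markers migrate to the procs that own their cells
        interact_with_mesh(pc, euler_mf); // only those procs do work; everyone else waits
    }
}
```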

We chose this approach to save memory: we did not want to store ~700 * 10,000 particles for all the solid spheres at once, so we only store the ~700 particles of one sphere in a single ParticleContainer and reuse them. However, the parallel efficiency is very low, because in each sphere iteration the idle procs have to wait for the busy procs. Here the busy procs are the ones that hold particles after Redistribute() and take part in the Eulerian-particle interaction.

How can we achieve more parallelism? Any suggestions?

asalmgren commented 2 weeks ago

You could create a single ParticleContainer that contains all the particles.

The particles will be distributed to the different processors based on their location, which means that the particles on each sphere will most likely be on the same processor. The mesh data will be distributed to the different processors, and since the particles on the same sphere will be talking to the same region of the background mesh, that operation will be local on each processor.

Note that the particle data will be distributed in this case -- each processor only stores the data for the particles it "owns".
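Schematically, it could look something like the sketch below (just a sketch; Sphere and add_surface_markers are placeholders for your own code, and the component count is illustrative):

```cpp
#include <vector>

#include <AMReX_Particles.H>

struct Sphere { /* center, radius, ... (placeholder) */ };

// All ~7 million surface markers live in a single container.
using AllMarkersPC = amrex::ParticleContainer<12 /*Reals per particle*/, 0, 0, 0>;

void add_surface_markers (AllMarkersPC& pc, const Sphere& sp);  // append ~700 markers for one sphere

void build_all_markers (AllMarkersPC& pc, const std::vector<Sphere>& spheres)
{
    for (const auto& sp : spheres) {
        add_surface_markers(pc, sp);
    }
    pc.Redistribute();  // one call: every particle moves to the proc that owns its grid cell,
                        // so each proc ends up storing only the markers in its own boxes
}
```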

Does this make sense?


ruohai0925 commented 2 weeks ago

Hi Ann,

I am adding some clarification here since I am working with Xuzhu (@S-Explorer) on this work. Our objective is to combine the diffused immersed boundary method (part of my Ph.D. thesis) with IAMR to handle resolved DNS with many particles.

Based on your suggestions, it seems we have to strike a balance. If we build a single ParticleContainer that contains all the particles, we will store the info of 10,000 * 700 ≈ 7 million particles. Currently, we only store the ~700 particles of one sphere, loop over the spheres sequentially, and reuse those ~700 particles.

Alternatively, we could build N ParticleContainers, where N is the number of CPUs used in the simulation and each ParticleContainer holds ~700 particles. The problem is that the particles associated with one CPU may be moved to other CPUs after calling Redistribute(), so this is hard to parallelize.

In any case, I think we should compute all the spheres simultaneously using all CPUs, instead of looping over the spheres sequentially.

Jordan

WeiqunZhang commented 2 weeks ago

7 million particles is not a big number unless each particle has a lot of data. How many bytes do you need per particle? How many processes do you plan to use? How much memory do you have per process? Could you provide some numbers?


S-Explorer commented 2 weeks ago

Hi Weiqun, Ann!

Thanks for your replies.

@WeiqunZhang

Each particle carries about 12 Reals, i.e. ~96 bytes per particle, so 10,000 * 700 particles need ~600 MB of memory. At this stage we are running the program on 32 cores, each with 4 GB of memory. We are thinking ahead about memory usage once we move to GPU acceleration, and we want to save resources as much as possible. If we store all the particles at once, will that cause any issues when running on GPUs?
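(Spelling out the arithmetic, assuming 8-byte double-precision Reals: 12 Reals * 8 B = 96 B per particle, and 10,000 * 700 = 7 million particles, so the total is roughly 7e6 * 96 B ≈ 0.67 GB.)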

WeiqunZhang commented 2 weeks ago

600 MB / # of GPUs seems small even if there is only one GPU.

atmyers commented 2 weeks ago

I agree with @asalmgren and @WeiqunZhang that you should just go ahead and store all the particles at once. The smaller of the two types of A100s on Perlmutter has 40 GB of device memory, so you could fit ~400 million particles on just one. In addition to scaling better with the number of ranks, you will expose more parallelism to the GPU if you do many interpolation operations at once, instead of just 700 at a time.
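As a very rough sketch of what that could look like (reusing the illustrative AllMarkersPC alias from above, assuming an AoS-style container, and leaving the actual interpolation/deposition body as a comment):

```cpp
#include <AMReX_MultiFab.H>
#include <AMReX_Particles.H>

using AllMarkersPC   = amrex::ParticleContainer<12, 0, 0, 0>;
using AllMarkersIter = amrex::ParIter<12, 0, 0, 0>;

void interact (AllMarkersPC& pc, amrex::MultiFab& euler_mf, int lev)
{
    for (AllMarkersIter pti(pc, lev); pti.isValid(); ++pti) {
        auto& aos     = pti.GetArrayOfStructs();
        auto* pstruct = aos().data();           // pointer to this tile's particles
        const int np  = pti.numParticles();
        auto const& phi = euler_mf.array(pti);  // locally owned mesh data

        // One kernel launch over all local particles, not 700 at a time.
        amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE (int i) noexcept
        {
            // interpolate phi to pstruct[i] / deposit from pstruct[i] to phi here
            amrex::ignore_unused(pstruct, phi, i);
        });
    }
}
```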

asalmgren commented 5 days ago

@S-Explorer -- closing this now due to inactivity -- please let us know if you have any more questions!