FluidityProject / fluidity

Fluidity
http://fluidity-project.org

empty partition in Zoltan_integration.F90 #386

Open xiangbei007 opened 1 month ago

xiangbei007 commented 1 month ago

When I run the backward_facing_step_3d example with NPROCS set to a value greater than 128, on two nodes using make run, I get empty partitions, which cause the program to terminate. This happens whether I use the graph partitioner ParMETIS or the hypergraph partitioner PHG.

In summary, I would like suggestions on how to improve the scalability of the program, that is, how to prevent the zoltan_load_balance function from generating empty partitions as the number of processes increases.

jhill1 commented 1 month ago

Hi,

The standard BFS-3D mesh has about 530,000 elements to start with. On 128 cores that's around 4000 elements per core. Adaptivity changes that throughout the run, but I can't remember the rough numbers. The number of elements is key to how many cores the model can run on, and my rule of thumb for Fluidity is around 10,000 elements per core (on a CG discretisation). Check the stat file and plot the number of elements in the run at the time of failure; that should give you an indication of how many cores are realistic.
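The arithmetic behind that rule of thumb can be sketched as follows. This is only an illustration: the function names are hypothetical, the 530,000 figure is the approximate initial mesh size quoted above, and adaptivity will change the element count during a run.

```python
def elements_per_core(n_elements: int, n_cores: int) -> float:
    """Average number of elements each core would own."""
    return n_elements / n_cores


def max_realistic_cores(n_elements: int, target_per_core: int = 10_000) -> int:
    """Largest core count that still keeps about target_per_core elements per core.

    target_per_core defaults to the 10,000 elements-per-core rule of thumb
    quoted above for a CG discretisation.
    """
    return max(1, n_elements // target_per_core)


if __name__ == "__main__":
    n_elements = 530_000  # approximate initial BFS-3D mesh size
    for cores in (128, 1024, 2048):
        print(f"{cores} cores: ~{elements_per_core(n_elements, cores):.0f} elements per core")
    print("cores suggested by the rule of thumb:", max_realistic_cores(n_elements))
```

Substituting the element count read from the stat file at the time of failure for the initial mesh size gives a more relevant estimate.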

Hope that helps, Jon

-- Dr Jon Hill Senior Lecturer in Physical Geography Chair of Board of Examiners Department of Environment and Geography University of York M: +44(0)7748254812 Web: https://jonxhill.wordpress.com/ Web: https://envmodellinggroup.github.io/

xiangbei007 commented 1 month ago

Thank you for your reply.

Do you mean that the occurrence of empty partitions is not related to the partitioner's algorithm? For the BFS-3D example, the input file does not specify the partitioner, so the default is used, which is zoltan_graph + PHG + PARTITION. In this case, it can only run on 128 processors. However, when I specify HYPERGRAPH + PHG + REPARTITION as the partitioner in the backward_facing_step3d.flml file, Fluidity can run on 1024 processors, a total of 16 nodes. When the processor count increases to 2048, Fluidity still aborts. When I switched to ParMETIS, the performance was not as good as HYPERGRAPH + PHG. The graph partitioning algorithms used by ParMETIS and PHG are both multilevel graph partitioning methods.

My previous thought was that there was a problem in how Fluidity calls Zoltan for load balancing, and that this produced the empty partitions. So I would like to ask whether anyone familiar with this part can suggest how to fix the problem. Intuitively, after graph partitioning each part should have at least some vertices. Even in the case of load imbalance, it should not result in empty partitions.

My current work is to run Fluidity on a large number of processors, even more than 10,000. It is currently stuck at the point where zoltan_load_balance generates empty partitions. I would like to ask if you have any suggestions.

Thank you very much.

jhill1 commented 1 month ago

The empty partition issue is known in the Zoltan code and is a result of not having enough elements per processor for the zoltan_graph algorithm (see http://www.hector.ac.uk/cse/distributedcse/reports/fluidity-zoltan/fluidity-zoltan.pdf). We had plans a long time ago to resolve this by ignoring the empty partition in the calculation, but those plans never came to fruition. We did have a workaround, which was to adjust the load imbalance tolerance. You might want to change that manually to see if you can get around the empty partitions.

It's been a while since we implemented this code, so my memory is a bit rusty!

That might help you?
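To see why the load imbalance tolerance interacts with empty partitions, here is a small sketch. It assumes imbalance is measured as maximum part load divided by average part load, which is how the Zoltan documentation describes its IMBALANCE_TOL parameter. The helper names and the toy assignment are hypothetical, and this is not how Fluidity or Zoltan compute anything internally.

```python
from collections import Counter


def part_sizes(part_ids, n_parts):
    """Count how many vertices were assigned to each part 0..n_parts-1."""
    counts = Counter(part_ids)
    return [counts.get(p, 0) for p in range(n_parts)]


def empty_parts(sizes):
    """Indices of parts that received no vertices."""
    return [p for p, size in enumerate(sizes) if size == 0]


def imbalance(sizes):
    """Maximum load over average load (assumed definition, see lead-in)."""
    total = sum(sizes)
    if total == 0:
        raise ValueError("no vertices assigned to any part")
    return max(sizes) / (total / len(sizes))


if __name__ == "__main__":
    # Toy assignment: 12 vertices on 4 parts, with part 3 left empty.
    part_ids = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
    sizes = part_sizes(part_ids, n_parts=4)
    ratio = imbalance(sizes)
    print("sizes:", sizes, "empty parts:", empty_parts(sizes))
    for tol in (1.1, 1.5):
        verdict = "within" if ratio <= tol else "exceeds"
        print(f"imbalance {ratio:.2f} {verdict} tolerance {tol}")
```

Under this metric an assignment with an empty part can still pass a loose tolerance, while a tight one rejects it. How the partitioner responds to a changed tolerance in a real run has to be checked experimentally.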
