Warwick-Plasma / epoch

Particle-in-cell code for plasma physics simulations
https://epochpic.github.io
GNU General Public License v3.0

ARCHER2 - Automatic domain decomposition is failing (EPOCH not running on more than one node) #539

Open HollyHuddle opened 1 year ago

HollyHuddle commented 1 year ago

Hello,

I am having trouble getting my input deck to run on ARCHER2 on more than 1 node. I've contacted ARCHER2 support, but wanted to see if anyone else has come across this issue. I built the latest version of the code with the following steps (guided by ARCHER2 support):

mkdir epoch
cd epoch
git clone --recursive https://github.com/Warwick-Plasma/epoch
mv epoch epoch-4.19.2
cd epoch-4.19.2
git checkout v4.19.2
Edited "./SDF/FORTRAN/Makefile" by replacing Archer/Hector with ARCHER2
cd epoch2d
Edited "./Makefile" by replacing Archer/Hector with ARCHER2 and adding "-J../SDF/FORTRAN" to MODULEFLAG.
module load cray-python
export COMPILER=archer2
make

My input deck runs fine on 1 node with 128 cores, producing sdf files. When I run on 2 or more nodes, the job appears to be running when I check the queue, but there is no output at all, just an empty Slurm output file.

input2dArcher.txt SubScript.txt

HollyHuddle commented 1 year ago

Additionally, the ARCHER2 support team have uncovered that MPI_Recv errors occur during the initial particle load. Running across two nodes with 64 cores on each node gives the same error. Attached is one of the outputs showing the error when trying to run on 2 cores, which appears as a 'PMPI_Recv: Message truncated' error. The same error occurs when the gfortran compiler is used to build the code instead of archer2.

slurmout.txt

Status-Mirror commented 1 year ago

Hey @HollyHuddle,

I've had a few people find this bug. For some reason, our automated domain decomposition isn't working with Archer2. In your attached input deck, you have a grid of 5000x5000 cells. If you run on 2 nodes with 128 cores on each, then you have 256 processing cores. I'm not sure what's going wrong here, but Archer2 is setting up a grid of 256x1 processors, meaning each processor has a local grid of approximately 20x5000 cells. There seems to be a minimum limit on how many cells are allowed on a single core, and this is causing the break.

In EPOCH, you can manually specify the domain decomposition by using nprocx and nprocy in the control block. For a 256 core simulation, I set both of these to 16, and then I can run your input deck. The aim is to create a decomposition such that each processor has a grid with roughly $n_x=n_y$ cells.
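To illustrate how a roughly square per-rank grid could be picked by hand, here is a minimal Python sketch. It is not EPOCH's own routine: the function name best_2d_decomposition and the perimeter criterion are assumptions made for illustration. It tries every factor pair of the rank count and keeps the one with the smallest local perimeter.

```python
def best_2d_decomposition(nx, ny, nranks):
    """Return (nprocx, nprocy) whose local subdomains have the smallest perimeter.

    Hypothetical helper for illustration only; not taken from the EPOCH source.
    """
    best = None
    for npx in range(1, nranks + 1):
        if nranks % npx:
            continue  # npx must divide the rank count exactly
        npy = nranks // npx
        # Approximate local subdomain size in cells (may be fractional)
        local_x = nx / npx
        local_y = ny / npy
        perimeter = 2 * (local_x + local_y)
        if best is None or perimeter < best[0]:
            best = (perimeter, npx, npy)
    return best[1], best[2]


# 5000x5000 cells on 256 ranks, as in the input deck discussed above
npx, npy = best_2d_decomposition(5000, 5000, 256)
print(f"nprocx = {npx}")
print(f"nprocy = {npy}")
```

For the 5000x5000 grid on 256 ranks this picks 16 and 16, which are the values to put in the control block.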

Normally EPOCH automatically chooses the best decomposition, but this hasn't worked on Archer2 ever since their most recent upgrade, and I'm struggling to figure out why that is. There are other bugs preventing particle loading from binary files which may be related. I'll keep this issue open while I continue my debugging efforts, and you can use the nprocx, nprocy work-around until then.

Cheers, Stuart

Status-Mirror commented 1 year ago

I have combed through the scripts which decide the domain decomposition, and I actually believe they are working as intended. The unusual 256x1 decomposition is the "correct" answer for a density of the form shown here:

[Image: Number_density]

However, EPOCH is unable to run when a 256x1 decomposition is chosen. To prevent this error from cropping up again, we need to perform the following actions:

I do not know exactly what causes the 256x1 crash yet, but I have found that nprocx and nprocy are set in both housekeeping/mpi_routines.F90 and housekeeping/balance.F90. The former sets a domain grid which minimises the perimeters of ranks in EPOCH2D, while the latter sets a domain grid which has roughly equal numbers of particles on each rank. I believe the check should be added to housekeeping/balance.F90.
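To show why balancing on particle number can give very thin ranks, here is a rough Python sketch of a 1D split by equal load. It is only an illustration under simple assumptions, not the algorithm in housekeeping/balance.F90; the function name balanced_slabs and the toy load profile are made up.

```python
from bisect import bisect_left
from itertools import accumulate


def balanced_slabs(load_per_column, nranks):
    """Split columns into nranks contiguous slabs of roughly equal total load.

    Returns a list of (first, last) 1-based inclusive column indices.
    Hypothetical helper for illustration only.
    """
    cumulative = list(accumulate(load_per_column))
    total = cumulative[-1]
    ncols = len(load_per_column)
    slabs = []
    start = 0  # 0-based index of the first column in the next slab
    for rank in range(1, nranks + 1):
        if rank == nranks:
            end = ncols - 1  # last rank takes whatever is left
        else:
            # First column at which the running load reaches this rank's share
            target = total * rank / nranks
            end = bisect_left(cumulative, target)
            # Keep at least one column for this rank and each remaining rank
            end = max(end, start)
            end = min(end, ncols - 1 - (nranks - rank))
        slabs.append((start + 1, end + 1))
        start = end + 1
    return slabs


# Toy profile: 80 nearly empty columns followed by 20 dense columns
load = [1] * 80 + [20] * 20
for rank, (first, last) in enumerate(balanced_slabs(load, 4)):
    print(rank, first, last, last - first + 1)
```

With this toy profile the first rank spans 82 columns and the other three span 6 columns each, which mirrors the wide-then-narrow pattern a density concentrated at one end of the box would produce.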

Status-Mirror commented 1 year ago

The error itself occurs during a call to redistribute_domain in housekeeping/balance.F90, called by pre_balance_workload, which is itself called by pre_load_balance in housekeeping/setup.F90, which is called by epoch2d.F90.

Something about this domain decomposition seems to break the redistribute_domain subroutine.

Rank    xmin    xmax    x-cells   
0   1   242 242
1   243 486 244
2   487 729 243
3   730 973 244
4   974 1216    243
5   1217    1459    243
6   1460    1702    243
7   1703    1945    243
8   1946    2188    243
9   2189    2431    243
10  2432    2674    243
11  2675    2918    244
12  2919    3161    243
13  3162    3252    91
14  3253    3278    26
15  3279    3298    20
16  3299    3314    16
17  3315    3328    14
18  3329    3341    13
19  3342    3352    11
20  3353    3363    11
21  3364    3373    10
22  3374    3383    10
23  3384    3392    9
24  3393    3401    9
25  3402    3410    9
26  3411    3418    8
27  3419    3425    7
28  3426    3433    8
29  3434    3440    7
30  3441    3447    7
31  3448    3454    7
32  3455    3461    7
33  3462    3468    7
34  3469    3475    7
35  3476    3482    7
36  3483    3489    7
37  3490    3496    7
38  3497    3502    6
39  3503    3509    7
40  3510    3516    7
41  3517    3523    7
42  3524    3530    7
43  3531    3537    7
44  3538    3544    7
45  3545    3551    7
46  3552    3558    7
47  3559    3565    7
48  3566    3571    6
49  3572    3578    7
50  3579    3585    7
51  3586    3592    7
52  3593    3599    7
53  3600    3606    7
54  3607    3613    7
55  3614    3620    7
56  3621    3627    7
57  3628    3634    7
58  3635    3640    6
59  3641    3647    7
60  3648    3654    7
61  3655    3661    7
62  3662    3668    7
63  3669    3675    7
64  3676    3682    7
65  3683    3689    7
66  3690    3696    7
67  3697    3703    7
68  3704    3709    6
69  3710    3716    7
70  3717    3723    7
71  3724    3730    7
72  3731    3737    7
73  3738    3744    7
74  3745    3751    7
75  3752    3758    7
76  3759    3765    7
77  3766    3772    7
78  3773    3778    6
79  3779    3785    7
80  3786    3792    7
81  3793    3799    7
82  3800    3806    7
83  3807    3813    7
84  3814    3820    7
85  3821    3827    7
86  3828    3834    7
87  3835    3841    7
88  3842    3847    6
89  3848    3854    7
90  3855    3861    7
91  3862    3868    7
92  3869    3875    7
93  3876    3882    7
94  3883    3889    7
95  3890    3896    7
96  3897    3903    7
97  3904    3910    7
98  3911    3917    7
99  3918    3923    6
100 3924    3930    7
101 3931    3937    7
102 3938    3944    7
103 3945    3951    7
104 3952    3958    7
105 3959    3965    7
106 3966    3972    7
107 3973    3979    7
108 3980    3986    7
109 3987    3992    6
110 3993    3999    7
111 4000    4006    7
112 4007    4013    7
113 4014    4020    7
114 4021    4027    7
115 4028    4034    7
116 4035    4041    7
117 4042    4048    7
118 4049    4055    7
119 4056    4061    6
120 4062    4068    7
121 4069    4075    7
122 4076    4082    7
123 4083    4089    7
124 4090    4096    7
125 4097    4103    7
126 4104    4110    7
127 4111    4117    7
128 4118    4124    7
129 4125    4130    6
130 4131    4137    7
131 4138    4144    7
132 4145    4151    7
133 4152    4158    7
134 4159    4165    7
135 4166    4172    7
136 4173    4179    7
137 4180    4186    7
138 4187    4193    7
139 4194    4199    6
140 4200    4206    7
141 4207    4213    7
142 4214    4220    7
143 4221    4227    7
144 4228    4234    7
145 4235    4241    7
146 4242    4248    7
147 4249    4255    7
148 4256    4262    7
149 4263    4268    6
150 4269    4275    7
151 4276    4282    7
152 4283    4289    7
153 4290    4296    7
154 4297    4303    7
155 4304    4310    7
156 4311    4317    7
157 4318    4324    7
158 4325    4331    7
159 4332    4337    6
160 4338    4344    7
161 4345    4351    7
162 4352    4358    7
163 4359    4365    7
164 4366    4372    7
165 4373    4379    7
166 4380    4386    7
167 4387    4393    7
168 4394    4400    7
169 4401    4406    6
170 4407    4413    7
171 4414    4420    7
172 4421    4427    7
173 4428    4434    7
174 4435    4441    7
175 4442    4448    7
176 4449    4455    7
177 4456    4462    7
178 4463    4469    7
179 4470    4476    7
180 4477    4482    6
181 4483    4489    7
182 4490    4496    7
183 4497    4503    7
184 4504    4510    7
185 4511    4517    7
186 4518    4524    7
187 4525    4531    7
188 4532    4538    7
189 4539    4545    7
190 4546    4551    6
191 4552    4558    7
192 4559    4565    7
193 4566    4572    7
194 4573    4579    7
195 4580    4586    7
196 4587    4593    7
197 4594    4600    7
198 4601    4607    7
199 4608    4614    7
200 4615    4620    6
201 4621    4627    7
202 4628    4634    7
203 4635    4641    7
204 4642    4648    7
205 4649    4655    7
206 4656    4662    7
207 4663    4669    7
208 4670    4676    7
209 4677    4683    7
210 4684    4689    6
211 4690    4696    7
212 4697    4703    7
213 4704    4710    7
214 4711    4717    7
215 4718    4724    7
216 4725    4731    7
217 4732    4738    7
218 4739    4745    7
219 4746    4752    7
220 4753    4758    6
221 4759    4765    7
222 4766    4772    7
223 4773    4779    7
224 4780    4786    7
225 4787    4793    7
226 4794    4800    7
227 4801    4807    7
228 4808    4814    7
229 4815    4821    7
230 4822    4827    6
231 4828    4834    7
232 4835    4841    7
233 4842    4848    7
234 4849    4855    7
235 4856    4862    7
236 4863    4869    7
237 4870    4876    7
238 4877    4883    7
239 4884    4890    7
240 4891    4896    6
241 4897    4903    7
242 4904    4910    7
243 4911    4917    7
244 4918    4924    7
245 4925    4931    7
246 4932    4938    7
247 4939    4945    7
248 4946    4952    7
249 4953    4959    7
250 4960    4965    6
251 4966    4972    7
252 4973    4979    7
253 4980    4986    7
254 4987    4993    7
255 4994    5000    7

HollyHuddle commented 1 year ago

Hi Stuart, thanks for giving me a quick solution to get this running!

Status-Mirror commented 1 year ago

No problem! You can ignore my posts on this thread for now - I'm just recording some debugging info in case I need to pass this issue on to someone else.

Earlier on I gave the domain decomposition for the 256x1 rank simulation which failed - here, all ranks spanned 5000 cells in y. At the bottom of this message, I give the domain decomposition for the 16x16 grid which does not fail. Interestingly, the 16x16 simulation has the largest local domain: rank 0 has 6.80e6 cells, whereas the largest rank in the 256x1 simulation (rank 1) has 1.22e6 cells. Also, the smallest rank on 256x1 has 3.0e4 cells (6 x 5000), while the smallest rank on 16x16 has 1.2e4 cells (110 x 110). This suggests the absolute number of cells on a rank does not cause this bug, as the simulation with both the highest cell-count rank and the lowest cell-count rank still runs.

On 16x16, the lowest number of cells a rank has in the x and y directions are 110 (rank 4) and 110 (rank 14) respectively. On 256x1, these are 6 (rank 38) and 5000 (all). Could it be that 6 cells is too small for a rank? Maybe 5000 cells is too large for the MPI scripts to work?
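As a cross-check, the per-rank cell counts follow directly from the inclusive index bounds in the two tables. A minimal Python sketch (the helper name cell_count is made up for illustration; the bounds are copied from the tables in this thread, with y spanning 1 to 5000 on every 256x1 rank):

```python
def cell_count(xmin, xmax, ymin, ymax):
    """Number of cells in a subdomain with inclusive index bounds."""
    return (xmax - xmin + 1) * (ymax - ymin + 1)


# (xmin, xmax, ymin, ymax) for selected ranks, taken from the tables
ranks_256x1 = {
    1: (243, 486, 1, 5000),     # widest rank in x
    38: (3497, 3502, 1, 5000),  # first rank only 6 cells wide
}
ranks_16x16 = {
    0: (1, 3297, 1, 2062),         # largest rank
    13: (4670, 4779, 2463, 2572),  # a 110 x 110 rank
}

for label, ranks in (("256x1", ranks_256x1), ("16x16", ranks_16x16)):
    for rank, bounds in ranks.items():
        print(f"{label} rank {rank}: {cell_count(*bounds)} cells")
```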

        Rank        xmin        xmax        ymin        ymax        area
           0           1        3297           1        2062     6798414
           1        3298        3454        2063        2232       26690
           2        3455        3565        2063        2232       18870
           3        3566        3675        2233        2351       13090
           4        3676        3785        2233        2351       13090
           5        3786        3896           1        2062      228882
           6        3897        4006        2352        2462       12210
           7        4007        4117        2352        2462       12321
           8        4118        4227        2352        2462       12210
           9        4228        4337        2352        2462       12210
          10        4338        4448        2352        2462       12321
          11        4449        4558        2352        2462       12210
          12        4559        4669        2352        2462       12321
          13        4670        4779        2463        2572       12100
          14        4780        4890        2463        2572       12210
          15        4891        5000        2463        2572       12100
          16           1        3297        2463        2572      362670
          17        3298        3454        2463        2572       17270
          18        3455        3565        2463        2572       12210
          19        3566        3675        2463        2572       12100
          20        3676        3785        2463        2572       12100
          21        3786        3896        2573        2683       12321
          22        3897        4006        2573        2683       12210
          23        4007        4117        2573        2683       12321
          24        4118        4227        2573        2683       12210
          25        4228        4337        2573        2683       12210
          26        4338        4448        2573        2683       12321
          27        4449        4558        2573        2683       12210
          28        4559        4669        2573        2683       12321
          29        4670        4779        2684        2793       12100
          30        4780        4890        2684        2793       12210
          31        4891        5000        2684        2793       12100
          32           1        3297        2684        2793      362670
          33        3298        3454        2684        2793       17270
          34        3455        3565        2684        2793       12210
          35        3566        3675        2684        2793       12100
          36        3676        3785        2684        2793       12100
          37        3786        3896        2794        2903       12210
          38        3897        4006        2794        2903       12100
          39        4007        4117        2794        2903       12210
          40        4118        4227        2794        2903       12100
          41        4228        4337        2794        2903       12100
          42        4338        4448        2794        2903       12210
          43        4449        4558        2794        2903       12100
          44        4559        4669        2794        2903       12210
          45        4670        4779        2904        3014       12210
          46        4780        4890        2904        3014       12321
          47        4891        5000        2904        3014       12210
          48           1        3297        2904        3014      365967
          49        3298        3454        2904        3014       17427
          50        3455        3565        2904        3014       12321
          51        3566        3675        2904        3014       12210
          52        3676        3785        2904        3014       12210
          53        3786        3896        3015        3124       12210
          54        3897        4006        3015        3124       12100
          55        4007        4117        3015        3124       12210
          56        4118        4227        3015        3124       12100
          57        4228        4337        3015        3124       12100
          58        4338        4448        3015        3124       12210
          59        4449        4558        3015        3124       12100
          60        4559        4669        3015        3124       12210
          61        4670        4779        3125        3235       12210
          62        4780        4890        3125        3235       12321
          63        4891        5000        3125        3235       12210
          64           1        3297        3125        3235      365967
          65        3298        3454        3125        3235       17427
          66        3455        3565        3125        3235       12321
          67        3566        3675        3125        3235       12210
          68        3676        3785        3125        3235       12210
          69        3786        3896        3125        3235       12321
          70        3897        4006        3236        3345       12100
          71        4007        4117        3125        3235       12321
          72        4118        4227        3236        3345       12100
          73        4228        4337        3236        3345       12100
          74        4338        4448        3236        3345       12210
          75        4449        4558        3236        3345       12100
          76        4559        4669        3236        3345       12210
          77        4670        4779        3236        3345       12100
          78        4780        4890        3236        3345       12210
          79        4891        5000        3236        3345       12100
          80           1        3297        3236        3345      362670
          81        3298        3454        3236        3345       17270
          82        3455        3565        3236        3345       12210
          83        3566        3675        3236        3345       12100
          84        3676        3785        3236        3345       12100
          85        3786        3896        3236        3345       12210
          86        3897        4006        3346        3455       12100
          87        4007        4117        3236        3345       12210
          88        4118        4227        3346        3455       12100
          89        4228        4337        3346        3455       12100
          90        4338        4448        3346        3455       12210
          91        4449        4558        3346        3455       12100
          92        4559        4669        3346        3455       12210
          93        4670        4779        3346        3455       12100
          94        4780        4890        3346        3455       12210
          95        4891        5000        3346        3455       12100
          96           1        3297        3346        3455      362670
          97        3298        3454        3346        3455       17270
          98        3455        3565        3346        3455       12210
          99        3566        3675        3346        3455       12100
         100        3676        3785        3346        3455       12100
         101        3786        3896        3346        3455       12210
         102        3897        4006        3456        3566       12210
         103        4007        4117        3346        3455       12210
         104        4118        4227        3456        3566       12210
         105        4228        4337        3456        3566       12210
         106        4338        4448        3456        3566       12321
         107        4449        4558        3456        3566       12210
         108        4559        4669        3456        3566       12321
         109        4670        4779        3456        3566       12210
         110        4780        4890        3456        3566       12321
         111        4891        5000        3456        3566       12210
         112           1        3297        3456        3566      365967
         113        3298        3454        3456        3566       17427
         114        3455        3565        3456        3566       12321
         115        3566        3675        3456        3566       12210
         116        3676        3785        3456        3566       12210
         117        3786        3896        3456        3566       12321
         118        3897        4006        3567        3682       12760
         119        4007        4117        3456        3566       12321
         120        4118        4227        3567        3682       12760
         121        4228        4337        3567        3682       12760
         122        4338        4448        3567        3682       12876
         123        4449        4558        3567        3682       12760
         124        4559        4669        3567        3682       12876
         125        4670        4779        3567        3682       12760
         126        4780        4890        3567        3682       12876
         127        4891        5000        3567        3682       12760
         128           1        3297        3567        3682      382452
         129        3298        3454        3567        3682       18212
         130        3455        3565        3567        3682       12876
         131        3566        3675        3567        3682       12760
         132        3676        3785        3567        3682       12760
         133        3786        3896        3567        3682       12876
         134        3897        4006        3683        5000      144980
         135        4007        4117        3567        3682       12876
         136        4118        4227        3683        5000      144980
         137        4228        4337        3683        5000      144980
         138        4338        4448        3683        5000      146298
         139        4449        4558        3683        5000      144980
         140        4559        4669        3683        5000      146298
         141        4670        4779        3683        5000      144980
         142        4780        4890        3683        5000      146298
         143        4891        5000        3683        5000      144980
         144           1        3297        3683        5000     4345446
         145        3298        3454        3683        5000      206926
         146        3455        3565        3683        5000      146298
         147        3566        3675        3683        5000      144980
         148        3676        3785        3683        5000      144980
         149        3786        3896        3683        5000      146298
         150        3897        4006           1        2062      226820
         151        4007        4117        3683        5000      146298
         152        4118        4227           1        2062      226820
         153        4228        4337           1        2062      226820
         154        4338        4448           1        2062      228882
         155        4449        4558           1        2062      226820
         156        4559        4669           1        2062      228882
         157        4670        4779           1        2062      226820
         158        4780        4890           1        2062      228882
         159        4891        5000           1        2062      226820
         160           1        3297           1        2062     6798414
         161        3298        3454           1        2062      323734
         162        3455        3565           1        2062      228882
         163        3566        3675           1        2062      226820
         164        3676        3785           1        2062      226820
         165        3786        3896        2063        2232       18870
         166        3897        4006        2063        2232       18700
         167        4007        4117        2063        2232       18870
         168        4118        4227        2063        2232       18700
         169        4228        4337        2063        2232       18700
         170        4338        4448        2063        2232       18870
         171        4449        4558        2063        2232       18700
         172        4559        4669        2063        2232       18870
         173        4670        4779        2063        2232       18700
         174        4780        4890        2063        2232       18870
         175        4891        5000        2063        2232       18700
         176           1        3297        2063        2232      560490
         177        3298        3454        2063        2232       26690
         178        3455        3565        2233        2351       13209
         179        3566        3675        2063        2232       18700
         180        3676        3785        2233        2351       13090
         181        3786        3896        2233        2351       13209
         182        3897        4006        2233        2351       13090
         183        4007        4117        2233        2351       13209
         184        4118        4227        2233        2351       13090
         185        4228        4337        2233        2351       13090
         186        4338        4448        2233        2351       13209
         187        4449        4558        2233        2351       13090
         188        4559        4669        2233        2351       13209
         189        4670        4779        2233        2351       13090
         190        4780        4890        2352        2462       12321
         191        4891        5000        2233        2351       13090
         192           1        3297        2352        2462      365967
         193        3298        3454        2233        2351       18683
         194        3455        3565        2352        2462       12321
         195        3566        3675        2233        2351       13090
         196        3676        3785        2352        2462       12210
         197        3786        3896        2352        2462       12321
         198        3897        4006        2463        2572       12100
         199        4007        4117        2352        2462       12321
         200        4118        4227        2463        2572       12100
         201        4228        4337        2352        2462       12210
         202        4338        4448        2463        2572       12210
         203        4449        4558        2352        2462       12210
         204        4559        4669        2463        2572       12210
         205        4670        4779        2352        2462       12210
         206        4780        4890        2573        2683       12321
         207        4891        5000        2573        2683       12210
         208           1        3297        2463        2572      362670
         209        3298        3454        2573        2683       17427
         210        3455        3565        2463        2572       12210
         211        3566        3675        2573        2683       12210
         212        3676        3785        2463        2572       12100
         213        3786        3896        2684        2793       12210
         214        3897        4006        2463        2572       12100
         215        4007        4117        2684        2793       12210
         216        4118        4227        2573        2683       12210
         217        4228        4337        2684        2793       12100
         218        4338        4448        2573        2683       12321
         219        4449        4558        2684        2793       12100
         220        4559        4669        2573        2683       12321
         221        4670        4779        2794        2903       12100
         222        4780        4890        2573        2683       12321
         223        4891        5000        2794        2903       12100
         224           1        3297        2684        2793      362670
         225        3298        3454        2794        2903       17270
         226        3455        3565        2684        2793       12210
         227        3566        3675        2794        2903       12100
         228        3676        3785        2684        2793       12100
         229        3786        3896        2904        3014       12321
         230        3897        4006        2684        2793       12100
         231        4007        4117        2904        3014       12321
         232        4118        4227        2794        2903       12100
         233        4228        4337        2904        3014       12210
         234        4338        4448        2794        2903       12210
         235        4449        4558        2904        3014       12210
         236        4559        4669        2794        2903       12210
         237        4670        4779        3015        3124       12100
         238        4780        4890        2794        2903       12210
         239        4891        5000        3015        3124       12100
         240           1        3297        2904        3014      365967
         241        3298        3454        3015        3124       17270
         242        3455        3565        2904        3014       12321
         243        3566        3675        3015        3124       12100
         244        3676        3785        2904        3014       12210
         245        3786        3896        3125        3235       12321
         246        3897        4006        2904        3014       12210
         247        4007        4117        3125        3235       12321
         248        4118        4227        3015        3124       12100
         249        4228        4337        3015        3124       12100
         250        4338        4448        3015        3124       12210
         251        4449        4558        3015        3124       12100
         252        4559        4669        3125        3235       12321
         253        4670        4779        3125        3235       12210
         254        4780        4890        3125        3235       12321
         255        4891        5000        3125        3235       12210

taliameir commented 10 months ago

Hey, I'm encountering the same issue as discussed here. I use the Archer compiler, and I resolved it in 2D using the method you described with nprocx and nprocy (thanks for that). However, in 3D, when I attempted to use nprocx, nprocy, and nprocz, I encountered this error:

aborting job:
Fatal error in PMPI_Dims_create: Invalid dimension argument, error stack:
PMPI_Dims_create(909): MPI_Dims_create(nnodes=2560, ndims=3, dims=0x7ffe03f7157c) failed
PMPI_Dims_create(897):
MPIR_Dims_create(625): Cannot partition nodes as requested
MPICH ERROR [Rank 1284] [job id 5727692.0] [Mon Jan 15 20:42:52 2024] [nid002874] - Abort(739332619) (rank 1284 in comm 0): Fatal error in PMPI_Dims_create: Invalid dimension argument, error stack:
PMPI_Dims_create(909): MPI_Dims_create(nnodes=2560, ndims=3, dims=0x7ffe381661fc) failed
PMPI_Dims_create(897):
MPIR_Dims_create(625): Cannot partition nodes as requested

And when I don't define nprocx, nprocy, nprocz, I get this error:

aborting job:
Fatal error in PMPI_Recv: Message truncated, error stack:
PMPI_Recv(177)....: MPI_Recv(buf=0x3cba600, count=1, dtype=USER, src=2557, tag=0, comm=0xc4000001, status=0x1) failed
progress_recv(232): Message from rank 2558 and tag 0 truncated; 2496 bytes received but buffer size is 2688
(unknown)(): Message truncated
MPICH ERROR [Rank 2541] [job id 5648765.0] [Fri Jan 12 11:17:51 2024] [nid001754] - Abort(537511950) (rank 2541 in comm 0): Fatal error in PMPI_Recv: Message truncated, error stack:
PMPI_Recv(177)....: MPI_Recv(buf=0x3c17540, count=1, dtype=USER, src=2557, tag=0, comm=0xc4000001, status=0x1) failed
progress_recv(232): Message from rank 2558 and tag 0 truncated; 936 bytes received but buffer size is 1008
(unknown)(): Message truncated

Status-Mirror commented 10 months ago

Hey @taliameir,

There's something unusual going on with the Archer compiler - our default stopped working after one of their updates, so I'm aware of your second bug.

The nproc workaround should work though. Your error message suggests nprocx * nprocy * nprocz doesn't equal your requested core count. Can you send the input deck and submission script?
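A quick way to check this before submitting is to multiply the three values and compare against the MPI rank count (2560 in the error above). The Python sketch below is only an arithmetic check with made-up helper names, not part of EPOCH, and the example triples are illustrative rather than taken from any input deck.

```python
from math import prod


def decomposition_matches(nprocs, nranks):
    """True if nprocx * nprocy * nprocz equals the number of MPI ranks."""
    return prod(nprocs) == nranks


def factor_triples(nranks):
    """All (nprocx, nprocy, nprocz) with nprocx <= nprocy <= nprocz and product nranks."""
    triples = []
    for a in range(1, nranks + 1):
        if a * a * a > nranks:
            break
        if nranks % a:
            continue
        rest = nranks // a
        for b in range(a, rest + 1):
            if b * b > rest:
                break
            if rest % b == 0:
                triples.append((a, b, rest // b))
    return triples


nranks = 2560  # from "nnodes=2560" in the MPI_Dims_create error
print(decomposition_matches((16, 16, 10), nranks))  # product is 2560
print(decomposition_matches((16, 16, 16), nranks))  # product is 4096
print(factor_triples(nranks)[-3:])  # a few valid choices, ordered by nprocx
```

Any triple returned by factor_triples can be assigned to nprocx, nprocy and nprocz in whichever order suits the grid.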

Cheers, Stuart