Open HollyHuddle opened 1 year ago
Additionally, the archer2 support team have uncovered that MPI_Recv errors occur during the initial particle load. Running across two nodes with 64 cores on each node gives the same error. Attached is one of the outputs showing the error when trying to run on 2 cores, which appears as a 'PMPI_Recv: Message truncated' error. The same error occurs when the code is built with the gfortran compiler instead of the archer2 compiler.
Hey @HollyHuddle,
I've had a few people find this bug. For some reason, our automated domain decomposition isn't working with Archer2. In your attached input deck, you have a grid of 5000x5000 cells. If you run on 2 nodes with 128 cores on each, then you have 256 processing cores. I'm not sure what's going wrong here, but Archer2 is setting up a grid of 256x1 processors, meaning each processor has a local grid of approximately 20x5000 cells. There seems to be a minimum limit on how many cells are allowed on a single core, and this is causing the break.
In EPOCH, you can manually specify the domain decomposition by using nprocx and nprocy in the control block. For a 256 core simulation, I set both of these to 16, and then I can run your input deck. The aim is to create a decomposition such that each processor has a grid with roughly $n_x=n_y$ cells.
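As a rough illustration of that aim, the sketch below searches the exact factorisations of the core count for the pair whose local grids are closest to square. The function name squarest_decomposition is hypothetical and this is not how EPOCH itself picks a decomposition; it is only a quick way to choose values for nprocx and nprocy by hand.

```python
def squarest_decomposition(nx, ny, nranks):
    """Return (nprocx, nprocy) with nprocx * nprocy == nranks whose local
    grids (nx / nprocx by ny / nprocy cells) are closest to square.

    Hypothetical helper for choosing values by hand, not EPOCH's own logic.
    """
    best = None
    for nprocx in range(1, nranks + 1):
        if nranks % nprocx:
            continue  # only exact factorisations use every rank
        nprocy = nranks // nprocx
        local_nx = nx / nprocx
        local_ny = ny / nprocy
        # Aspect ratio >= 1; 1.0 means a perfectly square local grid.
        aspect = max(local_nx, local_ny) / min(local_nx, local_ny)
        if best is None or aspect < best[0]:
            best = (aspect, nprocx, nprocy)
    return best[1], best[2]


# 5000x5000 cells on 256 cores, as in the input deck discussed above.
nprocx, nprocy = squarest_decomposition(5000, 5000, 256)
print(f"nprocx = {nprocx}")
print(f"nprocy = {nprocy}")
```

For the 5000x5000 grid on 256 cores this gives 16 for both values, matching the setting used above. The two printed lines are in the key = value form used inside the control block.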
Normally EPOCH automatically chooses the best decomposition, but this hasn't worked on Archer2 ever since their most recent upgrade, and I'm struggling to figure out why that is. There are other bugs preventing particle loading from binary files which may be related. I'll keep this issue open while I continue my debugging efforts, and you can use the nprocx, nprocy workaround until then.
Cheers, Stuart
I have combed through the scripts which decide the domain decomposition, and I actually believe they are working as intended. The unusual 256x1 decomposition is the "correct" answer for a density of the form shown here:
However, EPOCH is unable to run when a 256x1 decomposition is chosen. To prevent this error from cropping up again, we need to perform the following actions:
nprocx and nprocy. I do not know exactly what causes the 256x1 crash yet, but I have found that nprocx and nprocy are set in both housekeeping/mpi_routines.F90 and housekeeping/balance.F90. The former sets a domain grid which minimises the perimeters of ranks in EPOCH2D, while the latter sets a domain grid which has roughly equal numbers of particles on each rank. I believe the check should be added to housekeeping/balance.F90.
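To illustrate why balancing on particle count can produce very thin domains, here is a minimal one-dimensional sketch. It is not the algorithm in housekeeping/balance.F90; the function balanced_slabs and the made-up particle counts are assumptions chosen only to show the effect. Columns are split into contiguous slabs of roughly equal load, so slabs covering a dense region end up only a few cells wide.

```python
def balanced_slabs(weights, nslabs):
    """Split columns 1..len(weights) into nslabs contiguous slabs of roughly
    equal total weight. Returns a list of 1-indexed (xmin, xmax) pairs.

    Illustrative only; not the routine used in EPOCH.
    """
    ncols = len(weights)
    if nslabs < 1 or nslabs > ncols:
        raise ValueError("need between 1 and len(weights) slabs")
    total = sum(weights)
    slabs = []
    start = 1
    running = 0.0
    for ix, w in enumerate(weights, start=1):
        running += w
        slabs_left = nslabs - len(slabs)   # slabs still open, including this one
        cols_left = ncols - ix             # columns not yet visited
        target = total * (len(slabs) + 1) / nslabs
        # Close the slab once the cumulative load target is reached, or when
        # the remaining columns are only just enough for the remaining slabs.
        if slabs_left > 1 and (running >= target or cols_left == slabs_left - 1):
            slabs.append((start, ix))
            start = ix + 1
    slabs.append((start, ncols))
    return slabs


# Made-up profile: 30 sparse columns (1 particle each), then 10 dense
# columns (27 particles each), shared between 4 ranks.
weights = [1] * 30 + [27] * 10
for rank, (xmin, xmax) in enumerate(balanced_slabs(weights, 4)):
    print(rank, xmin, xmax, xmax - xmin + 1)
```

The first rank takes the whole sparse region plus a little of the dense one, while the remaining ranks share the dense region in slabs of two or three columns each.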
The error itself occurs during a call to redistribute_domain in housekeeping/balance.F90, called by pre_balance_workload, which is itself called by pre_load_balance in housekeeping/setup.F90, which is called by epoch2d.F90.
Something about this domain decomposition seems to break the redistribute_domain subroutine.
Rank xmin xmax x-cells
0 1 242 242
1 243 486 244
2 487 729 243
3 730 973 244
4 974 1216 243
5 1217 1459 243
6 1460 1702 243
7 1703 1945 243
8 1946 2188 243
9 2189 2431 243
10 2432 2674 243
11 2675 2918 244
12 2919 3161 243
13 3162 3252 91
14 3253 3278 26
15 3279 3298 20
16 3299 3314 16
17 3315 3328 14
18 3329 3341 13
19 3342 3352 11
20 3353 3363 11
21 3364 3373 10
22 3374 3383 10
23 3384 3392 9
24 3393 3401 9
25 3402 3410 9
26 3411 3418 8
27 3419 3425 7
28 3426 3433 8
29 3434 3440 7
30 3441 3447 7
31 3448 3454 7
32 3455 3461 7
33 3462 3468 7
34 3469 3475 7
35 3476 3482 7
36 3483 3489 7
37 3490 3496 7
38 3497 3502 6
39 3503 3509 7
40 3510 3516 7
41 3517 3523 7
42 3524 3530 7
43 3531 3537 7
44 3538 3544 7
45 3545 3551 7
46 3552 3558 7
47 3559 3565 7
48 3566 3571 6
49 3572 3578 7
50 3579 3585 7
51 3586 3592 7
52 3593 3599 7
53 3600 3606 7
54 3607 3613 7
55 3614 3620 7
56 3621 3627 7
57 3628 3634 7
58 3635 3640 6
59 3641 3647 7
60 3648 3654 7
61 3655 3661 7
62 3662 3668 7
63 3669 3675 7
64 3676 3682 7
65 3683 3689 7
66 3690 3696 7
67 3697 3703 7
68 3704 3709 6
69 3710 3716 7
70 3717 3723 7
71 3724 3730 7
72 3731 3737 7
73 3738 3744 7
74 3745 3751 7
75 3752 3758 7
76 3759 3765 7
77 3766 3772 7
78 3773 3778 6
79 3779 3785 7
80 3786 3792 7
81 3793 3799 7
82 3800 3806 7
83 3807 3813 7
84 3814 3820 7
85 3821 3827 7
86 3828 3834 7
87 3835 3841 7
88 3842 3847 6
89 3848 3854 7
90 3855 3861 7
91 3862 3868 7
92 3869 3875 7
93 3876 3882 7
94 3883 3889 7
95 3890 3896 7
96 3897 3903 7
97 3904 3910 7
98 3911 3917 7
99 3918 3923 6
100 3924 3930 7
101 3931 3937 7
102 3938 3944 7
103 3945 3951 7
104 3952 3958 7
105 3959 3965 7
106 3966 3972 7
107 3973 3979 7
108 3980 3986 7
109 3987 3992 6
110 3993 3999 7
111 4000 4006 7
112 4007 4013 7
113 4014 4020 7
114 4021 4027 7
115 4028 4034 7
116 4035 4041 7
117 4042 4048 7
118 4049 4055 7
119 4056 4061 6
120 4062 4068 7
121 4069 4075 7
122 4076 4082 7
123 4083 4089 7
124 4090 4096 7
125 4097 4103 7
126 4104 4110 7
127 4111 4117 7
128 4118 4124 7
129 4125 4130 6
130 4131 4137 7
131 4138 4144 7
132 4145 4151 7
133 4152 4158 7
134 4159 4165 7
135 4166 4172 7
136 4173 4179 7
137 4180 4186 7
138 4187 4193 7
139 4194 4199 6
140 4200 4206 7
141 4207 4213 7
142 4214 4220 7
143 4221 4227 7
144 4228 4234 7
145 4235 4241 7
146 4242 4248 7
147 4249 4255 7
148 4256 4262 7
149 4263 4268 6
150 4269 4275 7
151 4276 4282 7
152 4283 4289 7
153 4290 4296 7
154 4297 4303 7
155 4304 4310 7
156 4311 4317 7
157 4318 4324 7
158 4325 4331 7
159 4332 4337 6
160 4338 4344 7
161 4345 4351 7
162 4352 4358 7
163 4359 4365 7
164 4366 4372 7
165 4373 4379 7
166 4380 4386 7
167 4387 4393 7
168 4394 4400 7
169 4401 4406 6
170 4407 4413 7
171 4414 4420 7
172 4421 4427 7
173 4428 4434 7
174 4435 4441 7
175 4442 4448 7
176 4449 4455 7
177 4456 4462 7
178 4463 4469 7
179 4470 4476 7
180 4477 4482 6
181 4483 4489 7
182 4490 4496 7
183 4497 4503 7
184 4504 4510 7
185 4511 4517 7
186 4518 4524 7
187 4525 4531 7
188 4532 4538 7
189 4539 4545 7
190 4546 4551 6
191 4552 4558 7
192 4559 4565 7
193 4566 4572 7
194 4573 4579 7
195 4580 4586 7
196 4587 4593 7
197 4594 4600 7
198 4601 4607 7
199 4608 4614 7
200 4615 4620 6
201 4621 4627 7
202 4628 4634 7
203 4635 4641 7
204 4642 4648 7
205 4649 4655 7
206 4656 4662 7
207 4663 4669 7
208 4670 4676 7
209 4677 4683 7
210 4684 4689 6
211 4690 4696 7
212 4697 4703 7
213 4704 4710 7
214 4711 4717 7
215 4718 4724 7
216 4725 4731 7
217 4732 4738 7
218 4739 4745 7
219 4746 4752 7
220 4753 4758 6
221 4759 4765 7
222 4766 4772 7
223 4773 4779 7
224 4780 4786 7
225 4787 4793 7
226 4794 4800 7
227 4801 4807 7
228 4808 4814 7
229 4815 4821 7
230 4822 4827 6
231 4828 4834 7
232 4835 4841 7
233 4842 4848 7
234 4849 4855 7
235 4856 4862 7
236 4863 4869 7
237 4870 4876 7
238 4877 4883 7
239 4884 4890 7
240 4891 4896 6
241 4897 4903 7
242 4904 4910 7
243 4911 4917 7
244 4918 4924 7
245 4925 4931 7
246 4932 4938 7
247 4939 4945 7
248 4946 4952 7
249 4953 4959 7
250 4960 4965 6
251 4966 4972 7
252 4973 4979 7
253 4980 4986 7
254 4987 4993 7
255 4994 5000 7
Hi Stuart, thanks for giving me a quick solution to get this running!
No problem! You can ignore my posts on this thread for now - I'm just recording some debugging info in case I need to pass this issue on to someone else.
Earlier on I gave the domain decomposition for the 256x1 rank simulation which failed - here, all ranks spanned 5000 cells in y. At the bottom of this message, I give the domain decomposition for the 16x16 grid which does not fail. Interestingly, the 16x16 simulation has the largest local domain (rank 0 has 6.80e6 cells, compared to 1.22e6 cells for the largest rank (rank 1) in the 256x1 simulation). Also, the smallest rank on 256x1 has 3.0e4 cells (6 x 5000), while the smallest on 16x16 has 1.2e4 cells. This suggests the absolute number of cells on a rank does not cause this bug, as the simulation containing both the highest and the lowest cell-count ranks still runs.
On 16x16, the lowest numbers of cells a rank has in the x and y directions are 110 (rank 4) and 110 (rank 14) respectively. On 256x1, these are 6 (rank 38) and 5000 (all). Could it be that 6 cells is too small for a rank? Maybe 5000 cells is too large for the MPI routines to work?
Rank xmin xmax ymin ymax area
0 1 3297 1 2062 6798414
1 3298 3454 2063 2232 26690
2 3455 3565 2063 2232 18870
3 3566 3675 2233 2351 13090
4 3676 3785 2233 2351 13090
5 3786 3896 1 2062 228882
6 3897 4006 2352 2462 12210
7 4007 4117 2352 2462 12321
8 4118 4227 2352 2462 12210
9 4228 4337 2352 2462 12210
10 4338 4448 2352 2462 12321
11 4449 4558 2352 2462 12210
12 4559 4669 2352 2462 12321
13 4670 4779 2463 2572 12100
14 4780 4890 2463 2572 12210
15 4891 5000 2463 2572 12100
16 1 3297 2463 2572 362670
17 3298 3454 2463 2572 17270
18 3455 3565 2463 2572 12210
19 3566 3675 2463 2572 12100
20 3676 3785 2463 2572 12100
21 3786 3896 2573 2683 12321
22 3897 4006 2573 2683 12210
23 4007 4117 2573 2683 12321
24 4118 4227 2573 2683 12210
25 4228 4337 2573 2683 12210
26 4338 4448 2573 2683 12321
27 4449 4558 2573 2683 12210
28 4559 4669 2573 2683 12321
29 4670 4779 2684 2793 12100
30 4780 4890 2684 2793 12210
31 4891 5000 2684 2793 12100
32 1 3297 2684 2793 362670
33 3298 3454 2684 2793 17270
34 3455 3565 2684 2793 12210
35 3566 3675 2684 2793 12100
36 3676 3785 2684 2793 12100
37 3786 3896 2794 2903 12210
38 3897 4006 2794 2903 12100
39 4007 4117 2794 2903 12210
40 4118 4227 2794 2903 12100
41 4228 4337 2794 2903 12100
42 4338 4448 2794 2903 12210
43 4449 4558 2794 2903 12100
44 4559 4669 2794 2903 12210
45 4670 4779 2904 3014 12210
46 4780 4890 2904 3014 12321
47 4891 5000 2904 3014 12210
48 1 3297 2904 3014 365967
49 3298 3454 2904 3014 17427
50 3455 3565 2904 3014 12321
51 3566 3675 2904 3014 12210
52 3676 3785 2904 3014 12210
53 3786 3896 3015 3124 12210
54 3897 4006 3015 3124 12100
55 4007 4117 3015 3124 12210
56 4118 4227 3015 3124 12100
57 4228 4337 3015 3124 12100
58 4338 4448 3015 3124 12210
59 4449 4558 3015 3124 12100
60 4559 4669 3015 3124 12210
61 4670 4779 3125 3235 12210
62 4780 4890 3125 3235 12321
63 4891 5000 3125 3235 12210
64 1 3297 3125 3235 365967
65 3298 3454 3125 3235 17427
66 3455 3565 3125 3235 12321
67 3566 3675 3125 3235 12210
68 3676 3785 3125 3235 12210
69 3786 3896 3125 3235 12321
70 3897 4006 3236 3345 12100
71 4007 4117 3125 3235 12321
72 4118 4227 3236 3345 12100
73 4228 4337 3236 3345 12100
74 4338 4448 3236 3345 12210
75 4449 4558 3236 3345 12100
76 4559 4669 3236 3345 12210
77 4670 4779 3236 3345 12100
78 4780 4890 3236 3345 12210
79 4891 5000 3236 3345 12100
80 1 3297 3236 3345 362670
81 3298 3454 3236 3345 17270
82 3455 3565 3236 3345 12210
83 3566 3675 3236 3345 12100
84 3676 3785 3236 3345 12100
85 3786 3896 3236 3345 12210
86 3897 4006 3346 3455 12100
87 4007 4117 3236 3345 12210
88 4118 4227 3346 3455 12100
89 4228 4337 3346 3455 12100
90 4338 4448 3346 3455 12210
91 4449 4558 3346 3455 12100
92 4559 4669 3346 3455 12210
93 4670 4779 3346 3455 12100
94 4780 4890 3346 3455 12210
95 4891 5000 3346 3455 12100
96 1 3297 3346 3455 362670
97 3298 3454 3346 3455 17270
98 3455 3565 3346 3455 12210
99 3566 3675 3346 3455 12100
100 3676 3785 3346 3455 12100
101 3786 3896 3346 3455 12210
102 3897 4006 3456 3566 12210
103 4007 4117 3346 3455 12210
104 4118 4227 3456 3566 12210
105 4228 4337 3456 3566 12210
106 4338 4448 3456 3566 12321
107 4449 4558 3456 3566 12210
108 4559 4669 3456 3566 12321
109 4670 4779 3456 3566 12210
110 4780 4890 3456 3566 12321
111 4891 5000 3456 3566 12210
112 1 3297 3456 3566 365967
113 3298 3454 3456 3566 17427
114 3455 3565 3456 3566 12321
115 3566 3675 3456 3566 12210
116 3676 3785 3456 3566 12210
117 3786 3896 3456 3566 12321
118 3897 4006 3567 3682 12760
119 4007 4117 3456 3566 12321
120 4118 4227 3567 3682 12760
121 4228 4337 3567 3682 12760
122 4338 4448 3567 3682 12876
123 4449 4558 3567 3682 12760
124 4559 4669 3567 3682 12876
125 4670 4779 3567 3682 12760
126 4780 4890 3567 3682 12876
127 4891 5000 3567 3682 12760
128 1 3297 3567 3682 382452
129 3298 3454 3567 3682 18212
130 3455 3565 3567 3682 12876
131 3566 3675 3567 3682 12760
132 3676 3785 3567 3682 12760
133 3786 3896 3567 3682 12876
134 3897 4006 3683 5000 144980
135 4007 4117 3567 3682 12876
136 4118 4227 3683 5000 144980
137 4228 4337 3683 5000 144980
138 4338 4448 3683 5000 146298
139 4449 4558 3683 5000 144980
140 4559 4669 3683 5000 146298
141 4670 4779 3683 5000 144980
142 4780 4890 3683 5000 146298
143 4891 5000 3683 5000 144980
144 1 3297 3683 5000 4345446
145 3298 3454 3683 5000 206926
146 3455 3565 3683 5000 146298
147 3566 3675 3683 5000 144980
148 3676 3785 3683 5000 144980
149 3786 3896 3683 5000 146298
150 3897 4006 1 2062 226820
151 4007 4117 3683 5000 146298
152 4118 4227 1 2062 226820
153 4228 4337 1 2062 226820
154 4338 4448 1 2062 228882
155 4449 4558 1 2062 226820
156 4559 4669 1 2062 228882
157 4670 4779 1 2062 226820
158 4780 4890 1 2062 228882
159 4891 5000 1 2062 226820
160 1 3297 1 2062 6798414
161 3298 3454 1 2062 323734
162 3455 3565 1 2062 228882
163 3566 3675 1 2062 226820
164 3676 3785 1 2062 226820
165 3786 3896 2063 2232 18870
166 3897 4006 2063 2232 18700
167 4007 4117 2063 2232 18870
168 4118 4227 2063 2232 18700
169 4228 4337 2063 2232 18700
170 4338 4448 2063 2232 18870
171 4449 4558 2063 2232 18700
172 4559 4669 2063 2232 18870
173 4670 4779 2063 2232 18700
174 4780 4890 2063 2232 18870
175 4891 5000 2063 2232 18700
176 1 3297 2063 2232 560490
177 3298 3454 2063 2232 26690
178 3455 3565 2233 2351 13209
179 3566 3675 2063 2232 18700
180 3676 3785 2233 2351 13090
181 3786 3896 2233 2351 13209
182 3897 4006 2233 2351 13090
183 4007 4117 2233 2351 13209
184 4118 4227 2233 2351 13090
185 4228 4337 2233 2351 13090
186 4338 4448 2233 2351 13209
187 4449 4558 2233 2351 13090
188 4559 4669 2233 2351 13209
189 4670 4779 2233 2351 13090
190 4780 4890 2352 2462 12321
191 4891 5000 2233 2351 13090
192 1 3297 2352 2462 365967
193 3298 3454 2233 2351 18683
194 3455 3565 2352 2462 12321
195 3566 3675 2233 2351 13090
196 3676 3785 2352 2462 12210
197 3786 3896 2352 2462 12321
198 3897 4006 2463 2572 12100
199 4007 4117 2352 2462 12321
200 4118 4227 2463 2572 12100
201 4228 4337 2352 2462 12210
202 4338 4448 2463 2572 12210
203 4449 4558 2352 2462 12210
204 4559 4669 2463 2572 12210
205 4670 4779 2352 2462 12210
206 4780 4890 2573 2683 12321
207 4891 5000 2573 2683 12210
208 1 3297 2463 2572 362670
209 3298 3454 2573 2683 17427
210 3455 3565 2463 2572 12210
211 3566 3675 2573 2683 12210
212 3676 3785 2463 2572 12100
213 3786 3896 2684 2793 12210
214 3897 4006 2463 2572 12100
215 4007 4117 2684 2793 12210
216 4118 4227 2573 2683 12210
217 4228 4337 2684 2793 12100
218 4338 4448 2573 2683 12321
219 4449 4558 2684 2793 12100
220 4559 4669 2573 2683 12321
221 4670 4779 2794 2903 12100
222 4780 4890 2573 2683 12321
223 4891 5000 2794 2903 12100
224 1 3297 2684 2793 362670
225 3298 3454 2794 2903 17270
226 3455 3565 2684 2793 12210
227 3566 3675 2794 2903 12100
228 3676 3785 2684 2793 12100
229 3786 3896 2904 3014 12321
230 3897 4006 2684 2793 12100
231 4007 4117 2904 3014 12321
232 4118 4227 2794 2903 12100
233 4228 4337 2904 3014 12210
234 4338 4448 2794 2903 12210
235 4449 4558 2904 3014 12210
236 4559 4669 2794 2903 12210
237 4670 4779 3015 3124 12100
238 4780 4890 2794 2903 12210
239 4891 5000 3015 3124 12100
240 1 3297 2904 3014 365967
241 3298 3454 3015 3124 17270
242 3455 3565 2904 3014 12321
243 3566 3675 3015 3124 12100
244 3676 3785 2904 3014 12210
245 3786 3896 3125 3235 12321
246 3897 4006 2904 3014 12210
247 4007 4117 3125 3235 12321
248 4118 4227 3015 3124 12100
249 4228 4337 3015 3124 12100
250 4338 4448 3015 3124 12210
251 4449 4558 3015 3124 12100
252 4559 4669 3125 3235 12321
253 4670 4779 3125 3235 12210
254 4780 4890 3125 3235 12321
255 4891 5000 3125 3235 12210
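The cell counts quoted in this comparison can be re-derived from the extents in the two tables. The sketch below does this for a handful of rows copied from them; the helper name cells is hypothetical, and the y range of 1 to 5000 for the 256x1 run follows from every rank spanning the full grid in y.

```python
def cells(xmin, xmax, ymin, ymax):
    """Return (nx, ny, nx * ny) for an inclusive, 1-indexed cell range."""
    nx = xmax - xmin + 1
    ny = ymax - ymin + 1
    return nx, ny, nx * ny


# rank -> (xmin, xmax), rows copied from the 256x1 table
rows_256x1 = {0: (1, 242), 1: (243, 486), 13: (3162, 3252), 38: (3497, 3502)}

# rank -> (xmin, xmax, ymin, ymax), rows copied from the 16x16 table
rows_16x16 = {
    0: (1, 3297, 1, 2062),
    4: (3676, 3785, 2233, 2351),
    13: (4670, 4779, 2463, 2572),
    14: (4780, 4890, 2463, 2572),
}

print("256x1: rank nx ny cells")
for rank, (xmin, xmax) in rows_256x1.items():
    print(rank, *cells(xmin, xmax, 1, 5000))

print("16x16: rank nx ny cells")
for rank, extents in rows_16x16.items():
    print(rank, *cells(*extents))
```

Running the same function over every row of both tables would give the minimum extent in each direction and the largest and smallest cell counts per rank.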
Hey, I'm encountering the same issue as discussed here. I use the Archer compiler, and I resolved it in 2D using the method you described with nprocx and nprocy (thanks for that). However, in 3D, when I attempted to use nprocx, nprocy, and nprocz, I encountered this error:
aborting job:
Fatal error in PMPI_Dims_create: Invalid dimension argument, error stack:
PMPI_Dims_create(909): MPI_Dims_create(nnodes=2560, ndims=3, dims=0x7ffe03f7157c) failed
PMPI_Dims_create(897): MPIR_Dims_create(625): Cannot partition nodes as requested
MPICH ERROR [Rank 1284] [job id 5727692.0] [Mon Jan 15 20:42:52 2024] [nid002874] - Abort(739332619) (rank 1284 in comm 0): Fatal error in PMPI_Dims_create: Invalid dimension argument, error stack:
PMPI_Dims_create(909): MPI_Dims_create(nnodes=2560, ndims=3, dims=0x7ffe381661fc) failed
PMPI_Dims_create(897): MPIR_Dims_create(625): Cannot partition nodes as requested
And when I don't define nprocx, nprocy and nprocz, I get this error:
aborting job:
Fatal error in PMPI_Recv: Message truncated, error stack:
PMPI_Recv(177)....: MPI_Recv(buf=0x3cba600, count=1, dtype=USER
Hey @taliameir,
There's something unusual going on with the Archer compiler - our default stopped working after one of their updates, so I'm aware of your second bug.
The nproc workaround should work though. Your error message suggests nprocx * nprocy * nprocz doesn't equal your requested core count. Can you send the input deck and submission script?
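A quick way to test this before submitting is to compare the product of the three values against the number of MPI ranks. The sketch below is a standalone check with a hypothetical helper, check_decomposition; the rank count of 2560 is the nnodes value from the error message, while the two candidate decompositions are made-up examples, not values from the actual input deck.

```python
from math import prod


def check_decomposition(nranks, dims):
    """Return an error message if dims cannot tile nranks exactly, else None."""
    if any(d < 1 for d in dims):
        return f"all entries of {dims} must be positive integers"
    requested = prod(dims)
    if requested != nranks:
        return (f"{' x '.join(map(str, dims))} = {requested} ranks, "
                f"but the job was launched on {nranks}")
    return None


nranks = 2560  # nnodes value reported in the MPI_Dims_create error
# Hypothetical (nprocx, nprocy, nprocz) choices, for illustration only.
for dims in [(16, 16, 16), (16, 16, 10)]:
    problem = check_decomposition(nranks, dims)
    print(dims, "->", problem or "ok")
```

MPI_Dims_create raises an error when the product of the dimensions that are already fixed does not divide the number of ranks, which is consistent with the "Cannot partition nodes as requested" message.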
Cheers, Stuart
Hello,
I am having an issue with getting my input deck to run on archer2 on more than 1 node. I've contacted archer2 support but wanted to see if anyone else has come across this issue. I built the latest version of the code with the following steps (guided by archer2 support):
mkdir epoch
cd epoch
git clone --recursive https://github.com/Warwick-Plasma/epoch
mv epoch epoch-4.19.2
cd epoch-4.19.2
git checkout v4.19.2
Edited "./SDF/FORTRAN/Makefile" by replacing Archer/Hector with ARCHER2
cd epoch2d
Edited "./Makefile" by replacing Archer/Hector with ARCHER2 and adding "-J../SDF/FORTRAN" to MODULEFLAG.
module load cray-python
export COMPILER=archer2
make
My input deck runs fine on 1 node with 128 cores, producing sdf files. When I run on 2 or more nodes, the job appears to be running when I check the queue, but there is absolutely no output, just an empty slurm output file.
input2dArcher.txt SubScript.txt