Do you have mpi4py installed?
The relevant code is here: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L1339
It is supposed to be MPI parallelised.
I wonder if there could be rounding issues, if you have more nodes than requested live points.
Then num_live_points_todo is 0 for rank>0 and num_live_points_todo=num_live_points for rank=0.
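A rough sketch of the arithmetic (illustrative numbers only, not the actual integrator.py code, assuming the per-rank count comes from an integer division of num_live_points_missing by mpi_size):

# hypothetical illustration of the rounding issue, not code from integrator.py
mpi_size = 2560                   # e.g. 32 nodes x 80 CPUs
num_live_points_missing = 100
per_rank = num_live_points_missing // mpi_size               # 100 // 2560 == 0 for rank > 0
rank0 = num_live_points_missing - per_rank * (mpi_size - 1)  # rank 0 picks up all 100
print(per_rank, rank0)                                       # -> 0 100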
Maybe it should be
num_live_points_todo = max(1, num_live_points_missing // self.mpi_size)
if self.mpi_rank == 0:
    # rank 0 picks up what the others did not do
    num_live_points_todo = max(1, num_live_points_missing - num_live_points_todo * (self.mpi_size - 1))
However, this will produce more initial live points than requested. They will be reduced in the next iterations.
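For example (illustrative numbers only): with 10 ranks and only 5 missing points, every rank still draws at least one point, so 10 points are drawn in total:

# illustrative check of the overshoot, not code from integrator.py
mpi_size = 10
num_live_points_missing = 5
todo = [max(1, num_live_points_missing // mpi_size) for _ in range(1, mpi_size)]  # ranks 1..9
todo.insert(0, max(1, num_live_points_missing - todo[0] * (mpi_size - 1)))        # rank 0
print(todo, sum(todo))   # [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 10 -- five more than requested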
@JohannesBuchner, thanks for your quick reply! Would you suggest setting the minimum number of live points to the MPI size then?
No, it is a bug that should be fixed. If you have 10000 cores, it's not reasonable to have to use 10000 live points (slow progression).
Can you try out the fix?
I'll be back tomorrow! I also see that it was faster when I had 80 CPUs and 100 points -- then rank 0 did just 21 calls, whereas now it is doing all 100.
Maybe it should be
num_live_points_todo = max(1, num_live_points_missing // self.mpi_size)
if self.mpi_rank == 0:
    # rank 0 picks up what the others did not do
    num_live_points_todo = max(1, num_live_points_missing - num_live_points_todo * (self.mpi_size - 1))
However, this will produce more initial live points than requested. They will be reduced in the next iterations.
Shouldn't it be something like
(num_live_points_missing + self.mpi_rank) // self.mpi_size
Test case:
from dataclasses import dataclass

@dataclass
class Runner():
    mpi_size: int
    mpi_rank: int

    def live_points_todo(self, num_live_points_missing):
        return (num_live_points_missing + self.mpi_rank) // self.mpi_size

size = 10
runners = [Runner(mpi_size=size, mpi_rank=i) for i in range(size)]
missing_points = [1, 5, 10, 11, 15, 20, 25]
for missing in missing_points:
    todo = [runner.live_points_todo(missing) for runner in runners]
    print(missing, todo, sum(todo))
1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] 1
5 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] 5
10 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 10
11 [1, 1, 1, 1, 1, 1, 1, 1, 1, 2] 11
15 [1, 1, 1, 1, 1, 2, 2, 2, 2, 2] 15
20 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2] 20
25 [2, 2, 2, 2, 2, 3, 3, 3, 3, 3] 25
In your solution, if mpi_size = 20 and num_live_points_missing = 30, the 0th runner will run 11 points anyway. In my solution, runners with higher ranks get the job first -- is this bad?
PS: which is solved by
(num_live_points_missing + self.mpi_size - 1 - self.mpi_rank) // self.mpi_size
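For reference, this is the only change needed in the Runner sketch above to reproduce the table below:

    def live_points_todo(self, num_live_points_missing):
        return (num_live_points_missing + self.mpi_size - 1 - self.mpi_rank) // self.mpi_size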
1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 1
5 [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] 5
10 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 10
11 [2, 1, 1, 1, 1, 1, 1, 1, 1, 1] 11
15 [2, 2, 2, 2, 2, 1, 1, 1, 1, 1] 15
20 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2] 20
25 [3, 3, 3, 3, 3, 2, 2, 2, 2, 2] 25
Yes, this looks good. It seems like your solution has two good properties:
First of all, thanks A LOT for your awesome package!
Description
I am running ReactiveNestedSampler with a likelihood function that takes minutes to compute on a single core (it can't be parallelized) on a large MPI cluster with many nodes, 80 CPUs each. This works fine, but the first live points do not use MPI and are just calculated sequentially. When I set the minimum number of live points to 100, it takes 100 x a few minutes to get started, reporting [ultranest] Sampling 100 live points from prior ...
When I set it below 80, I get an error saying that the number of live points should be at least the number of CPUs used by MPI (I did not save the exact error message).
When that finishes, the rest of the points are calculated in parallel as expected. But when scaling to 32 nodes x 80 CPUs (2560 CPUs total), the first 100 points take as long as the remaining 25k, which is not good. What should I do?
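Roughly, the setup is of this shape (a minimal sketch rather than the actual model; the sleep call only stands in for the slow likelihood, and the script is launched with something like mpiexec -n 2560 python run.py):

import time
import numpy as np
from ultranest import ReactiveNestedSampler

param_names = ["a", "b"]

def transform(cube):
    # map the unit cube to the physical parameter range [-5, 5]
    return cube * 10 - 5

def loglike(params):
    time.sleep(60)                      # stand-in for the minutes-long model evaluation
    return -0.5 * np.sum(params**2)     # stand-in Gaussian log-likelihood

sampler = ReactiveNestedSampler(param_names, loglike, transform)
result = sampler.run(min_num_live_points=100)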
What I Did