JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

Start-up is not parallelized #42

Closed SmirnGreg closed 3 years ago

SmirnGreg commented 3 years ago

First of all, thanks A LOT for your awesome package!

Description

I am running ReactiveNestedSampler with a likelihood function that takes minutes to compute on a single core (it cannot be parallelized internally) on a large MPI cluster with many nodes of 80 CPUs each. This works fine, except that the first live points do not use MPI and are calculated sequentially. When I set the minimum number of live points to 100, start-up takes 100 x a few minutes, reporting [ultranest] Sampling 100 live points from prior ...

When I set it below 80, I get an error saying that the number of live points must be at least the number of CPUs used by MPI (I did not save the exact error message).

Once that stage finishes, the remaining points are calculated in parallel as expected. But when scaling to 32 nodes x 80 CPUs (2560 CPUs total), the first 100 points take as long as the remaining ~25k, which is not good. What should I do?

What I Did

# within sbatch script
srun python -u my_fitting_script.py
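
For reference, the fitting script is structured roughly like this (a minimal sketch with a placeholder prior and likelihood; my real likelihood takes minutes per call):

# my_fitting_script.py -- minimal sketch, model details are placeholders
import numpy as np
import ultranest

param_names = ["a", "b"]

def prior_transform(cube):
    # map the unit hypercube to the parameter space
    return cube * 10.0

def loglike(params):
    # stand-in for the expensive, single-core likelihood
    return -0.5 * np.sum((params - 5.0) ** 2)

sampler = ultranest.ReactiveNestedSampler(param_names, loglike, transform=prior_transform)
result = sampler.run(min_num_live_points=100)
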
JohannesBuchner commented 3 years ago

Do you have mpi4py installed?
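
A quick way to verify, as a hypothetical one-off script launched with the same srun command, is to print what mpi4py sees; if every process reports size 1, MPI is not being picked up:

# check_mpi.py (hypothetical helper): srun python -u check_mpi.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
print(f"rank {comm.Get_rank()} of {comm.Get_size()}")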

JohannesBuchner commented 3 years ago

The relevant code is here: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L1339

It is supposed to be MPI parallelised.

JohannesBuchner commented 3 years ago

I wonder if there could be rounding issues, if you have more nodes than requested live points.

Then num_live_points_todo is 0 for rank>0 and num_live_points_todo=num_live_points for rank=0.
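
For concreteness, here is that failure mode with the numbers from this thread (a sketch of my reading of the split, not a verbatim copy of the integrator code):

# 2560 MPI ranks asked to draw 100 initial live points
num_live_points_missing = 100
mpi_size = 2560

shares = [num_live_points_missing // mpi_size for _ in range(mpi_size)]  # 0 for every rank
shares[0] = num_live_points_missing - shares[1] * (mpi_size - 1)         # rank 0 picks up all 100
print(shares[:3], sum(shares))  # [100, 0, 0] 100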

JohannesBuchner commented 3 years ago

Maybe it should be

            num_live_points_todo = max(1, num_live_points_missing // self.mpi_size)
            if self.mpi_rank == 0:
                # rank 0 picks up what the others did not do
                num_live_points_todo = max(1, num_live_points_missing - num_live_points_todo * (self.mpi_size - 1))

However, this will produce more initial live points than requested. They will be reduced in the next iterations.
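
With the same numbers, the max(1, ...) version gives every rank at least one point, so 2560 ranks would draw 2560 initial live points instead of the requested 100 (a quick check of the snippet above, not of the integrator itself):

num_live_points_missing = 100
mpi_size = 2560

shares = [max(1, num_live_points_missing // mpi_size) for _ in range(mpi_size)]  # 1 for every rank
shares[0] = max(1, num_live_points_missing - shares[1] * (mpi_size - 1))         # max(1, -2459) = 1
print(sum(shares))  # 2560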

SmirnGreg commented 3 years ago

@JohannesBuchner, thanks for your quick reply! Would you suggest setting the minimum number of live points to the MPI size, then?

JohannesBuchner commented 3 years ago

No, it is a bug that should be fixed. If you have 10000 cores, it's not reasonable to have to use 10000 live points (slow progression).

Can you try out the fix?

SmirnGreg commented 3 years ago

I'll be back tomorrow! I also see that it was faster when I had 80 CPUs and 100 points -- then rank 0 did just 21 calls, whereas now it is doing all 100.

SmirnGreg commented 3 years ago

Maybe it should be

            num_live_points_todo = max(1, num_live_points_missing // self.mpi_size)
            if self.mpi_rank == 0:
                # rank 0 picks up what the others did not do
                num_live_points_todo = max(1, num_live_points_missing - num_live_points_todo * (self.mpi_size - 1))

However, this will produce more initial live points than requested. They will be reduced in the next iterations.

Shouldn't it be something like

(num_live_points_missing + self.mpi_rank) // self.mpi_size

Test case:

from dataclasses import dataclass

@dataclass
class Runner:
    # stand-in for one MPI process with its world size and rank
    mpi_size: int
    mpi_rank: int

    def live_points_todo(self, num_live_points_missing):
        # proposed share of the missing initial live points for this rank
        return (num_live_points_missing + self.mpi_rank) // self.mpi_size

size = 10
runners = [Runner(mpi_size=size, mpi_rank=i) for i in range(size)]

missing_points = [1, 5, 10, 11, 15, 20, 25]
for missing in missing_points:
    todo = [runner.live_points_todo(missing) for runner in runners]
    print(missing, todo, sum(todo))
1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 1] 1
5 [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] 5
10 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 10
11 [1, 1, 1, 1, 1, 1, 1, 1, 1, 2] 11
15 [1, 1, 1, 1, 1, 2, 2, 2, 2, 2] 15
20 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2] 20
25 [2, 2, 2, 2, 2, 3, 3, 3, 3, 3] 25

In your solution, with mpi_size = 20 and num_live_points_missing = 30, the rank-0 runner will run 11 points anyway. In my solution, runners with higher ranks get the job first -- is this bad?

PS: that is solved by (num_live_points_missing + self.mpi_size - 1 - self.mpi_rank) // self.mpi_size, which hands the extra points to the lower ranks first:

1 [1, 0, 0, 0, 0, 0, 0, 0, 0, 0] 1
5 [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] 5
10 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] 10
11 [2, 1, 1, 1, 1, 1, 1, 1, 1, 1] 11
15 [2, 2, 2, 2, 2, 1, 1, 1, 1, 1] 15
20 [2, 2, 2, 2, 2, 2, 2, 2, 2, 2] 20
25 [3, 3, 3, 3, 3, 2, 2, 2, 2, 2] 25
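
For reference, the same distribution written against the names used in the snippet quoted above (a sketch of the PS formula, not the code that was eventually merged):

            # each rank computes its own share directly; lower ranks pick up the
            # remainder first, and the shares always sum to num_live_points_missing
            num_live_points_todo = (
                num_live_points_missing + self.mpi_size - 1 - self.mpi_rank
            ) // self.mpi_size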
JohannesBuchner commented 3 years ago

Yes, this looks good. It seems like your solution has two good properties: