ChenhanYu / hmlp

High-Performance Machine Learning Primitives

Nearest neighbor scaling #41

Open stharakan opened 5 years ago

stharakan commented 5 years ago

We have some questions about running nearest neighbors, specifically the distributed_nearest_neighbors.cpp example. We are focusing on the DistKernelMatrix run (second half of the script), with the following parameters:

n = 5000 d = 3 k = 64

As we increase the number of MPI tasks, we notice strange behavior. Most importantly, the code slows down significantly as more tasks are introduced. This is clear just from watching the runs, and the flops and mops reported after each neighbor iteration tell the same story:

nprocs = 1
[ RT] 31 [normal] 0 [listen] 0 [nested] 3.000E+04 flops 3.000E+04 mops
[ RT] 16 [normal] 0 [listen] 0 [nested] 6.250E+06 flops 1.250E+07 mops

nprocs = 2
[ RT] 32 [normal] 0 [listen] 0 [nested] 9.000E+04 flops 9.000E+04 mops
[ RT] 16 [normal] 0 [listen] 0 [nested] 6.250E+06 flops 1.250E+07 mops

nprocs = 4
[ RT] 36 [normal] 0 [listen] 0 [nested] 2.100E+05 flops 2.100E+05 mops
[ RT] 16 [normal] 0 [listen] 0 [nested] 6.250E+06 flops 1.250E+07 mops

nprocs = 8
[ RT] 48 [normal] 0 [listen] 0 [nested] 2.400E+05 flops 2.400E+05 mops
[ RT] 16 [normal] 0 [listen] 0 [nested] 6.250E+06 flops 1.250E+07 mops

This leads to our first question: should the number of flops grow for the same problem size as the number of tasks increases, or does this indicate a problem?

Our hypothesis is that it has something to do with the splitter warning

[WARNING] increase the middle gap to 10 percent!

which is displayed more and more frequently as the number of processes increases. We outline that below, showing gap warnings per neighbor iteration:

nprocs: 1  warnings: 0
nprocs: 2  warnings: 2
nprocs: 4  warnings: 8
nprocs: 8  warnings: 24

Unless we are misunderstanding the algorithm, the warning is displayed when there are multiple points that project to the median under that split. Our question is twofold: why would this increase with the number of tasks, and is it fixable?

For the record, this happens even when care is taken to randomize each process's data with a unique seed, so there are no duplicate points. Thanks for the help!
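For reference, here is a minimal sketch of the kind of per-rank seeding we mean (plain MPI, not the hmlp code path; the point generation in the actual example may differ):

```c++
#include <mpi.h>
#include <cstdio>
#include <random>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int d = 3;                     // dimension used in our runs
  const int n_total = 5000;            // global problem size
  const int n_local = n_total / size;  // sketch assumes size divides n_total

  // Seed each rank differently so no two ranks generate identical points.
  std::mt19937 gen(12345u + 1000u * static_cast<unsigned>(rank));
  std::uniform_real_distribution<double> dist(0.0, 1.0);

  std::vector<double> X(static_cast<size_t>(d) * n_local);
  for (auto& x : X) x = dist(gen);

  std::printf("rank %d generated %d points in %d dimensions\n",
              rank, n_local, d);
  MPI_Finalize();
  return 0;
}
```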

ChenhanYu commented 5 years ago

Thank you for the question.

** [WARNING] increase the middle gap to 10 percent! Indeed, this means that many indices are very close to the median (they do not have to be identical). However, there are two situations where you may see more such warnings.
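To give a rough idea of what this check looks like, here is a simplified, standalone sketch of counting points inside a "middle gap" around the median of the projected values (illustrative only, not the actual hmlp splitter code):

```c++
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Count how many projected values fall inside a relative "middle gap"
// around the median. A large count means many points are nearly tied
// at the median, so a clean left/right split is ambiguous.
int CountInMiddleGap(std::vector<double> proj, double gap_fraction) {
  std::sort(proj.begin(), proj.end());
  const double median = proj[proj.size() / 2];
  const double range = proj.back() - proj.front();
  const double half_gap = 0.5 * gap_fraction * range;
  int count = 0;
  for (double v : proj)
    if (std::fabs(v - median) <= half_gap) ++count;
  return count;
}

int main() {
  // Projections of points onto the splitting direction (illustrative values).
  std::vector<double> proj = {0.10, 0.49, 0.50, 0.50, 0.51, 0.90};
  const double gap = 0.10;  // "increase the middle gap to 10 percent"
  int tied = CountInMiddleGap(proj, gap);
  std::printf("%d of %zu projections lie within the 10%% middle gap\n",
              tied, proj.size());
  return 0;
}
```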

** FLOPS and MOPS I am sorry to say that the FLOPS and MOPS reported by the runtime may not be accurate. Some of our tasks currently do not report their flops and mops (tracked by #42).

** Number of tasks The number of tasks increases mainly due to the complexity of the distributed tree partitioning, so it is expected that you see more tasks with more processes.

** Slowdown in performance N = 5000 in 3 dimensions is a very small problem size. In this case, strong scaling will not hold (i.e., doubling the number of MPI processes or nodes does not necessarily reduce the execution time) because the overhead dominates.
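As a toy illustration (made-up constants, not hmlp measurements): if we model the time as useful work divided by p plus communication and per-process overhead, the modeled time stops shrinking, and eventually grows, once the work term W / p becomes comparable to the overhead:

```c++
#include <cstdio>

int main() {
  // Toy strong-scaling model: T(p) = W / p + alpha * log2(p) + beta * p.
  // W, alpha, and beta are illustrative constants only.
  const double W = 1.0;       // useful work in seconds (a "small" problem)
  const double alpha = 0.05;  // per-tree-level communication overhead
  const double beta = 0.02;   // per-process runtime overhead
  double T1 = 0.0;
  for (int p = 1; p <= 16; p *= 2) {
    int levels = 0;
    for (int q = p; q > 1; q /= 2) ++levels;  // log2(p)
    const double T = W / p + alpha * levels + beta * p;
    if (p == 1) T1 = T;
    std::printf("p = %2d  modeled time = %.3f s  speedup = %.2f\n",
                p, T, T1 / T);
  }
  return 0;
}
```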

stharakan commented 5 years ago

Thanks for the response! I didn't mention this earlier, but the problem persists on larger data and under weak scaling. I was able to run some scaling experiments today, and here is what I'm seeing. As a note, all experiments were run on the rebel nodes at ICES (requested with -N 2) and for 10 iterations.

Strong scaling

d N tasks Time (s)
3 1e6 1 80
3 1e6 2 354

Weak scaling

d N tasks Time (s)
3 1e6 1 80
3 2e6 2 784

I don't think this should be expected behavior, but I could be wrong. I ran these tests the same way as before (using the second half of distributed_nearest_neighbors.cpp).

Re the gap warnings, I thought that perhaps they should increase linearly with the number of processes. Is it expected that the increase exceeds that?

ChenhanYu commented 5 years ago

Interesting. I will take a look at this case over the weekend. I am not able to use the ICES machines, but I will try to reproduce it on Stampede2.

Regarding the warning, I will double check if the behavior is correct.

stharakan commented 5 years ago

Thanks for the response! I was able to look at the timings on Stampede2 -- I don't see a scaling issue there at all. The code runs really well. So there must be something else holding back our MPI performance on the ICES machines. I recreated the same runs as above and found the following:

Strong scaling

d N tasks Time(s)
3 1e6 1 60.6
3 1e6 2 34.5

Weak scaling

d N tasks Time(s)
3 1e6 1 60.6
3 2e6 2 68.5

So it seems there is no need to worry about the scaling, but thanks again for taking the time to respond and check it out. The only remaining confusing thing is the warnings -- they only appear when I use multiple MPI tasks, in case that helps track down the behavior.

ChenhanYu commented 5 years ago

That's great. There is indeed a bug in the distributed tree partitioning. I have fixed it, and the code will now only emit the "WARNING" when a significant number of data points share the same coordinates (or the same distances). The fix will be propagated to the master branch later.