Closed: LalehB closed this issue 8 years ago.

I wrote a simple k-means code to test ParallelAccelerator's performance, but I got a 5x slowdown when using it. The timing results are as follows:
As you can see, it is about 5 times slower!
And here is my code:
Hi @LalehB, a brief look tells me that you are not really making use of vector/array operators. ParallelAccelerator does not accelerate explicit for-loops, and I'm sorry if this was not made clear in our README.
On the other hand, k-means is a parallel program, so we should be able to work out something here. We'll let you know what we can do.
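To illustrate the difference with a minimal, made-up sketch (not from your code): @acc leaves the explicit loop below as ordinary serial Julia, while the array-style version maps onto ParallelAccelerator's parallel array operators.
using ParallelAccelerator

@acc begin
    # Explicit loop: ParallelAccelerator does not accelerate this; it stays serial.
    function scale_loop(a)
        b = similar(a)
        for i = 1:length(a)
            b[i] = 2.0 * a[i]
        end
        return b
    end

    # Array-style: the elementwise .* maps onto ParallelAccelerator's parallel operators.
    function scale_vec(a)
        return 2.0 .* a
    end
end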
I think most of the slowdown @LalehB is seeing is ParallelAccelerator compile time. I added the line @time kmeans(Float64[0], Float64[0], 0, 0, 0, 0) before the real call in both versions of the code and ran using a standard Julia binary, and the timing results for the two kmeans calls look like this (on my 4-core desktop machine):
With ParallelAccelerator:
16.879114 seconds (26.23 M allocations: 1.264 GB, 1.71% gc time)
6.470805 seconds (581.29 k allocations: 32.099 MB, 0.08% gc time)
Without:
0.001021 seconds (2 allocations: 128 bytes, 99.77% gc time)
4.964461 seconds (102 allocations: 11.313 KB)
So it's really the 6.470805s vs. 4.964461s that we should be comparing. But more importantly, as @ninegua points out, the code is heavy on explicit for loops, which we can't do much with.
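(For reference, the warm-up is just a throwaway call before the timed one, so that the @acc compilation cost is paid outside the measurement; a sketch using the same dummy arguments as above:)
@time kmeans(Float64[0], Float64[0], 0, 0, 0, 0)              # warm-up: pays the @acc compile cost
@time kmeans(X, centroids, numCenter, numPoint, dim, iterNum) # the timing we actually care about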
(If you're willing to compile your own Julia from source, you can avoid these long compile times by using ParallelAccelerator.embed.)
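(A sketch of that workflow, assuming the ParallelAccelerator.embed() entry point works as described in the package README; it requires a Julia built from source, since it bakes the package into the system image:)
using ParallelAccelerator
ParallelAccelerator.embed()   # precompiles ParallelAccelerator into the system image of a source-built Julia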
The program actually has a few errors. For example, writing X[i] .- centroids[j] when both are 2D arrays will not give what you wanted: it indexes both arrays with a single linear index as if they were 1D. The correct way to write it is X[i,:] - centroids[j,:]. The same applies to the j loop in UpdateCenter!.
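A tiny sketch of the difference, on a made-up 3x2 array:
A = [1.0 2.0; 3.0 4.0; 5.0 6.0]   # 3x2 matrix, stored column-major
A[2]                              # linear indexing: the single element 3.0 (second in memory order)
A[2,:]                            # row slice: all of row 2, which is what the distance code needs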
So here is my revised program:
using ParallelAccelerator
@acc begin
function kmeans(X,centroids,numCenter,numPoint,dim,iterNum)
label = Array(Int64,numPoint)
distance= Array(Float64,numPoint)
fill!(distance, 10000000)
for ll = 1:iterNum
#println("Iteraton number: ", ll)
for i = 1:numPoint
for j = 1:numCenter
dist = sqrt(sum((X[i,:] .- centroids[j,:]).^2))
if dist < distance[i]
distance[i] = dist
label[i] = j
end
end
end
UpdateCenter!(X,label,centroids,numCenter,numPoint,dim)
end
return label
end
function UpdateCenter!(X,label,centroids,numCenter,numPoint,dim)
count = Array(Int64,numCenter)
for j= 1:numPoint
centroids[label[j],:] += X[j,:]
count[label[j]] +=1;
end
for f = 1:numCenter
c = count[label[f]]
centroids[f,:] /= c
end
return 1
end
end
function main()
numCenter = 5
numPoint =15
dim = 2000000
X = rand(numPoint, dim)
centroids = rand(numCenter,dim)
iterNum = 100
@time kmeans(zeros(Float64,0,0), zeros(Float64,0,0), 0, 0, 0, 0)
@time label = kmeans(X,centroids,numCenter,numPoint,dim,iterNum)
for m = 1:numPoint
println(label[m])
end
end
main()
Now, with ParallelAccelerator, it runs in 110 seconds. Without ParallelAccelerator, how should I put it, well, it is still running....
However, it appears that we are not fusing as much as we could. @DrTodd13, please have a look at this when you can.
Hi @ninegua, thanks for your response! Actually, the reason I used a very high dimension (2000000) compared to the number of points and centers (15 and 5) is to make the code more suitable for vector/array operations, so I can get speedups that way. Or am I missing something else?
Thank you @ninegua for pointing out the bug. Julia's indexing semantics are confusing here, since they differ from Python's.
@LalehB is right. With these parameters, most of the work is in the distance calculation, which is written in data-parallel style. The loops don't matter.
For the more general case of kmeans, the numPoint loop should also be written as vector-style code, I think. Maybe we need some form of parallel_for?
The result of the Julia run without ParallelAccelerator came back: 603 seconds vs. 110 seconds with ParallelAccelerator. If we manage to tighten up the fusion, it will be faster still.
If the intention was really to have a low dimension such as 2 or 3 instead of 2000000, then we should think of other ways to parallelize this computation. @LalehB, what do you think?
Actually, in real cases we have lots of data points rather than a big dimension, @ninegua (here I just used a high dimension to check ParallelAccelerator's performance). So I think being able to parallelize the loop would be better. Also, since we need to iterate many times, I agree with @ehsantn that a parallel_for would help.
Mathematically, this operation is essentially calculating the dot products of two sets of vectors, but with Euclidean distance instead of a simple multiply. This is the general form of matrix multiply, so we could provide an abstraction for it.
However, not everyone sees it this way, and providing some form of parallel_for is definitely needed. We just need to be aware that this pattern is common so we can optimize for it.
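For concreteness, in that generalized-matrix-multiply view the whole distance computation is one comprehension over the two index sets, with the multiply-accumulate replaced by a squared-difference accumulate (a sketch with made-up sizes, using the row-per-point layout from the code above):
X = rand(15, 4)          # hypothetical: 15 points, 4 features
centroids = rand(5, 4)   # hypothetical: 5 centers
# D[i,j] = Euclidean distance from point i (row i of X) to centroid j (row j of centroids)
D = [ sqrt(sum((X[i,:] .- centroids[j,:]).^2)) for i = 1:size(X,1), j = 1:size(centroids,1) ]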
I thought about this a bit more. First of all, there was another bug in the above code: the variable count must be initialized as zeros(Int64,numCenter). Secondly, the j loop in UpdateCenter! is not immediately parallelizable because of potential conflicts when incrementing the centroid at the same location. So a naive parallel_for construct is not going to help; there has to be some notation for specifying reduction variables in order to do this.
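For concreteness, the conflicting pattern is the per-point accumulation below (reusing the names from the program above): run in parallel, two points with the same label update the same row, so the counts and the accumulated centroids would have to be treated as reduction variables, which is what the @par version further down eventually does.
count = zeros(Int64, numCenter)
sums  = zeros(numCenter, dim)
for j = 1:numPoint
    c = label[j]
    sums[c,:] += X[j,:]   # every point with label c updates the same row
    count[c] += 1         # and the same counter
end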
Here is a revised version of the same program:
function nearest(X, centroids, i)
numCenter = size(centroids, 1)
label = 1
dist = 10000000
for j = 1:numCenter
d = sqrt(sum((X[i,:] .- centroids[j,:]).^2))
if d < dist
dist = d
label = j
end
end
return label
end
function kmeans(X,centroids,numCenter,numPoint,dim,iterNum)
label = Int[]
for ll = 1:iterNum
#println("Iteraton number: ", ll)
label = Int[ nearest(X, centroids, i) for i = 1:numPoint ]
UpdateCenter!(X,label,centroids,numCenter)
end
return label
end
function UpdateCenter!(X,label,centroids,numCenter)
for i = 1:numCenter
m = label .== i
centroids[i,:] = sum(X[m,:], 1) ./ sum(m)
end
end
The hope is that the comprehension will parallelize, and that the computation for centroids[i,:] can also parallelize. The above works in Julia, but since ParallelAccelerator currently doesn't support sum across a selected dimension, it will fail when @acc is put to use. Actually, I'm not sure whether we'll be able to support this kind of reduction in OpenMP. Any suggestions? @ehsantn @lkuper @DrTodd13
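To be concrete about the operation in question, in the Julia dialect used here:
A = rand(3, 4)
sum(A)      # reduction over all elements to a scalar: supported
sum(A, 1)   # reduction along dimension 1, a 1x4 array of column sums: the form that is not yet supported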
To further improve UpdateCenter!, we should move the i loop inside the sum, i.e., exchange the loop order, in the hope that the iteration along numCenter will always stay in cache. However, our current system (OpenMP in particular) doesn't seem to allow us to write a custom reduction with a flexible reduction function. I believe this is also an area we can improve upon.
In an ideal situation, eventually everything should be fused into one parallel reduction loop over numPoint. We should strive to hit this goal.
@ninegua OpenMP doesn't have an operator that gives both the minimum and its location, according to the OpenMP 4.0 spec (page 167). MPI has one, though (MPI_MINLOC), so DistributedIR can handle it easily. In any case, we can generate reductions "manually" for many user-specified reduction functions in any back-end.
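A sketch of what such a manual reduction looks like for min-with-location (serial Julia here; a parallel back-end would give each thread its own (best, loc) pair and merge the pairs with the same comparison):
function minloc(dist)
    best = Inf
    loc = 0
    for i = 1:length(dist)
        if dist[i] < best   # the reduction step: keep the smaller value and its index
            best = dist[i]
            loc = i
        end
    end
    return best, loc
end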
I think providing parallel_for with user-specified reductions is the best solution.
This version fixed a number of bugs and works with the latest ParallelAccelerator (meaning it gives the same results with and without @acc):
using ParallelAccelerator
@acc begin
function nearest(X, centroids, i)
numCenter = size(centroids, 1)
label = 1
dist = 10000000.0
for j = 1:numCenter
d = sqrt(sum((X[i,:] .- centroids[j,:]).^2))
if d < dist
dist = d
label = j
end
end
return label
end
function kmeans(X,centroids,numCenter,numPoint::Int,dim,iterNum::Int)
local label
for ll = 1:iterNum
label = Int[ nearest(X, centroids, i) for i in 1:numPoint ]
UpdateCenter!(X,label,centroids,numCenter)
end
return label
end
function UpdateCenter!(X,label,centroids,numCenter)
for i = 1:numCenter
m = label .== i
centroids[i,:] += sum(X[m,:], 1)
centroids[i,:] /= sum(m)
end
end
end
function main()
numCenter = 5
numPoint =1000
dim = 2
X = rand(numPoint, dim)
centroids = rand(numCenter,dim)
iterNum = 100
label = kmeans(X,centroids,numCenter,numPoint,dim,iterNum)
println(centroids)
end
srand(0)
main()
As discussed, it is not an ideal solution, because there are still two loops: one for producing labels and one for updating centroids. I'm working on an explicit parfor annotation with multiple reduction variables, so hopefully we'll have a resolution to this afterwards.
Paul, what about this program? It doesn't need any explicit parfor, but indmin needs to be supported. Also, fusion needs to optimize the dist array out.
function kmeans(X, centroids, dim, numPoint, numCenter, iterNum)
for l in 1:iterNum
dist = [ sqrt(sum((X[:,i].-centroids[:,j]).^2)) for i in 1:numPoint, j in 1:numCenter]
labels = [indmin(dist[i,:]) for i in 1:numPoint]
# centroids = [ centroids[:,i]+sum(X[:,labels.==i],2) for i in 1:numCenter]
for i in 1:numCenter
centroids[:,i] = (centroids[:,i]+sum(X[:,labels.==i],2))/sum(labels.==i)
end
end
return centroids
end
function main()
dim = 2
numPoint = 1000
numCenter = 5
iterNum = 100
X = rand(dim, numPoint)
centroids = rand(dim, numCenter)
kmeans(X, centroids, dim, numPoint, numCenter, iterNum)
println(centroids)
end
srand(0)
main()
I changed the data layout since Julia is column-major and features should be consecutive in memory for better locality. The output is different because the random number generation order changed; without the layout change, the output is identical to your program's, so this program is equivalent.
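In other words, with the transposed layout each point's features are adjacent in memory; a small sketch with made-up sizes:
X = rand(2, 1000)   # dim x numPoint: column i holds point i
x1 = X[:, 1]        # one point's features form a contiguous, cache-friendly slice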
The issue is not with indmin, since it is an inner loop anyway. The issue is with the loop over numCenter, which is a nested loop that should be turned inside out so that the outer loop is over numPoint; then it would fuse with the other loop that computes labels.
With d6de2cbe4e75dec98299e8a7f90f79128b028797 and related commits, we now have @par support. Here is a revised program that makes use of it. We can see that there are only two parallel loops: one that computes the accumulated centroids and counts, and one that computes the average. The data layout is still kind of reversed, but changing it would trigger a bug that I'm filing a new issue for.
using ParallelAccelerator
@acc begin
function nearest(X, centroids, i)
numCenter = size(centroids, 1)
label = 1
dist = 10000000.0
for j = 1:numCenter
d = sqrt(sum((X[i,:] .- centroids[j,:]).^2))
if d < dist
dist = d
label = j
end
end
return label
end
function kmeans(X,centroids,iterNum::Int)
numPoint = size(X, 1)
numCenter = size(centroids, 1)
dim = size(centroids, 2)
for ll = 1:iterNum
count :: Array{Int, 1} = zeros(numCenter)
new_centroids :: Array{Float64, 2} = zeros(numCenter, dim)
@par new_centroids(.+) count(.+) for i in 1:numPoint
j = nearest(X, centroids, i)
new_centroids[j,:] += X[i,:]
count[j] += 1
end
centroids = (centroids .+ new_centroids) ./ [ 1 + count[j] for j=1:numCenter,i=1:dim ]
end
return centroids
end
end
function randCenter(X,numCenter)
numPoint = size(X,1)
dim = size(X,2)
assert(numPoint >= numCenter)
centroids = zeros(numCenter,dim)
idx = randperm(numPoint)
for i = 1:numCenter
j = idx[i]
centroids[i,:] = X[j,:]
end
return centroids
end
function main()
numCenter = 5
numPoint =10000
dim = 2
iterNum = 100
srand(0)
X = rand(numPoint,dim)
centroids = randCenter(X,numCenter)
kmeans(X,centroids,0)
@time println(kmeans(X,centroids,iterNum))
end
main()