IntelLabs / ParallelAccelerator.jl

The ParallelAccelerator package, part of the High Performance Scripting project at Intel Labs
BSD 2-Clause "Simplified" License
294 stars, 32 forks

K-means slow down using ParallelAccelerator #19

Closed LalehB closed 8 years ago

LalehB commented 8 years ago

I wrote a simple K-means code to test ParallelAccelerator's performance, but I got a 5× slowdown with ParallelAccelerator. The performance results are as follows:

With ParallelAccelerator:
 25.990243 seconds (27.45 M allocations: 1.301 GB, 1.29% gc time)
No  ParallelAccelerator:
 4.968666 seconds (106 allocations: 76.305 MB, 0.54% gc time)

As you can see, it is about 5 times slower!

And here is my code:

using ParallelAccelerator

@acc function kmeans(X,centroids,numCenter,numPoint,dim,iterNum)

    label = Array(Int64,numPoint)
    distance= Array(Float64,numPoint)
    centroids = rand(numCenter,dim)
    for j = 1:numPoint
        #distance[j] = typemax(Float64)
        distance[j] = 10000000
    end
    for ll = 1:iterNum
        #println("Iteration number: ", ll)
        for i = 1:numPoint  
            for j = 1:numCenter 
                dist = sqrt(sum((X[i] .- centroids[j]).^2))
                if dist < distance[i]
                    distance[i] = dist
                    label[i] = j
                end
            end
        end
        UpdateCenter!(X,label,centroids,numCenter,numPoint,dim)
     end

    return label
end

function UpdateCenter!(X,label,centroids,numCenter,numPoint,dim)
    count = Array(Int64,numCenter)
    for j= 1:numPoint
        centroids[label[j]] = centroids[label[j]] .+ X[j]
        count[label[j]] +=1;
    end
    for f = 1:numCenter
        for d = 1:dim
            centroids[f,d] /= count[label[f]]
        end
    end
    return 1
end

function main()
    numCenter = 5
    numPoint =15
    dim = 2000000
    X = rand(numPoint, dim)
    centroids = rand(numCenter,dim)
    iterNum = 100
    @time label = kmeans(X,centroids,numCenter,numPoint,dim,iterNum)

    for m = 1:numPoint
        println(label[m])
    end

end

main()
LalehB commented 8 years ago

highlight added

ninegua commented 8 years ago

Hi @LalehB, a brief look tells me that you are not really making use of vector/array operators. ParallelAccelerator does not accelerate explicit for-loops, and I'm sorry if this was not made clear in our README.

On the other hand, k-means is a parallel program, so we should be able to work out something here. We'll let you know what we can do.

lkuper commented 8 years ago

I think most of the slowdown @LalehB is seeing is ParallelAccelerator compile time. I added the line @time kmeans(Float64[0], Float64[0], 0, 0, 0, 0) before the real call in both versions of the code and ran using a standard Julia binary, and the timing results for the two kmeans calls look like this (on my 4-core desktop machine):

With ParallelAccelerator:

 16.879114 seconds (26.23 M allocations: 1.264 GB, 1.71% gc time)
  6.470805 seconds (581.29 k allocations: 32.099 MB, 0.08% gc time)

Without:

  0.001021 seconds (2 allocations: 128 bytes, 99.77% gc time)
  4.964461 seconds (102 allocations: 11.313 KB)

So it's really the 6.470805s vs. 4.964461s that we should be comparing. But more importantly, as @ninegua points out, the code is heavy on explicit for loops, which we can't do much with.

(If you're willing to compile your own Julia from source, you can avoid these long compile times by using ParallelAccelerator.embed.)

ninegua commented 8 years ago

The program actually has a few errors. For example, writing X[i] .- centroids[j] when both are 2D arrays will not give what you wanted: it indexes both arrays with a single linear index as if they were 1D. The correct way to write it is X[i,:] - centroids[j,:]. The same applies to the j loop in UpdateCenter!.
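A tiny illustration of the pitfall (not ParallelAccelerator-specific): with a single index, Julia treats a 2-D array as a flat, column-major vector.

```julia
# Linear indexing vs. row slicing on a 2-D array.
A = [1.0 2.0; 3.0 4.0]

A[2]      # single (linear, column-major) index: the element 3.0, NOT row 2
A[2, :]   # the full second row: elements 3.0 and 4.0
```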

So here is my revised program:

using ParallelAccelerator

@acc begin
function kmeans(X,centroids,numCenter,numPoint,dim,iterNum)

    label = Array(Int64,numPoint)
    distance= Array(Float64,numPoint)
    fill!(distance, 10000000)
    for ll = 1:iterNum
        #println("Iteration number: ", ll)
        for i = 1:numPoint
            for j = 1:numCenter
                dist = sqrt(sum((X[i,:] .- centroids[j,:]).^2))
                if dist < distance[i]
                    distance[i] = dist
                    label[i] = j
                end
            end
        end
        UpdateCenter!(X,label,centroids,numCenter,numPoint,dim)
     end

    return label
end

function UpdateCenter!(X,label,centroids,numCenter,numPoint,dim)
    count = Array(Int64,numCenter)
    for j= 1:numPoint
        centroids[label[j],:] += X[j,:]
        count[label[j]] +=1;
    end
    for f = 1:numCenter
        c = count[label[f]]
        centroids[f,:] /= c
    end
    return 1
end
end

function main()
    numCenter = 5
    numPoint =15
    dim = 2000000
    X = rand(numPoint, dim)
    centroids = rand(numCenter,dim)
    iterNum = 100
    @time kmeans(zeros(Float64,0,0), zeros(Float64,0,0), 0, 0, 0, 0)
    @time label = kmeans(X,centroids,numCenter,numPoint,dim,iterNum)

    for m = 1:numPoint
        println(label[m])
    end

end

main()

Now, with ParallelAccelerator, it runs in 110 seconds. Without ParallelAccelerator, how should I put it, well, it is still running....

However, it appears that we are not fusing as much as we could. @DrTodd13, please have a look at this when you can.

LalehB commented 8 years ago

Hi @ninegua, thanks for your response! Actually, the reason I used a very high dimension (2000000) compared to the number of points and centers (15 and 5) is to make the code more suitable for vector/array operations, so I can get speedups that way. Or am I missing something else?

ehsantn commented 8 years ago

Thank you @ninegua for pointing out the bug. Julia's indexing semantics are confusing since they differ from Python's.

ehsantn commented 8 years ago

@LalehB is right. With these parameters, most of the work is in the distance calculation which is written as data-parallel. The loops don't matter.

For the more general case of k-means, the numPoints loop should also be written as vector-style code, I think. Maybe we need some form of parallel_for?

ninegua commented 8 years ago

The result for the Julia run without ParallelAccelerator came back: 603 seconds, vs. 110 seconds with ParallelAccelerator. If we manage to tighten up the fusion part, it will be faster still.

If the intention was really to just have low dimension such as 2 or 3 instead of 2000000, then we should think of other ways to parallelize this computation. @LalehB what do you think?

LalehB commented 8 years ago

Actually, in the real cases we have lots of data points rather than a big dimension, @ninegua (here I just used a high dimension to check ParallelAccelerator's performance). So I think being able to parallelize the loop would be better. Also, since we need to iterate many times here, I agree with @ehsantn that a parallel_for would help.

ehsantn commented 8 years ago

Mathematically, this operation is essentially calculating the dot-products of two sets of vectors, but with Euclidean distance instead of a simple multiply. This is a general form of matrix multiply, so we could provide an abstraction for it.

However, not everyone sees it this way, and providing some form of parallel_for is definitely needed. We just need to be aware that this pattern is common, so we can optimize for it.
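As a sketch of that abstraction (the function name is hypothetical, not a ParallelAccelerator API), the distance computation has the same triple-loop shape as matrix multiply, with the multiply-accumulate swapped for a squared-difference-accumulate:

```julia
# Pairwise Euclidean distances written as a "generalized matmul":
# same i/j/d loop nest as C = A*B', but accumulating (x-c)^2 instead of x*c.
function pairwise_dist(X, C)        # X: numPoint x dim, C: numCenter x dim
    n, k = size(X, 1), size(C, 1)
    D = zeros(n, k)
    for i = 1:n, j = 1:k
        s = 0.0
        for d = 1:size(X, 2)
            s += (X[i, d] - C[j, d])^2   # matmul would do X[i,d] * C[j,d]
        end
        D[i, j] = sqrt(s)
    end
    return D
end
```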

ninegua commented 8 years ago

I thought about this a bit more. First of all, there was another bug in the above code: the variable count must be initialized as zeros(Int64,numCenter). Secondly, the j loop in UpdateCenter! is not immediately parallelizable due to potential conflicts when incrementing the same centroid location. So a naive parallel_for construct is not going to help; there has to be some way to specify reduction variables in order to do this.
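A minimal sketch of the reduction pattern such a construct would need (plain sequential Julia here; the chunk loop is where a parallel_for would go, and the merge at the end is the reduction):

```julia
# Each chunk accumulates into a private copy of the centroid sums and
# counts, so there are no conflicting increments; the merge is the reduction.
function update_center_reduced(X, label, numCenter, nchunks)
    numPoint, dim = size(X)
    partials = [zeros(numCenter, dim) for _ in 1:nchunks]  # private accumulators
    counts   = [zeros(Int, numCenter) for _ in 1:nchunks]
    for c = 1:nchunks                    # each chunk could run in parallel
        for j = c:nchunks:numPoint
            l = label[j]
            partials[c][l, :] += X[j, :]
            counts[c][l] += 1
        end
    end
    total  = sum(partials)               # merge step: elementwise (.+) reduction
    ctotal = sum(counts)
    return total, ctotal
end
```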

Here is a revised version of the same program:

function nearest(X, centroids, i)
    numCenter = size(centroids, 1)
    label = 1
    dist = 10000000
    for j = 1:numCenter
        d = sqrt(sum((X[i,:] .- centroids[j,:]).^2))
        if d < dist
            dist = d
            label = j
        end
    end
    return label
end

function kmeans(X,centroids,numCenter,numPoint,dim,iterNum)
    label = Int[]
    for ll = 1:iterNum
        #println("Iteration number: ", ll)
        label = Int[ nearest(X, centroids, i) for i = 1:numPoint ]
        UpdateCenter!(X,label,centroids,numCenter)
     end
    return label
end

function UpdateCenter!(X,label,centroids,numCenter)
    for i = 1:numCenter
        m = label .== i
        centroids[i,:] = sum(X[m,:], 1) ./ sum(m)
    end
end

The hope is that the comprehension will parallelize, and that the computation for centroids[i,:] can also parallelize. The above works in Julia, but since ParallelAccelerator currently doesn't support sum across a selected dimension, it will fail when @acc is put to use. Actually, I'm not sure if we'll be able to support this kind of reduction in OpenMP. Any suggestions? @ehsantn @lkuper @DrTodd13

To further improve UpdateCenter!, we should move the i loop inside the sum, i.e., exchange the loop order, in the hope that the iteration along numCenter will always stay in cache. However, our current system (OpenMP in particular) doesn't seem to allow us to write custom reductions with a flexible reduction function. I believe this is also an area we can improve upon.

ninegua commented 8 years ago

In an ideal situation, eventually everything should be fused into one parallel reduction loop over numPoint. We should strive to hit this goal.

ehsantn commented 8 years ago

@ninegua OpenMP doesn't have an operator that gives both the minimum and its location, according to the OpenMP 4.0 spec (page 167). MPI has one, though (MPI_MINLOC), so DistributedIR can handle it easily. In any case, we can generate reductions "manually" for many user-specified reduction functions in any back-end.
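A hedged sketch of what such a "manual" minloc reduction looks like: carry a (value, index) pair and merge with an associative combiner, which is what any back-end would have to generate:

```julia
# Merge two (distance, label) candidates; associative, so chunks of the
# point loop can be reduced independently and their results merged.
minloc(a, b) = a[1] <= b[1] ? a : b

function nearest_minloc(dists)
    best = (Inf, 0)                       # identity element for minloc
    for j = 1:length(dists)
        best = minloc(best, (dists[j], j))
    end
    return best                           # (minimum value, its index)
end
```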

I think providing parallel_for with user-specified reductions is the best solution.

ninegua commented 8 years ago

This version fixes a number of bugs and works with the latest ParallelAccelerator (meaning it gives the same results with and without @acc):

using ParallelAccelerator

@acc begin

function nearest(X, centroids, i)
    numCenter = size(centroids, 1)
    label = 1
    dist = 10000000.0
    for j = 1:numCenter
        d = sqrt(sum((X[i,:] .- centroids[j,:]).^2))
        if d < dist
            dist = d
            label = j
        end
    end
    return label
end

function kmeans(X,centroids,numCenter,numPoint::Int,dim,iterNum::Int)
    local label
    for ll = 1:iterNum
        label = Int[ nearest(X, centroids, i) for i in 1:numPoint ]
        UpdateCenter!(X,label,centroids,numCenter)
     end
    return label
end

function UpdateCenter!(X,label,centroids,numCenter)
    for i = 1:numCenter
        m = label .== i
        centroids[i,:] += sum(X[m,:], 1) 
        centroids[i,:] /= sum(m)
    end
end

end

function main()
    numCenter = 5
    numPoint =1000
    dim = 2
    X = rand(numPoint, dim)
    centroids = rand(numCenter,dim)
    iterNum = 100
    label = kmeans(X,centroids,numCenter,numPoint,dim,iterNum)
    println(centroids)
end

srand(0)
main()

As discussed, it is not an ideal solution because there are still two loops: one for producing labels, and one for updating centroids. I'm working on an explicit parfor annotation with multi-reduction variables, so hopefully we'll have a resolution afterwards.

ehsantn commented 8 years ago

Paul, what about this program? It doesn't need any explicit parfor, but indmin needs to be supported. Also, fusion needs to optimize the dist array out.

function kmeans(X, centroids, dim, numPoint, numCenter, iterNum)
    for l in 1:iterNum
        dist = [ sqrt(sum((X[:,i].-centroids[:,j]).^2)) for i in 1:numPoint, j in 1:numCenter]
        labels = [indmin(dist[i,:]) for i in 1:numPoint]
        # centroids =  [ centroids[:,i]+sum(X[:,labels.==i],2) for i in 1:numCenter]
        for i in 1:numCenter
            centroids[:,i] = (centroids[:,i]+sum(X[:,labels.==i],2))/sum(labels.==i)
        end 
    end 
    return centroids
end

function main()
    dim = 2 
    numPoint = 1000
    numCenter = 5 
    iterNum = 100 

    X = rand(dim, numPoint)
    centroids = rand(dim, numCenter)
    kmeans(X, centroids, dim, numPoint, numCenter, iterNum)
    println(centroids)
end

srand(0)
main()

I changed the data layout since Julia is column-major and features should be consecutive in memory for better locality. The output is different since the random number generation order changed. Without the layout change, the output is identical to your program, so this program is equivalent.
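A quick way to see the layout point: Julia stores arrays column by column, so a dim x numPoint layout makes each point's features contiguous in memory.

```julia
# reshape fills column-major, so columns are the contiguous unit.
A = reshape(collect(1:6), 2, 3)   # stored as 1,2 | 3,4 | 5,6
col = A[:, 2]                     # one contiguous column: 3, 4
row = A[2, :]                     # strided access: 2, 4, 6
```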

ninegua commented 8 years ago

The issue is not with indmin, since it is in an inner loop anyway. The issue is with the loop over numCenter, which is a nested loop that should be turned inside out so that the outer loop is over numPoint; then it would fuse with the other loop that computes labels.

ninegua commented 8 years ago

With d6de2cbe4e75dec98299e8a7f90f79128b028797 and related commits, we now have @par support. Here is a revised program that makes use of it. We can see that there are only two parallel loops: one that computes the accumulated centroids and counts, and one that computes the average. The data layout is still kind of reversed, but changing it would trigger a bug that I'm filing a new issue for.

using ParallelAccelerator

@acc begin

function nearest(X, centroids, i)
    numCenter = size(centroids, 1)
    label = 1
    dist = 10000000.0
    for j = 1:numCenter
        d = sqrt(sum((X[i,:] .- centroids[j,:]).^2))
        if d < dist
            dist = d
            label = j
        end
    end
    return label
end

function kmeans(X,centroids,iterNum::Int)
    numPoint = size(X, 1)
    numCenter = size(centroids, 1)
    dim = size(centroids, 2)
    for ll = 1:iterNum
        count :: Array{Int, 1} = zeros(numCenter)
        new_centroids :: Array{Float64, 2} = zeros(numCenter, dim)
        @par new_centroids(.+) count(.+) for i in 1:numPoint
            j = nearest(X, centroids, i) 
            new_centroids[j,:] += X[i,:]
            count[j] += 1
        end
        centroids = (centroids .+ new_centroids) ./ [ 1 + count[j] for j=1:numCenter,i=1:dim ]
    end
    return centroids
end

end

function randCenter(X,numCenter)
    numPoint = size(X,1)
    dim = size(X,2)
    assert(numPoint >= numCenter)
    centroids = zeros(numCenter,dim)
    idx = randperm(numPoint)
    for i = 1:numCenter
        j = idx[i]
        centroids[i,:] = X[j,:]
    end
    return centroids
end

function main()
    numCenter = 5
    numPoint =10000
    dim = 2
    iterNum = 100
    srand(0)
    X = rand(numPoint,dim)
    centroids = randCenter(X,numCenter)
    kmeans(X,centroids,0)
    @time println(kmeans(X,centroids,iterNum))
end

main()