flann-lib / flann

Fast Library for Approximate Nearest Neighbors
http://people.cs.ubc.ca/~mariusm/flann
Other
2.25k stars 649 forks source link

Performance bottleneck exists at irrelevant point for searching. (HierarchicalClusteringIndex) #132

Open choni opened 11 years ago

choni commented 11 years ago

Hello, Marius.

I use HierarchicalClusteringIndex::knnSearch. Performance bottleneck was found at the point that is not essential for searching.

This is the line "delete heap;" in HierarchicalClusteringIndex::findNeighborsWithRemoved.

In my case, about 85 to 90 percent time of searching was wasted at this line.

So, I commented out "delete heap;". It brings extreme fast results and huge memory leaks. :)

But, This pointed out potential for performance up of flann library, too.

Thanks.

SHIINA, Hideaki


Detailed results is below:

Environments: MacBook Pro Retina, 2.7 GHz Intel Core i7 4cores. DataSizes: Stock = 512 bits * 7961495, Query = 512bits * 500

cores = 1
OpenMP loop [outside for] = 777 ms, [inside for] = 777
time chart of HierarchicalClusteringIndex::findNeighborsWithRemoved
time[0] = 1626 us // new heap
time[1] = 88274 us // first findNN loop
time[2] = 1673 us // second findNN loop
time[3] = 675216 us // delete heap;
time[4] = 772603 us // whole

cores = 4
OpenMP loop [outside for] = 320 ms, [inside for] = 1223
time chart of HierarchicalClusteringIndex::findNeighborsWithRemoved
time[0] = 3339 us // new heap
time[1] = 111798 us // first findNN loop
time[2] = 2086 us  // second findNN loop
time[3] = 1095565 us // delete heap;
time[4] = 1220092 us // whole

cores = 1 (Memory Leak Mode)
OpenMP loop [outside for] = 95 ms, [inside for] = 94
time chart of HierarchicalClusteringIndex::findNeighborsWithRemoved
time[0] = 9646 us // new heap
time[1] = 73453 us // first findNN loop
time[2] = 1497 us // second findNN loop
time[3] = 1073 us // blank area. comment  "// delete heap;"
time[4] = 91136 us // whole

cores = 4 (Memory Leak Mode)
OpenMP loop [outside for] = 38 ms, [inside for] = 145
time chart of HierarchicalClusteringIndex::findNeighborsWithRemoved
time[0] = 13413 us // new heap
time[1] = 111663 us // first findNN loop
time[2] = 2177 us // second findNN loop
time[3] = 1463 us // blank area. comment  "// delete heap;"
time[4] = 136474 us // whole

code of measuring time is below:

    template<bool with_removed>
    void findNeighborsWithRemoved(ResultSet<DistanceType>& result, const ElementType* vec, const SearchParams& searchParams) const
    {
uint64_t startAll = getTimeMicroSeconds();
uint64_t start = getTimeMicroSeconds();
        int maxChecks = searchParams.checks;

        // Priority queue storing intermediate branches in the best-bin-first search
        Heap<BranchSt>* heap = new Heap<BranchSt>(size_);
uint64_t ellapsed = getTimeMicroSeconds() - start;
#pragma omp atomic
dbgTime[0] += ellapsed;

start = getTimeMicroSeconds();
        DynamicBitset checked(size_);
        int checks = 0;
        for (int i=0; i<trees_; ++i) {
            findNN<with_removed>(tree_roots_[i], result, vec, checks, maxChecks, heap, checked);
        }
ellapsed = getTimeMicroSeconds() - start;
#pragma omp atomic
dbgTime[1] += ellapsed;

start = getTimeMicroSeconds();
        BranchSt branch;
        while (heap->popMin(branch) && (checks<maxChecks || !result.full())) {
            NodePtr node = branch.node;
            findNN<with_removed>(node, result, vec, checks, maxChecks, heap, checked);
        }
ellapsed = getTimeMicroSeconds() - start;
#pragma omp atomic
dbgTime[2] += ellapsed;

start = getTimeMicroSeconds();
        //delete heap;
ellapsed = getTimeMicroSeconds() - start;
#pragma omp atomic
dbgTime[3] += ellapsed;

uint64_t ellapsedAll = getTimeMicroSeconds() - startAll;
#pragma omp atomic
dbgTime[4] += ellapsedAll;
    }
choni commented 11 years ago

Hi.

I dodged the bottleneck by brute changes. And, I updated pull request.

Kind regards, SHIINA, Hideaki

sromberg commented 11 years ago

Wow. I am not quite sure if I should find the static initialization of heap pointers excellent or bad. I tend to like it as it seems to speed the search up.

Btw: int myArray[10] = { 0 }; //all elements 0

choni commented 11 years ago

Oh, I didn't know a notation like

int myArray[10] = { 0 }; //all elements 0

I changed and committed it. Thanks.

choni commented 11 years ago

In case of Linux, "delete heap" instruction don't waste time so much like case of mac. Even, in that case, Heap pool makes searching fast a little.

[ new & delete heap ] cores = 2, Linux, query=512bits * 500, knnTime = 0.0728729 sec time[0] = 19006 us // new heap time[1] = 78815 us // first findNN loop time[2] = 418 us // second findNN loop time[3] = 18888 us // delete heap; time[4] = 117326 us // whole

[ using heap pool ] cores = 2, Linux, query=512bits * 500 knnTime = 0.0455761 sec time[0] = 91 us // get heap from pool time[1] = 57978 us // first findNN loop time[2] = 383 us // second findNN loop time[3] = 122 us // put back heap time[4] = 58688 us // whole

sromberg commented 11 years ago

The timings look very good. But what matters at the end is the overall time consumption. I guess you should run a at least one real application that heavily depends on approximate nearest neighbor search (e.g. approximate k-means, vector distance computations, etc) and measure the wall time to get the overall speedup.

choni commented 11 years ago

Hello, sromberg. Thanks for reply. We always tested real application (web API). It is good feeling on the changes. But certainly, time wastes at "delete heap" are large in particular on MacOS. In another environments, speed up rates is lower than in MacOS.

I want to know there is a bad effects for processing speed or not in a particular environment.

choni commented 11 years ago

Hi, Marius.

I got another big fish!! I found a time wasting line "DynamicBitset checked(size_);" in HierarchicalClusteringIndex::findNeighborsWithRemoved.

On a trial, I wrote BitsetPool it made faster than current. I'll request for pull, soon.

My results are below:

Environment: MS Azure 8cores

// only with HeapPool knnTime = 2.52116sec, cores = 1 knnTime = 1.90988sec, cores = 2 knnTime = 1.54542sec, cores = 3 knnTime = 1.3012sec, cores = 4 knnTime = 1.18767sec, cores = 5 knnTime = 1.0411sec, cores = 6 knnTime = 1.03601sec, cores = 7 knnTime = 1.05012sec, cores = 8

// with HeapPool & BitsetPool knnTime = 0.136957sec, cores = 1 knnTime = 0.0742822sec, cores = 2 knnTime = 0.0746491sec, cores = 3 knnTime = 0.0554869sec, cores = 4 knnTime = 0.0553958sec, cores = 5 knnTime = 0.0496221sec, cores = 6 knnTime = 0.0551631sec, cores = 7 knnTime = 0.04017sec, cores = 8