AntonisZks / Information-Systems-Software-Development-Project

A university project implementing Vamana-Indexing-Algorithm (VIA) for Approximate-Nearest-Neighbors (ANN) problem.
2 stars 0 forks source link

Euclidean Distance Optimizaton for n-Dimensions #65

Open StavrosJKw opened 1 week ago

StavrosJKw commented 1 week ago

During my research I came across this paper, which suggests a divide and conquer approach for the Approximate kNN Graph Construction. Specifically, it proposes a framework for computing an approximate kNN graph as follows: We divide the set of data points into subsets (possibly with overlaps), recursively compute the (approximate) kNN graphs for the subsets, then conquer the results into a final kNN graph.

I will try to test out the above concept in our code, possibly implementing and intergrading it in the second part of the project.

I will be posting regular updates of my progress via comments on this issue

AntonisZks commented 1 week ago

Great idea! I checked the paper and I came across with the following possible optimization: The EuclideanDistance between two points, each of them having 128 dimensions, keeps getting called several times across the application. Its computation possibly takes a significant amount of time and that is not good for our algorithm complexity. What I suggest is:

Whenever we compute a specific distance between two points A and B we store the distance of these points inside a hash table. By doing this, we make sure that whenever we need that distance again later in the program, we simply take it from the hash table instead of computing it again.

For the hash table data structure we can use the std::map structure from the STL library in the following way: Since all the calculation for the euclidean distance between two points is inside a specific function called EuclideanDistance(), I suggest we put some extra functionality before the main computation code that checks whether the requested distance has already been computed and is stored inside the hash table and if not, then compute it and store it. The code should look something like the following:

double euclideanDistance(const DataVector<float>& a, const DataVector<float>& b){
    if (a.getDimension() != b.getDimension()) {
        throw std::invalid_argument("Vectors must have the same dimension");
    }

    // Code to check if the given two points exists inside the hash table as a key, and if that is the case
    // return the value of this key pair, which is their distance.

    double sum = 0.0;
    unsigned int dimension = a.getDimension();
    for (unsigned int i = 0; i < dimension; ++i) {
        double diff = a.getDataAtIndex(i) - b.getDataAtIndex(i);
        sum += diff * diff;
    }

   // Code that will add the two given points as a key pair to the hash table, 
   // and the new computed distance as a value to this key pair

    return sqrt(sum);
}

@StavrosJKw take a look at my suggestion and tell me what you think?