dileepajayakody / semanticvectors

Automatically exported from code.google.com/p/semanticvectors
Other
1 stars 0 forks source link

~2x performance enhancement for BinaryVector.measureOverlap #55

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
The current measureOverlap function in BinaryVector modifies the vector and is 
slower than using xorCount. 

1. Instead of doing:
    this.bitSet.xor(binaryOther.bitSet);
    double hammingDistance = this.bitSet.cardinality();
    this.bitSet.xor(binaryOther.bitSet);

You can use the xor count function in OpenBitSet:
     double hammingDistance = OpenBitSet.xorCount(this.bitSet, binaryOther.bitSet).

What is the expected output? What do you see instead?
     This improves speed and doesn't change the bit vector making this operation thread safe.  I have a multi threaded version of the search computing n^2 similarities between pairwise vectors.  

What version of the product are you using? On what operating system?
     Using 3.2 on OSX

Please provide any additional information below.  Below is a test instance 
demonstrating the speed differences.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.lucene.util.OpenBitSet;

public class XorTest {

    public static OpenBitSet generateRandomVector(Random random) {
        int dimension = 16384;
        int numEntries = 8192;
        OpenBitSet bitSet = new OpenBitSet(dimension);
        int testPlace = dimension - 1, entryCount = 0;

        // Iterate across dimension of bitSet, changing 0 to 1 if random(1) > 0.5
        // until dimension/2 1's added.
        while (entryCount < numEntries) {   
          testPlace = random.nextInt(dimension);
          if (!bitSet.fastGet(testPlace)) {
            bitSet.fastSet(testPlace);
            entryCount++;   
          }
        }
        return bitSet;
      }

    public static void main(String[] args){
        Random r = new Random();

                //generate test vectors
        List<OpenBitSet>bitarray = new ArrayList<OpenBitSet>(65536);
        for (int i = 0; i < 65536; i++){
            bitarray.add(generateRandomVector(r));
        }
        System.out.println("bit array size:" + bitarray.size());

        List<Long>distances = new ArrayList<Long>(655360);
        long totalTime = System.currentTimeMillis();
        for (int i = 0; i < 10; i++){
            OpenBitSet bitSet = bitarray.get(0);
            for (int j = 1; j < bitarray.size(); j++){
                OpenBitSet binaryOther = bitarray.get(j);
                bitSet.xor(binaryOther);
                bitSet.cardinality();
                bitSet.xor(binaryOther);
            }
        }
        totalTime = System.currentTimeMillis()-totalTime;
        System.out.println("Average time using xor:" +  totalTime/10);

        List<Long>distances2 = new ArrayList<Long>(655360);
        totalTime = System.currentTimeMillis();
        for (int i = 0; i < 10; i++){
            OpenBitSet bitSet = bitarray.get(0);
            for (int j = 1; j < bitarray.size(); j++){
                OpenBitSet binaryOther = bitarray.get(j);
                OpenBitSet.xorCount(bitSet, binaryOther);
            }
        }
        totalTime = System.currentTimeMillis()-totalTime;
        System.out.println("Average time using xorcardinality:" +  totalTime/10);
    }
}

Original issue reported on code.google.com by brant.c...@gmail.com on 1 Mar 2012 at 1:37

GoogleCodeExporter commented 9 years ago
This sounds very promising, thank you for the suggestion.

I'm cc'ing Trevor in case he knows of any reason for not incorporating this 
change immediately - I can't think of any. (I will test it for myself this 
evening.)

For now I've relabelled this as an "enhancement", it's not a bug as such, but 
we will attend to it just as dligently. 

Original comment by dwidd...@gmail.com on 1 Mar 2012 at 4:56

GoogleCodeExporter commented 9 years ago
Looks good, thanks. 

Original comment by Trever...@gmail.com on 1 Mar 2012 at 5:09

GoogleCodeExporter commented 9 years ago

Original comment by dwidd...@gmail.com on 13 Mar 2012 at 4:55