apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

ORC-1610: Reduce the number of hash computation in `CuckooSetBytes` #1785

Closed: cxzl25 closed this pull request 7 months ago

cxzl25 commented 7 months ago

What changes were proposed in this pull request?

Add boundary checks on `length` against the minimum and maximum key lengths stored in the set, so that lookups whose length falls outside this range can return early without computing any hashes.
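
A rough sketch of the idea (illustrative only; the class below and its HashSet delegate are stand-ins for the real CuckooSetBytes internals): remember the minimum and maximum key length at insert time, and let lookup() return false before any hash is computed when the probed length falls outside that range.

  import java.nio.ByteBuffer;
  import java.util.HashSet;
  import java.util.Set;

  // Illustrative sketch only: a HashSet stands in for the cuckoo hash tables.
  public class LengthBoundedByteSet {
    private final Set<ByteBuffer> delegate = new HashSet<>();
    private int minLen = Integer.MAX_VALUE;
    private int maxLen = -1;

    public void insert(byte[] key) {
      // track the length bounds of everything stored in the set
      minLen = Math.min(minLen, key.length);
      maxLen = Math.max(maxLen, key.length);
      delegate.add(ByteBuffer.wrap(key.clone()));  // hashing happens only here
    }

    public boolean lookup(byte[] buf, int start, int len) {
      // boundary check on the probed length: if no stored key has this length,
      // the key cannot be present, so return without computing any hash
      if (len < minLen || len > maxLen) {
        return false;
      }
      return delegate.contains(ByteBuffer.wrap(buf, start, len));
    }
  }

In CuckooSetBytes the analogous bounds check would sit in front of the cuckoo-table lookup, which is where the saved hash computations come from.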

Why are the changes needed?

https://issues.apache.org/jira/browse/HIVE-24205

This would significantly reduce the number of hash computations that need to happen; see the benchmark code and results posted in the comments below.

How was this patch tested?

GA (GitHub Actions CI)

Was this patch authored or co-authored using generative AI tooling?

No

cxzl25 commented 7 months ago

BTW, there is no official Apache Hive release with this patch yet; the only release that contains it is release-4.0.0-alpha-1, from Mar 30, 2022.

dongjoon-hyun commented 7 months ago

No, that's `alpha`. Nobody uses it seriously, @cxzl25.

cxzl25 commented 7 months ago

> No, that's `alpha`. Nobody uses it seriously.

Thanks for your comment; let me close this PR first.

> I'm wondering how you verified this PR's claim. Did you see the actual perf benefits in any way?

I wrote a simple test that generates random strings and measures insert and lookup times.

main insert:00:00:00.689
main lookup:00:00:01.124
PR insert:00:00:00.628
PR lookup:00:00:01.055

  // Benchmark-style test: uses commons-lang3 RandomStringUtils/StopWatch and the test
  // class's getByteArrays helper; fastLookup switches between the original lookup path
  // (false) and the PR's length-bounds path (true).
  @Test
  public void testLen() {
    int maxSize = 200000;
    Random gen = new Random();
    String[] strings = new String[maxSize];
    for (int i = 0; i < maxSize; i++) {
      // random strings with lengths uniformly distributed in [0, 1000)
      strings[i] = RandomStringUtils.random(Math.abs(gen.nextInt(1000)));
    }
    // convert the strings to byte arrays
    byte[][] values = getByteArrays(strings);

    StopWatch mainSW = new StopWatch();

    // load set (original behavior)
    mainSW.start();
    CuckooSetBytes main = new CuckooSetBytes(strings.length);
    main.fastLookup = false;
    for (byte[] v : values) {
      main.insert(v);
    }
    mainSW.split();
    System.out.println("main insert:" + mainSW);
    // test that the values we added are there
    // (the watch keeps running, so the printed lookup times include the insert time)
    for (byte[] v : values) {
      assertTrue(main.lookup(v, 0, v.length));
    }
    mainSW.stop();
    System.out.println("main lookup:" + mainSW);

    // same workload with the PR's length-bounds check enabled
    StopWatch prSW = new StopWatch();
    prSW.start();
    CuckooSetBytes pr = new CuckooSetBytes(strings.length);
    pr.fastLookup = true;
    for (byte[] v : values) {
      pr.insert(v);
    }
    prSW.split();
    System.out.println("PR insert:" + prSW);
    // test that the values we added are there
    for (byte[] v : values) {
      assertTrue(pr.lookup(v, 0, v.length));
    }
    prSW.stop();
    System.out.println("PR lookup:" + prSW);
  }

dongjoon-hyun commented 7 months ago

Thank you, @cxzl25. Please add the above code and results to the PR description.

dongjoon-hyun commented 7 months ago

Merged to main/2.0 for Apache ORC 2.0.0. Thank you again!