blowhacker / snappy

Automatically exported from code.google.com/p/snappy

Change to FindMatchLength to improve performance #35

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Attached is an svn diff to FindMatchLength in snappy-internal.h that slightly 
improves performance, roughly 0-3.5%.  The usual caveats apply: the compilers and 
compiler versions used, along with the microarchitecture the code runs on, may 
or may not result in an improvement for your combination(s).  All I can say is 
that for me (Mac OS X, Core 2 T9550 @ 2.66GHz, gcc 4.2.1), it results in an 
improvement.

Basically, the change modifies FindMatchLength so that it doesn't explicitly 
keep track of the number of matched bytes; instead it increments pointers, and 
the number of characters matched is the incremented pointer minus where the 
match started.  Additionally, the primary match check was changed.  The current 
code does a 64-bit comparison and, when it fails, computes the XOR of the two 
64-bit values.  The modified code performs the XOR first and then checks whether 
the result is == 0.  If the result is != 0, there is no need to (possibly 
re-)read and XOR the values a second time.
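
For context, the main loop currently in Subversion keeps an explicit "matched" 
counter and compares the two 64-bit loads before XOR'ing them.  Roughly 
(reconstructed here for comparison; not part of the attached diff):

  // Approximate pre-patch main loop: an explicit 'matched' counter, and a
  // 64-bit equality test whose failure path XORs the same two values again
  // to locate the first differing byte.
  int matched = 0;
  while (PREDICT_TRUE(s2 <= s2_limit - 8)) {
    if (UNALIGNED_LOAD64(s2) == UNALIGNED_LOAD64(s1 + matched)) {
      s2 += 8;
      matched += 8;
    } else {
      uint64 x = UNALIGNED_LOAD64(s2) ^ UNALIGNED_LOAD64(s1 + matched);
      int matching_bits = Bits::FindLSBSetNonZero64(x);
      matched += matching_bits >> 3;
      return matched;
    }
  }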

A non-diffed version of FindMatchLength is below:

static inline int FindMatchLength(const char* s1,
                                  const char* s2,
                                  const char* s2_limit) {
  DCHECK_GE(s2_limit, s2);
  const char *ms1 = s1, *ms2 = s2;

  // Find out how long the match is. We loop over the data 64 bits at a
  // time until we find a 64-bit block that doesn't match; then we find
  // the first non-matching bit and use that to calculate the total
  // length of the match.
  while (PREDICT_TRUE(ms2 < s2_limit - 8l)) {
    uint64 x = UNALIGNED_LOAD64(ms1) ^ UNALIGNED_LOAD64(ms2);
    if (x == 0ul) {
      ms1 += 8ul;
      ms2 += 8ul;
    } else {
      // On current (mid-2008) Opteron models there is a 3% more
      // efficient code sequence to find the first non-matching byte.
      // However, what follows is ~10% better on Intel Core 2 and newer,
      // and we expect AMD's bsf instruction to improve.
      int matching_bits = Bits::FindLSBSetNonZero64(x);
      ms1 += matching_bits >> 3;
      return ms1 - s1;
    }
  }
  while (PREDICT_TRUE(ms2 < s2_limit)) {
    if (PREDICT_TRUE(*ms1 == *ms2)) {
      ++ms1;
      ++ms2;
    } else {
      return ms1 - s1;
    }
  }
  return ms1 - s1;
}
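
For anyone wanting to compare the two variants in isolation (rather than inside 
the full compressor), a minimal timing harness along these lines could be used.  
This is only a sketch; the placeholder implementations, data setup, and 
iteration count are assumptions and would need to be replaced with the actual 
pre-patch and patched FindMatchLength:

#include <chrono>
#include <cstdio>
#include <string>

// Placeholder: substitute the pre-patch FindMatchLength from snappy-internal.h.
static int FindMatchLengthOld(const char* s1, const char* s2,
                              const char* s2_limit) {
  int matched = 0;
  while (s2 < s2_limit && s1[matched] == *s2) { ++s2; ++matched; }
  return matched;
}

// Placeholder: substitute the patched FindMatchLength from the attached diff.
static int FindMatchLengthNew(const char* s1, const char* s2,
                              const char* s2_limit) {
  return FindMatchLengthOld(s1, s2, s2_limit);
}

// Times one variant over many calls on inputs with a long common prefix,
// which is the case the 64-bit loop is meant to accelerate.
template <typename Fn>
static double TimeVariant(Fn fn, const std::string& a, const std::string& b,
                          int iterations) {
  auto start = std::chrono::steady_clock::now();
  long long sink = 0;  // Keep results live so the calls aren't optimized away.
  for (int i = 0; i < iterations; ++i) {
    sink += fn(a.data(), b.data(), b.data() + b.size());
  }
  auto stop = std::chrono::steady_clock::now();
  std::fprintf(stderr, "sink=%lld\n", sink);
  return std::chrono::duration<double>(stop - start).count();
}

int main() {
  // 1 KiB inputs that match for the first 777 bytes, then diverge.
  std::string a(1024, 'x'), b = a;
  b[777] = 'y';
  const int kIterations = 1000000;
  std::printf("old: %.3f s\n", TimeVariant(FindMatchLengthOld, a, b, kIterations));
  std::printf("new: %.3f s\n", TimeVariant(FindMatchLengthNew, a, b, kIterations));
  return 0;
}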

Original issue reported on code.google.com by john.eng...@gmail.com on 22 Apr 2011 at 12:34

Attachments:

GoogleCodeExporter commented 9 years ago
Hi,

I'm unable to reproduce your results; in fact, it's ~5% slower here than what's 
currently in Subversion. Could you try with a newer version of gcc? 4.2.1 is 
pretty old these days.

FWIW, the one vs. two loads shouldn't really matter at all; the compiler should 
be able to cache the loaded values in registers where needed.
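
That is, even in the compare-then-XOR form, hoisting the two loads into locals 
makes the single read per value explicit; an optimizing compiler is generally 
expected to do the equivalent even when the loads appear twice in the source.  
(Illustrative sketch, not code from the repository.)

  // Compare-then-XOR with the loads hoisted into locals: each 64-bit value
  // is read once and reused from a register for both the comparison and the
  // XOR, which is what the compiler should arrange even without the locals.
  uint64 a = UNALIGNED_LOAD64(s1 + matched);
  uint64 b = UNALIGNED_LOAD64(s2);
  if (a == b) {
    s2 += 8;
    matched += 8;
  } else {
    matched += Bits::FindLSBSetNonZero64(a ^ b) >> 3;
    return matched;
  }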

Original comment by se...@google.com on 26 Apr 2011 at 12:48

GoogleCodeExporter commented 9 years ago

Original comment by se...@google.com on 26 Apr 2011 at 12:56

GoogleCodeExporter commented 9 years ago
Given that there's been no response for a week, and the benefits aren't 
reproducible here, I'm going to close this bug. Feel free to reopen if you can 
reproduce the improvement with more modern compilers.

Original comment by se...@google.com on 3 May 2011 at 11:28