apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

[PATCH] FuzzyTermEnum optimization and refactor [LUCENE-296] #1374

Closed asfimport closed 18 years ago

asfimport commented 19 years ago

I took a look at FuzzyTermEnum to see if it could be improved. I saw speed improvements of 20%-40% by making a couple of changes.

The patch is here: http://www.hagerfamily.com/patches/FuzzyTermEnumOptimizePatch.txt

The Patch is based on the HEAD of the CVS tree as of Oct 22, 2004.

What Changed?

Since a word is discarded when its edit distance exceeds a certain threshold, I updated the distance algorithm to abort as soon as the calculation determines that the best possible final edit distance is already above that threshold. The source code has a full explanation.
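The early-abort idea can be sketched as follows. This is a standalone illustration, not the actual patch: after each row of the standard Levenshtein table is filled, the minimum value in that row is a lower bound on the final distance (later rows can only grow from it), so the loop can bail out the moment that minimum exceeds the allowed maximum distance.

```java
public final class EarlyExitLevenshtein {

    /**
     * Returns the edit distance between a and b, or -1 as soon as it is
     * guaranteed to exceed maxDistance (the early abort).
     */
    public static int distance(String a, String b, int maxDistance) {
        int n = a.length(), m = b.length();
        // The length gap alone is a lower bound on the distance.
        if (Math.abs(n - m) > maxDistance) return -1;

        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;

        for (int i = 1; i <= n; i++) {
            curr[0] = i;
            int rowMin = i;
            for (int j = 1; j <= m; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
                rowMin = Math.min(rowMin, curr[j]);
            }
            // Every later row can only grow from the current minimum, so if
            // even the best cell already exceeds the threshold, give up now.
            if (rowMin > maxDistance) return -1;
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[m] <= maxDistance ? prev[m] : -1;
    }

    public static void main(String[] args) {
        System.out.println(distance("bridgetown", "bridgetwon", 2)); // 2
        System.out.println(distance("bridgetown", "zzzzzzzzzz", 2)); // -1, aborted early
    }
}
```

The abort pays off most for long terms that are nothing like the query, which matches the report above that the longest words saw the biggest speedup.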

I also reduced the amount of floating point math, reduced the potential space the array takes in its first dimension, removed a potential divide-by-zero error when one term is an empty string, and fixed a bug where an IllegalArgumentException was thrown only after the class was somehow initialized wrong, instead of validating the arguments up front.

The behavior is almost identical. The exception is that similarity is set to 0.0 when it is guaranteed to be below the minimum similarity.

Results

I saw the biggest improvement for longer words, which makes sense. My long word was "bridgetown" and I saw a 60% improvement on it. The improvements are largest for words farthest from the median word length in the index: short words (1-3 characters) saw a 30% improvement, and medium words (5-7 characters) saw a 10% improvement. These figures are with the prefix set to 0.


Migrated from LUCENE-296 by Jonathan Hager, resolved May 27 2006 Environment:

Operating System: All
Platform: All
asfimport commented 19 years ago

Daniel Naber (migrated from JIRA)

The "synchronized" for the similarity() method should no be necessary, should it?

I tried modifying the value of TYPICAL_LONGEST_WORD_IN_INDEX (e.g. to 1 or to 100) and I didn't see any real difference. Are you sure this particular optimization is useful?

As I mentioned, the speedup of all changes together is 20-40% in my short tests, so I'm sure this patch will be committed sooner or later.

asfimport commented 19 years ago

Jonathan Hager (migrated from JIRA)

The "synchronized" was added to the similarity() because it uses the d[][] across calls to it. If similarity() was called on the SAME FuzzyTermEnum object by two different threads, the results could not be guaranteed. How lucene uses FuzzyTermEnum, this will never happen. Currently, Lucene creates it and let's it be garbage collected within a single method. This is more defensive programming than anything else.

TYPICAL_LONGEST_WORD_IN_INDEX is used in two different contexts. The first is initializing d[][] to something smarter than [x][1]; that should save very little time, as d[][] will quickly grow to the appropriate size. The second is creating a lookup table for the max distance, so that it does not have to be recalculated for every word; the max distance is a constant for all words of the same length. I also saw only a marginal gain from this change.

I went back and estimated the actual amount of time saved by this change in isolation. On a modern machine, it is less than 200ms for 10 million words (see test below).

```java
public class Scratch {
  private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;
  private static final float[] TYPICAL_WORD_LENGTH_DISTRIBUTION = {
    0, 0.01f, 0.015f, 0.025f, 0.05f, 0.10f, 0.125f, 0.15f, 0.155f,
    0.12f, 0.10f, 0.07f, 0.04f, 0.015f, 0.010f, 0.005f, 0.005f, 0.005f
  };

  private final int[] maxDistances = new int[TYPICAL_LONGEST_WORD_IN_INDEX];

  public Scratch() {
    for (int i = 0; i < maxDistances.length; i++) {
      maxDistances[i] = calculateMaxDistance(i);
    }
  }

  private final int getMaxDistance(int m) {
    return (m < maxDistances.length) ? maxDistances[m] : calculateMaxDistance(m);
  }

  private int calculateMaxDistance(int m) {
    return (int) ((1 - 0.5) * (Math.min(7, m) + 0));
  }

  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Scratch s = new Scratch();
    final int totalNumberOfRuns = 10000000;
    for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++) {
      float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i];
      final int runsForASingleLength = (int) (v * totalNumberOfRuns);
      //System.out.println(runsForASingleLength);
      for (int j = 0; j < runsForASingleLength; j++) {
        s.getMaxDistance(i);
      }
    }

    long total = System.currentTimeMillis() - start;
    System.out.println("Time to lookup the records (in ms): " + total);

    long start2 = System.currentTimeMillis();
    for (int i = 0; i < TYPICAL_WORD_LENGTH_DISTRIBUTION.length; i++) {
      float v = TYPICAL_WORD_LENGTH_DISTRIBUTION[i];
      final int runsForASingleLength = (int) (v * totalNumberOfRuns);
      for (int j = 0; j < runsForASingleLength; j++) {
        s.calculateMaxDistance(i);
      }
    }

    long total2 = System.currentTimeMillis() - start2;
    System.out.println("Time to recalculate the records (in ms): " + total2);
  }
}
```

asfimport commented 19 years ago

Christoph Goller (migrated from JIRA)

The following observation should not get lost:

In his original mail Jonathan also mentioned that similarity can currently drop below 0.0f. A possible fix, if this should be fixed, would be to relate the Levenshtein distance to the maximum of the two term lengths instead of the minimum.
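The arithmetic behind that observation can be shown in a small standalone sketch (the method names here are illustrative, not FuzzyTermEnum's actual API): the edit distance can exceed the shorter term's length but never the longer one's, so dividing by the minimum can push similarity below zero, while dividing by the maximum keeps it in [0, 1].

```java
public final class SimilaritySketch {

    // Similarity as computed at the time: distance relative to the SHORTER term.
    // distance may exceed min(aLen, bLen), so this can go negative.
    static float similarityMin(int distance, int aLen, int bLen) {
        return 1.0f - ((float) distance / Math.min(aLen, bLen));
    }

    // The proposed fix: relate distance to the LONGER term. Edit distance can
    // never exceed max(aLen, bLen), so the result stays in [0, 1].
    static float similarityMax(int distance, int aLen, int bLen) {
        return 1.0f - ((float) distance / Math.max(aLen, bLen));
    }

    public static void main(String[] args) {
        // "ab" vs "wxyz": edit distance 4 (2 substitutions + 2 insertions),
        // shorter length 2, longer length 4.
        System.out.println(similarityMin(4, 2, 4)); // -1.0, below zero
        System.out.println(similarityMax(4, 2, 4)); // 0.0, bounded by construction
    }
}
```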

asfimport commented 19 years ago

Daniel Naber (migrated from JIRA)

Thanks, your patch has now been applied. I didn't do anything about that "similarity can be less than 0" issue. If anybody considers that a problem, please let us know and open a separate issue.