clouway / cuse

clouWay universal search engine
1 stars 7 forks source link

Word Breakdown Comparison #27

Open GrishaAngelov opened 9 years ago

GrishaAngelov commented 9 years ago

During development of searching for client device by partial serial number we came up with TextIndex which will provide a set of indexes, the all combinations of subwords for a given set of words. This is simple comparison between the two approaches (TextIndex and IndexWriter). You can use it if you find it useful.

Test word : mississippi

public class TextIndex {
  private final Set<String> words;

  public TextIndex(String... words) {
    this.words = newLinkedHashSet(newArrayList(words));
  }

  public List<String> generate() {
    Set<String> index = newLinkedHashSet();

    for (String word : words) {
      word = word.toUpperCase();

      for (int i = 0; i < word.length(); i++) {
        for (int j = i + 1; j < word.length(); j++) {
          index.add(word.substring(i, j));
        }

        index.add(word.substring(i));
      }
    }

    return newArrayList(index);
  }
}

Generated Indexes: 53

M MI MIS MISS MISSI MISSIS MISSISS MISSISSI MISSISSIP MISSISSIPP MISSISSIPPI I IS ISS ISSI ISSIS ISSISS ISSISSI ISSISSIP ISSISSIPP ISSISSIPPI S SS SSI SSIS SSISS SSISSI SSISSIP SSISSIPP SSISSIPPI SI SIS SISS SISSI SISSIP SISSIPP SISSIPPI ISSIP ISSIPP ISSIPPI SSIP SSIPP SSIPPI SIP SIPP SIPPI IP IPP IPPI P PP PPI PI

IndexWriter

Generated Indexes: 25

ssissippi issippi sissippi mis ssissip iss sippi mississ m ippi ississipp mississipp mississippi i ppi missi mississip missis mississi ssippi sissi ississippi pi mi miss

mlesikov commented 9 years ago

thank you for your suggestion - good optimization

FeNoMeNa commented 9 years ago

Here I'm attached the Production and Test classes - https://gist.github.com/FeNoMeNa/9ee6e6a0ae41c173e4f5