fergiemcdowall / search-index-adder

The indexing module for search-index
MIT License
5 stars 13 forks source link

Algorithm documentation #1

Open blahah opened 8 years ago

blahah commented 8 years ago

Apart from the code itself, is there a description of the algorithm written up anywhere?

fergiemcdowall commented 8 years ago

Its TF-IDF.

1) Use keys preceded by TF to derive sets of documents that are in your query (the TF keys, are sets, sorted by ID) 2) Using those sets, use keys preceded by RI to find documents with highest scores. (the RI keys are sorted by score)

(note: since search-index has gone through several iterations, these keys are now incorrectly named. "TF" should most probably be "DF" (document frequency), and "RI" should really be called "TF" (term frequency))

The keys are "namespaced" with the '○' char to deal with fielded search, and filters/categories/buckets: (fieldName + '○' + token + '○' + filter + '○' + filterKey).

I have written most of it, and since I was learning Node.js as I went, there is a lot of "grammar" that can be improved, and probably some conceptual improvements as well.

Thats pretty much the basics.

blahah commented 8 years ago

Thanks, and it's exact matching right? No stemming or normalisation beforehand?

fergiemcdowall commented 8 years ago

Yes- stemming and normalisation are pushed out into userland.

In the case of stemming there are basically 2 "caveman" strategies available in search-index:

1) You can index word variants in "hidden" fields (for example: turn "run" into "ran", "running", "runner", and exclude that field from results).

2) You can deal with it all on the query side by building up OR queries to search for every permutation of a given word: