dougal / acts_as_indexed

Acts As Indexed is a plugin which provides a pain-free way to add fulltext search to your Ruby on Rails app
http://douglasfshearer.com/blog/rails-plugin-acts_as_indexed
MIT License

How is relevance determined? #30

Closed spilliton closed 12 years ago

spilliton commented 12 years ago

I have records where I only care to index their title. So I added:

acts_as_indexed :fields => [:name]

To the model (Artist) and called:

Artist.send(:build_index)

I then tried a test search:

Artist.find_with_index('Miles').each{|a| puts a.name }

Here are the first few lines output:

Miles Benjamin Anthony Robinson
John Miles
Lizzy Miles
Barry Miles
The Buddy Miles Express
Miles Away
Fuzz, Flaykes, & Shakes Vol. 1- 60 Miles High
Miles Kane
Miles
Miles From India
Miles Zuniga
Miles Davis

Seeing as I have an artist with the name "Miles" exactly, I would think that would be deemed the most relevant. Do I have something misconfigured maybe?

dougal commented 12 years ago

Hi there.

Knocked up a Rails app to test this, and I get vaguely similar results, which as you say aren't what you would expect.

I'll look at this in more detail when I have more time.

For details on the relevancy calculation, have a look at the following: https://github.com/dougal/acts_as_indexed/blob/master/lib/acts_as_indexed/search_atom.rb#L12

The linked perlmonks algorithm should help.

Thanks for reporting.

spilliton commented 12 years ago

Thanks for the speedy response!

Looking at the algorithm being used ( http://www.perlmonks.com/index.pl?node_id=27509 ), it states:

...basically what is implied by the above formula is that the weight given to term in respect to a document is higher if:

- it occurs many times in that document
- it doesn't appear that often in other documents in the collection

So in the output I'm seeing, the word 'miles' occurs equally in each result (1 time). So in the inverted index I think they would all have the same document score for 'miles' and thus have no defined order in relation to each other when being looked up by only the word 'miles'.
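To make the tie concrete, here is a minimal sketch of a generic tf-idf weight: the raw term count multiplied by log(N / document frequency). This is an assumption based on the quoted description, not the code in search_atom.rb; the method name `tf_idf_weights` and the sample names are hypothetical.

```ruby
# Generic tf-idf sketch (an assumption, NOT the plugin's implementation):
# weight = (occurrences of term in document) * log(total documents / documents containing term)
def tf_idf_weights(documents, term)
  term = term.downcase
  # Split each document into lowercase word tokens.
  tokenized = documents.map { |doc| doc.downcase.scan(/\w+/) }
  containing = tokenized.count { |tokens| tokens.include?(term) }
  return {} if containing.zero?

  idf = Math.log(documents.size.to_f / containing)
  documents.zip(tokenized).each_with_object({}) do |(doc, tokens), weights|
    count = tokens.count(term)
    weights[doc] = count * idf if count > 0
  end
end

# Hypothetical sample data: four names contain 'miles' exactly once, one does not.
names = ['Miles', 'Miles Davis', 'John Miles', 'The Buddy Miles Express', 'Nina Simone']
weights = tf_idf_weights(names, 'Miles')
weights.each { |name, weight| puts format('%-25s %.4f', name, weight) }
```

Under this weighting every matching name gets the same value, because the term count is 1 in each and the idf factor is shared, so the relative order of the matches is left undefined.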

I'm thinking this issue is only noticeable when indexing on only short strings. I'd imagine it works great when indexing larger bodies of text where words are more likely to have repeats in the same record.

In my current project I do a bit of string comparison using 'amatch' and its Levenshtein implementation, which compares two strings and outputs a value in 0..1 indicating how similar they are. Maybe after getting the initial ordered results, we can then order results with the same score by comparing them to the original search term(s).
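A rough sketch of that tie-breaking idea, assuming the search step can hand back [name, score] pairs. That input shape, the scores, and the helper names `levenshtein`, `similarity`, and `rerank` are all hypothetical. The edit distance is written in plain Ruby as a stand-in so the example is self-contained; it is not amatch's API.

```ruby
# Plain-Ruby Levenshtein edit distance (a stand-in for amatch, not its API).
def levenshtein(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      # Cheapest of deletion, insertion, substitution.
      current << [previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost].min
    end
    previous = current
  end
  previous.last
end

# Normalise the distance to a 0..1 similarity, where 1.0 means identical.
def similarity(a, b)
  a = a.downcase
  b = b.downcase
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?

  1.0 - levenshtein(a, b).to_f / longest
end

# Sort by index score first (descending), then by similarity to the query.
def rerank(scored_results, query)
  scored_results
    .sort_by { |name, score| [-score, -similarity(name, query)] }
    .map(&:first)
end

# Hypothetical input: every result carries the same made-up index score.
results = [['Miles Davis', 0.22], ['John Miles', 0.22], ['Miles', 0.22], ['Miles Kane', 0.22]]
ranked = rerank(results, 'Miles')
puts ranked
```

With equal index scores, the exact match "Miles" sorts first because its similarity to the query is 1.0. Names that are equally distant from the query remain tied with each other, so a further tie-breaker would be needed if their order matters.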

I'll fork and experiment :)