daevaorn / djapian

High level Xapian integration for Django

Highlight query terms in search result #73

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
It is common practice to highlight the words in a search result that match
terms of the search query.

What we need is the xapian.QueryParser object used by
Indexer._parse_query(), which is already a part of ResultSet, and the query
string.

Here is the general idea:

# Inside a model's views.py:

import xapian
from django.conf import settings

...

results = MyModel.indexer.search(query).spell_correction()

# we have to call count(), get_corrected_query_string() or any other method
# which calls ResultSet._get_mset()
corrected_query = results.get_corrected_query_string()

# prepare a list of normalized search terms to be used for highlighting
results._query_parser.set_stemming_strategy(xapian.QueryParser.STEM_ALL)
search_terms = list(results._query_parser.parse_query(results._query_str))

...
# highlighting query terms in result hit, for each result hit in results

value = result.instance.<field_name>

stem = xapian.Stem(settings.DJAPIAN_STEMMING_LANG)
# some wise method of splitting text into words could be used here;
# set() keeps a repeated word from being wrapped more than once
for word in set(value.split()):
    # check whether the stemmed version of the word matches any search
    # term (the terms were stemmed above)
    if stem(word.lower()) in search_terms:
        # do highlighting here: each highlighted word is put into
        # <tag></tag>, where tag is an HTML tag name such as "strong"
        value = value.replace(word, u'<%(tag)s>%(word)s</%(tag)s>' %
                              dict(tag=tag, word=word))

Original issue reported on code.google.com by esizi...@gmail.com on 4 Aug 2009 at 4:26

GoogleCodeExporter commented 9 years ago
I've missed the idea of the issue, actually ;)

I suggest adding a helper method to ResultSet (.highlight()?) that makes it
easier to implement a highlighter for the results of searches using Djapian
and, at the same time, hides the use of internal attributes like _query_parser
and _query_str.

A generic way of "highlighting" in terms of HTML/CSS should be worked out as well.

Original comment by esizi...@gmail.com on 4 Aug 2009 at 4:37

GoogleCodeExporter commented 9 years ago
There is already a related ticket:
http://code.google.com/p/djapian/issues/detail?id=54

Original comment by daevaorn on 4 Aug 2009 at 6:34

GoogleCodeExporter commented 9 years ago
Issue 54 has been merged into this issue.

Original comment by daevaorn on 4 Aug 2009 at 6:34

GoogleCodeExporter commented 9 years ago
I don't think that they are the same. I see 2 different tasks there, even if
they could be used together (like highlighting in a snippet):

 1. providing a snippet from the result hit, where the snippet is a part of the indexed text
 2. being able to highlight search terms in a text; there is no difference between
highlighting search terms in a snippet and in a whole text (which could be a
single field or the whole document)

Original comment by esizi...@gmail.com on 5 Aug 2009 at 8:03
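
The first task above (a snippet taken from the indexed text) can be sketched independently of highlighting. This is only an illustration in plain Python, not Djapian or Xapian code: the function name `snippet`, the `radius` parameter, and the whitespace-based word splitting are all assumptions made for the example.

```python
def snippet(text, terms, radius=3):
    """Return a window of words around the first word found in terms.

    Hypothetical helper: no stemming is applied, words are split on
    whitespace, and surrounding punctuation is ignored when matching.
    """
    words = text.split()
    wanted = {t.lower() for t in terms}
    for i, word in enumerate(words):
        if word.lower().strip(".,;:!?()\"'") in wanted:
            start = max(0, i - radius)
            end = min(len(words), i + radius + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    # no hit: fall back to the beginning of the text
    return " ".join(words[: 2 * radius + 1])
```

The returned snippet is ordinary text, so whatever highlighter is applied to a full field can be applied to it unchanged.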

GoogleCodeExporter commented 9 years ago
Could anybody post a more readable snippet showing how to highlight the search terms?

Original comment by and...@polyakov.name on 19 Aug 2009 at 11:17

GoogleCodeExporter commented 9 years ago

Original comment by daevaorn on 19 Sep 2009 at 9:17

GoogleCodeExporter commented 9 years ago
I don't have a more readable snippet, but the following FAQ entry in Xapian's
wiki is related: http://trac.xapian.org/wiki/FAQ/Snippets

In particular, the
http://code.google.com/p/xappy/source/browse/trunk/xappy/highlight.py code
linked to from there ought to be very helpful in implementing this.

Original comment by boulton.rj@gmail.com on 23 May 2010 at 8:45

GoogleCodeExporter commented 9 years ago
I've committed a simple highlight implementation to the trunk. See the
HighlightTest test case in tests/search.py for a usage example.

Original comment by esizi...@gmail.com on 9 Jun 2010 at 7:18

GoogleCodeExporter commented 9 years ago
To review: r361, r362 and r364 introduce initial support for search result
highlighting.

Original comment by esizi...@gmail.com on 21 Jun 2010 at 4:09

GoogleCodeExporter commented 9 years ago
OK. I think it is good enough as a basic highlighting capability. Closed!

Original comment by daevaorn on 21 Jun 2010 at 9:04

GoogleCodeExporter commented 9 years ago
I'm sorry, but the patch is obviously broken, because the match for applying
the tag in the highlighting function is made against the stemmed query text,
which will almost always be different from the word taken from the input text.
It works only in rare, simple cases.

Original comment by ortegajo...@gmail.com on 9 Nov 2010 at 2:30

GoogleCodeExporter commented 9 years ago
The patch should work just fine. I used the same approach in my project.

The idea is that in highlight(self, text, tag="strong") we are going to check 
if the __stemmed form__ of each word from the incoming "text" (HTML page or 
other source of text) matches the __stemmed form__ of any term from the search 
query (see get_parsed_query_terms(self)), then replace all occurrences of the 
__original__ word with "<tag>word</tag>" in the "text".

Original comment by esizi...@gmail.com on 9 Nov 2010 at 3:27
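
The idea described in the comment above can be sketched in plain Python. This is a minimal illustration, not the code committed in r361-r364: `toy_stem` is a hypothetical stand-in for `xapian.Stem`, and the function takes the stemmed query terms as an argument instead of reading them from a ResultSet.

```python
import re


def toy_stem(word):
    """Hypothetical stand-in for xapian.Stem: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def highlight(text, stemmed_terms, stem=toy_stem, tag="strong"):
    """Wrap every word whose stemmed form is one of stemmed_terms."""
    terms = set(stemmed_terms)

    def wrap(match):
        word = match.group(0)
        # compare stemmed text word against stemmed query terms,
        # but emit the original word inside the tag
        if stem(word.lower()) in terms:
            return "<%s>%s</%s>" % (tag, word, tag)
        return word

    # re.sub visits each word exactly once, so a repeated word is
    # never wrapped twice
    return re.sub(r"\w+", wrap, text)
```

For example, `highlight("Searching indexes is fun", {"search", "index"})` wraps "Searching" and "indexes" and leaves the other words alone. Using `re.sub` with a callback instead of repeated `str.replace` calls is a design choice of this sketch.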

GoogleCodeExporter commented 9 years ago
You are right, that's the idea, but that's not what the code does.

results.py, line 113
113             if stem(word.lower()) in terms:

does not produce the same results as get_parsed_query_terms(word.lower()), and
that's where everything goes wrong.

One more thing, with an example:

Sometimes the text "24)Artículo" (very badly formatted text; there should be a
space or something between the 24 and 'Artículo', but that's not our problem)
does match a search for the word 'articulo' in the index, but can't be
highlighted by the code in r364. A stemmed form of that text would yield 2
different texts, 24 and 'Artículo', which should be checked against the stemmed
query text so that it can be properly highlighted. In my project, I highlight
the whole text, like <strong>24)Artículo</strong>... (I know, not the best
solution at all).

Original comment by ortegajo...@gmail.com on 9 Nov 2010 at 5:08
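
One way to handle the "24)Artículo" case is to split the text on non-word characters and compare accent-folded forms. The sketch below is an assumption about how that could look, not the r364 code: `fold` and `highlight_tokens` are hypothetical names, stemming is left out to keep the example short, and whether a given index folds accents the same way depends on how it was built.

```python
import re
import unicodedata


def fold(word):
    """Lowercase and strip combining accents, e.g. 'Artículo' -> 'articulo'."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def highlight_tokens(text, query_terms, tag="strong"):
    """Wrap each word token whose folded form matches a folded query term."""
    terms = {fold(t) for t in query_terms}
    # compose accents first so that \w+ sees 'í' as a single letter
    text = unicodedata.normalize("NFC", text)

    def wrap(match):
        word = match.group(0)
        if fold(word) in terms:
            return "<%s>%s</%s>" % (tag, word, tag)
        return word

    # \w+ splits "24)Artículo" into the tokens "24" and "Artículo"
    return re.sub(r"\w+", wrap, text)
```

With this, a search for 'articulo' wraps only 'Artículo' and leaves "24)" outside the tag.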

GoogleCodeExporter commented 9 years ago
Well, if a stemmer has been defined for the Indexer, then stem(term) should be
equivalent to the results of get_parsed_query_terms(...). The latter is also
expected to drop stop-words from the search query if a stopper has been
defined, but that's another story.

I would actually agree that this is just very basic support, which is expected
to be extended by end users until we find a better approach that supports
more search/highlight use cases. Feel free to suggest ideas and contribute your
code ;)

Original comment by esizi...@gmail.com on 9 Nov 2010 at 5:43
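
The equivalence claimed above holds only when the query side and the text side use the same stemmer. A small plain-Python sketch of the query side, with stop-word removal, illustrates this. `query_terms` and `strip_plural` are hypothetical names for this example and do not mirror Djapian's get_parsed_query_terms() implementation.

```python
def strip_plural(word):
    """Hypothetical one-rule stemmer used only for this illustration."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def query_terms(query, stem, stopwords=frozenset()):
    """Lowercase the query, drop stop-words, and stem what is left."""
    return {stem(w) for w in query.lower().split() if w not in stopwords}


terms = query_terms("the articles", strip_plural, {"the"})

# a word from the text matches only if it is stemmed the same way
matches = strip_plural("Article".lower()) in terms
```

Because "the" is dropped on the query side, occurrences of "the" in the text are not highlighted even though the user typed it.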