gbhl / bhl-europe

Biodiversity Heritage Library Europe
http://www.bhl-europe.eu/

1.1.10 - Fuzzy Search (Simple Search) #42

Closed janahoffmann closed 12 years ago

janahoffmann commented 13 years ago

Search with incorrect search terms (approximate string matching).

janahoffmann commented 13 years ago

Previous comments: [akohlbecker] The portal should return search results even if the search term is not 100% identical to the token in the index (phonetic variant, typo, diacritics, ...). Fuzzy search can be realized by using Solr filters; see the following links for further reference:
• http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory
• http://lucene.apache.org/solr/api/org/apache/solr/analysis/PhoneticFilterFactory.html

janahoffmann commented 13 years ago

Previous comments: [fwelter] I would rank this at second priority (if we work with 3). In any case such a function should only be optional, never among the default functions, which is one of the expressed results of the user survey. http://www.planetposter.de/bhl/wien2010.htm See Question 6 there, which covered this option: "Google-like search was not preferred, even less by frequent users. Google returns too many insignificant results and is not able to search for an exact sequence of letters, incorrect spellings are automatically corrected (it is not possible to search for an uncommon spelling of a name), these are shortcomings in the Google search function and annoying for professional BHL users. BHL users seem to know exactly what they are looking for."

janahoffmann commented 13 years ago

Please keep in mind that we have to focus more on the public, and a fuzzy search might be more interesting for the public user. I think it might be a good approach to separate the default search options by user profile, such as scientist, public user, librarian. The user survey only represents the opinion of scientists and cannot be weighted the same. The fuzzy search is defined as a priority one feature.

fwelter commented 13 years ago

The hypothesis that fuzzy search might be more interesting for the "public user" (I understand this as the user who has practically not consulted BHL until today, and has practically not participated in the BHL User Survey) is speculation and probably unsubstantiated. I would not rank this as a priority one feature for various reasons:
(1) I see no documented evidence that the user will urgently need it. Priority one should be restricted to documented and urgent needs.
(2) It took Google several years to develop modes of fuzzy search, and they had to improve the function continuously. I would expect extremely sophisticated and finely tuned programs behind it. BHL-Europe does not have Google's staff capacity.
(3) Various languages would have to be considered. A search for "anales" would have to return results for "annals" in fuzzy search in English, but not necessarily in Spanish. The program would probably need to understand the language of the search query. This would be extremely difficult in the scientific context. For a Google query the program would probably assume that the user's language is that of the country associated with the IP address. This makes it easier. BHL must consider that any user in any country may be expected to speak various languages. If French words are queried by a user in Austria, the program would have to understand that French fuzzy search must be selected.
(4) We should ask the programmers for an estimate of how long it would take to develop this function. It would be necessary to be able to define and develop various degrees of fuzziness. I would anticipate that programming this feature would take much more time than the other important features, especially if no vocabularies for the scientific disciplines involved here are available.
(5) Testing would also take much more time than for any other feature. If testing shows that too many results are being returned, the degree of fuzziness would have to be reduced. This would then have to be tested again (and once again corrected, etc.). And who would have to decide on that?

In summary, I would not demand programming this feature without having obtained a realistic workload estimation by the developers.

janahoffmann commented 13 years ago

@akohlbecker: Can you comment on this (estimation of effort, etc.)? @fwelter @chris-sleep: I leave this up to the WP3 management group to decide. I understood that it does not take too much effort.

akohlbecker commented 13 years ago

First of all we should try to reach a common understanding of what is actually meant by 'Fuzzy Search - Search with incorrect search terms'. Is it the 'Google-like search' the user survey was asking about? I suppose not, since the question that was posed, "Perform simple "Google-like" keyword searches that return a long list of search results ranked by relevancy", puts the emphasis on the result list and on keyword-based search. Thus I think we cannot directly use the outcome of the survey to decide on the Fuzzy Search feature.

The Solr framework provides many so-called TokenFilters, some of which provide a fuzzy search. These filters only have to be configured; no programming is needed. However, I am not yet sure how the user can be allowed to choose between fuzzy and non-fuzzy search. This question still needs investigation. The TokenFilters in question are based on the phonetic similarity of tokens (DoubleMetaphone, Metaphone, Soundex, RefinedSoundex, Caverphone). This has the specific advantage that the approach is fairly independent of the languages being used.
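To make the phonetic approach concrete, here is a minimal sketch of one of the encoders named above, classic American Soundex, written in plain Python. It is an illustration only and is not Solr's implementation; the function name `soundex` and the sample words are chosen for this example. Two tokens count as a phonetic match when their codes are equal.

```python
# Illustrative sketch of American Soundex; not the code Solr uses.
# Consonant groups that share a digit are treated as sounding alike.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code of `word` ("" if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first.upper()
    prev = CODES.get(first, "")  # the first letter's digit also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal digits
        digit = CODES.get(ch, "")  # vowels (and unmapped letters) have no digit
        if digit and digit != prev:
            encoded += digit
        prev = digit  # a vowel resets prev, so a repeated digit is kept after it
    return (encoded + "000")[:4]  # pad or truncate to letter + 3 digits


if __name__ == "__main__":
    for name in ("Robert", "Rupert", "Tymczak"):
        print(name, soundex(name))
```

Note that the letter groups above are tuned to English pronunciation, and that letters with diacritics are simply treated like vowels in this sketch.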

fwelter commented 13 years ago

Fuzzy search had not been directly mentioned or asked about in the User Surveys, and as far as I remember no User Survey participant had the idea to mention it. We realised in our discussions in the preparation sessions in Berlin (Feb 2010) that the term "Google-like" was too inaccurate and that we had to specify which components the users should consider for their rankings. We decided to include "return a long list of search results" (which is effectively exactly what will be the result if fuzzy search is implemented, and it is probably the main reason why Google returns thousands of results when you search for something), so we can say that we indirectly asked users to rank fuzzy search. It seems that it was this detail ("long list of search results") that the users did not like.

Phonetic similarity is highly language dependent: words like "mair" and "mare" are phonetically similar in English, but not in German. English is a very problematic language in this sense.

akohlbecker commented 13 years ago

Oh, I suppose I did not express myself clearly when I wrote about the language independence of the phonetic algorithms. What I meant is that these algorithms are independent of any language-specific thesaurus; the phonetic similarity itself is of course quite different in different languages.

An alternative to allowing a fuzzy search is to build some fuzziness into the auto-complete / auto-suggest functionality (#35), which will give users guidance when entering search terms.

chris-sleep commented 13 years ago

A very quick investigation shows that Solr supports fuzzy searching using Levenshtein distance by appending ~ to the end of a search term. This determination is based on string similarity only and does not account for language variants, etc.

It's implemented without any coding needed, but not necessarily useful. I'll dig some more to see if a similar control to adding ~ to the term is viable to activate use of TokenFilters et al
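For reference, the string-similarity measure behind the ~ operator can be sketched in a few lines of Python. This is a generic textbook Levenshtein distance, not Solr's code; the helper `fuzzy_matches`, its `max_distance` parameter, and the sample index terms are hypothetical and only illustrate which tokens a query such as `anales~` could pull in.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def fuzzy_matches(term: str, index_terms: Iterable[str],
                  max_distance: int) -> List[str]:
    """Hypothetical helper: index terms within `max_distance` edits of `term`."""
    return [t for t in index_terms if levenshtein(term, t) <= max_distance]


if __name__ == "__main__":
    print(levenshtein("anales", "annals"))
    print(fuzzy_matches("anales", ["annals", "analysis", "zoology"], 2))
```

The measure looks only at characters, so the "anales" / "annals" pair from the earlier comment matches at a distance of two edits regardless of whether the query was meant as English or Spanish.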

lobajuluwa commented 13 years ago

AK: long term issue with dependency on language of the content; (which stemming to use) - Moved to March 2012.