Diacritical insensitive search

glenl commented 8 years ago

This is moved here from issue #77 by @dominicus.

From the home page, do a keyword search for "Pathétique": I get unrelated results. Go to "Advanced Search" and search for keyword="Pathétique" with no other filters: I get no results. Yet we should get at least one hit: http://www.mutopiaproject.org/cgibin/piece-info.cgi?id=299

dominicus commented 8 years ago

Yes, likely related to https://github.com/MutopiaProject/MutopiaProject/issues/553 but maybe a bit worse.

If I recall correctly, the old website was able to find the piece when diacriticals were included in the search term. At that time, I was updating "Für Elise" and added "Fur Elise" as a keyword in the "more-info" field, so that the piece could be found with either spelling. I now get the right result for "fur elise", but no hit with the diacritical.

glenl commented 8 years ago

Ugh. I just edited the CGI routine to use collation code and the performance is not acceptable.

A few things:

1. I don't think searching with the diacritical worked before, because this CGI script never decoded the CGI parameters into UTF-8; by default, CGI parameter strings are not passed in UTF-8. This is borne out in your description (passing "Für Elise" hits neither "Für Elise" nor "Fur Elise").
2. I've just added the decoding line, and a search for "Für Elise" now works.
3. This is a keyword search: the words given are broken down into a list, the list is then iterated, and every keyword must be found (logical AND) to "pass". This means "fur elise" and "elise fur" return identical matches.
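
A minimal sketch of those two steps (decode the query as UTF-8, then AND-match the split keywords), written in Python for illustration rather than taken from the Perl CGI script. The parameter name `keyword` and both function names are assumptions.

```python
from urllib.parse import parse_qs


def decode_keywords(query_string: str) -> list[str]:
    """Percent-decode a query string as UTF-8 and split the keyword field into words."""
    # The encoding argument controls how percent-escaped bytes are interpreted.
    params = parse_qs(query_string, encoding="utf-8")
    # "keyword" is a hypothetical parameter name used only for this sketch.
    return params.get("keyword", [""])[0].split()


def matches_all(keywords: list[str], text: str) -> bool:
    """Logical AND: every keyword must occur somewhere in the text."""
    return all(word in text for word in keywords)


words = decode_keywords("keyword=F%C3%BCr+Elise")
print(words, matches_all(words, "Für Elise"))
```

Because the match is a plain AND over the word list, the order of the keywords does not affect the result. This match is still exact, so "Fur" would not find "Für".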

Let's move forward and create some requirements for making this happen correctly (using Für Elise as the example):

1. A keyword with a diacritical should match its corresponding piece ("Für Elise" would match).
2. A normalized keyword should match ("Fur Elise" would match the same pieces as above).
3. The search should be case insensitive (search input "fur elise" or "für elise" or "FUR ELIse" or "FÜR ELISE" would return the same matches).
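
One way to express these three requirements as a checkable rule is to fold both the query and the target text to a common form before comparing. The sketch below is Python using Unicode decomposition, not the Perl collation routines; the function name `fold` is an assumption.

```python
import unicodedata


def fold(text: str) -> str:
    """Return text case-folded with combining diacritical marks removed."""
    # NFKD splits "ü" into "u" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


for query in ("Für Elise", "Fur Elise", "fur elise", "FUR ELIse", "FÜR ELISE"):
    print(query, "->", fold(query))
```

All five inputs fold to the same string, so they would match the same pieces. Letters that have no decomposition, such as "ø", pass through unchanged and would need a separate mapping.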

The Perl Collate routines work in my tests. They are not fast, so now it is about performance tweaking, and I have already done some basic things. I am confident I can make it somewhat faster, but not as fast as basic text matching. Here are some ideas; I'll work on the first while the other two are considered:

1. Improvement with code treatment: Don't do collation matches on things that don't matter (LilyPond version, Mutopia ID, copyright, meter, style, date, opus, instrument), and put more filters in front of the collation search (the keyword search is currently done before the version, instrument and style filters).
2. Improvement with foreknowledge: This search is very linear, so there is a substantial speedup if you go to "Advanced Search", choose "Beethoven" as the composer, then search for "Für Elise". Basically, the collate::match routine would only be called on pieces by Beethoven instead of all 2000 pieces.
3. Improvement with UI sugar: One obvious performance workaround would be to have the keyword search (in the jumbotron) always use a case-sensitive, non-collated (exact) search (keyword "Für" would match "Für" but not "für" or "fur"). The advanced search panel could then have a check-box for "diacritical-insensitive search", for which you might expect a longer search time.
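
The effect of putting cheap filters in front of the expensive match (ideas 1 and 2) can be illustrated with a small sketch. It is Python, it uses a Unicode-folding function as a stand-in for the collation match, and the records, IDs, and names are made up for the example.

```python
import unicodedata


def fold(text):
    """Stand-in for the costly collated match: strip diacritics and case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


# Hypothetical records; the IDs are invented for this sketch.
PIECES = [
    {"id": 1, "composer": "Beethoven", "title": "Für Elise"},
    {"id": 2, "composer": "Beethoven", "title": "Sonate Pathétique"},
    {"id": 3, "composer": "Bach", "title": "Invention 1"},
]


def search(pieces, keywords, composer=None):
    """Apply the cheap composer filter first, then the costly folded match."""
    wanted = [fold(word) for word in keywords]
    hits, expensive_calls = [], 0
    for piece in pieces:
        if composer is not None and piece["composer"] != composer:
            continue  # rejected by an exact comparison, no folding needed
        expensive_calls += 1
        title = fold(piece["title"])
        if all(word in title for word in wanted):
            hits.append(piece["id"])
    return hits, expensive_calls


print(search(PIECES, ["fur", "elise"]))
print(search(PIECES, ["fur", "elise"], composer="Beethoven"))
```

Both calls return the same hit, but the second one runs the expensive step only on the pieces that survive the composer filter.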

dominicus commented 8 years ago

Would it be an option to pre-process the search target datafile, so that it is stripped of diacriticals and shifted to lowercase beforehand? User-submitted search keywords would also be cleared of diacriticals and shifted to lowercase before the search is run against the target datafile to identify matching piece IDs.

glenl commented 8 years ago

We have a certain amount of looseness in our search cache, right? It is not the archive; it is a data set that is used to find references within our archive. If I understand you correctly, you would be making our search cache a true keyword search engine: the cache is built so that it is free of diacriticals, then search input is stripped of diacriticals, so simple pattern matching can be done. I am guessing we will find some holes, but it would not be difficult to model and test.

dominicus commented 8 years ago

Yes, that's what I meant. One hole I can think of is that the cache is also used as the data source when reporting search results. That would require post-processing to pick up the untransformed fields.
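
A rough sketch of one way around that hole: keep the untransformed field in each cache entry and add a folded copy that is used only for matching. This is Python for illustration, not the existing cache format; `CacheEntry`, `build_cache`, and the sample IDs are all assumptions.

```python
import unicodedata
from dataclasses import dataclass


def fold(text: str) -> str:
    """Strip combining diacritical marks and case-fold."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).casefold()


@dataclass(frozen=True)
class CacheEntry:
    piece_id: int
    title: str       # untransformed, used when reporting results
    search_key: str  # folded copy, used only for matching


def build_cache(pieces):
    """Fold each title once, when the cache is built rather than per query."""
    return [CacheEntry(pid, title, fold(title)) for pid, title in pieces]


def search(cache, query):
    """Fold the query the same way, then do plain substring AND-matching."""
    words = fold(query).split()
    return [entry for entry in cache if all(w in entry.search_key for w in words)]


# Made-up IDs for the example.
cache = build_cache([(1, "Für Elise"), (2, "Sonate Pathétique")])
print([entry.title for entry in search(cache, "FUR elise")])
```

The folding cost is paid once per piece at build time, and results are reported from `title`, so the diacriticals survive in the output without any post-processing lookup.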

dominicus commented 8 years ago

@glenl, if the remaining work on this issue is the same as what was raised in https://github.com/MutopiaProject/MutopiaProject/issues/553, are you OK with closing the tracker in MutopiaProject and keeping this one open to document future progress?

MutopiaProject / MutopiaWeb

Diacritical insensitive search #81