collectiveaccess / providence

Cataloguing and data/media management application
GNU General Public License v3.0

Is SqlSearch using a locale tokenizer? #421

Closed · miceno closed this 4 years ago

miceno commented 4 years ago

Hi

I was wondering to what extent the tokenizer of the SqlSearch plugin takes locales into account.

I mean, if bundles are localized (for instance, in Spanish), does it take the content and tokenize it in Spanish?

I had a look at search_indexing.conf but I didn't see the configuration for it.

collectiveaccess commented 4 years ago

At the moment the only localization relates to dates: it'll set the current locale for the date/time parser, allowing it to accept localized date expressions. Everything else is the same regardless of locale, including bits like the stemmer that could benefit from localization.

miceno commented 4 years ago

I am trying to understand what parts need to be rewritten in order to allow Spanish or Catalan indexing.

As of now, I think that for a bundle with a Spanish locale, the SqlSearch module will try to index, stem and tokenize it as English text. Am I right?

I found the Snowball stemmer (SnoballStemmer; it is wrongly spelled, isn't it?).

It first tries to use the stem PECL module, so there would be some support for stemming other languages when that module is available.

In my case I am planning to run it on shared hosting, so installing the stem module is not an option for now.

What about trying to plug in a PHP implementation of language stemmers, using part of the stemmer at https://github.com/wamania/php-stemmer?
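Something along these lines is what I have in mind (just a sketch; it assumes that library's StemmerFactory API, and the exact class names depend on the version of the library). getStemmerForLocale() is a hypothetical helper, not existing CA code:

```php
<?php
// Hypothetical sketch (not CA code): pick a Snowball stemmer for a CA locale
// using wamania/php-stemmer. Assumes its StemmerFactory API; exact class
// names differ between versions of the library.
require 'vendor/autoload.php';

use Wamania\Snowball\StemmerFactory;

function getStemmerForLocale(string $locale)
{
    $lang = strtolower(substr($locale, 0, 2));   // "es_ES" -> "es", "ca_ES" -> "ca"
    try {
        return StemmerFactory::create($lang);
    } catch (\Exception $e) {
        return StemmerFactory::create('en');     // fall back to English rules
    }
}

$stemmer = getStemmerForLocale('es_ES');
echo $stemmer->stem('bibliotecas');              // a Spanish stem, e.g. "bibliotec"
```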

Once the stemmer is language dependent, what other components of SqlSearch should be adapted?

collectiveaccess commented 4 years ago

That's correct. All stemming will be done using English rules.

A long time ago (like 10+ years ago) we considered fully building out the search's ability to adjust for locale. The Snowball stemmer was one part of the equation (and yes, the class name "SnoballStemmer" is misspelled; "Snowball" is the correct name). The fingerprint system was another. The fingerprint code works by looking for certain patterns in character usage that can reliably distinguish one language from another. With enough text it does a decent job of distinguishing German from Hungarian (for example). You can't really assume that just because a user's locale is Spanish all of their searches are going to be in Spanish. With fingerprinting, you can at least try to characterize the user's input. There must be better algorithms for detecting the language of a text sample these days, but it's something I've not researched in the past decade.
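For what it's worth, the basic idea looks something like this (a simplified illustration, not the actual LanguageDetection code): build character n-gram frequency profiles from sample text for each language and pick the language whose profile is closest to the input's.

```php
<?php
// Illustrative sketch of n-gram fingerprinting; not the CA LanguageDetection code.

// Build a normalized trigram frequency profile for a text sample.
function ngramProfile(string $text, int $n = 3): array {
    $text = mb_strtolower(preg_replace('/[^\p{L}]+/u', ' ', $text));
    $counts = [];
    $len = mb_strlen($text);
    for ($i = 0; $i <= $len - $n; $i++) {
        $gram = mb_substr($text, $i, $n);
        $counts[$gram] = ($counts[$gram] ?? 0) + 1;
    }
    $total = max(1, array_sum($counts));
    return array_map(fn($c) => $c / $total, $counts);
}

// Distance between a query profile and a language profile (lower = closer).
function profileDistance(array $query, array $lang): float {
    $d = 0.0;
    foreach ($query as $gram => $freq) {
        $d += abs($freq - ($lang[$gram] ?? 0.0));
    }
    return $d;
}

// In practice the per-language profiles would be precomputed from large corpora.
$profiles = [
    'en_US' => ngramProfile('the quick brown fox jumps over the lazy dog ...'),
    'es_ES' => ngramProfile('el rápido zorro marrón salta sobre el perro perezoso ...'),
];
$queryProfile = ngramProfile('catálogo de fotografías antiguas');
$distances = array_map(fn($p) => profileDistance($queryProfile, $p), $profiles);
asort($distances);
echo array_key_first($distances); // best guess; unreliable for very short input
```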

As you can tell from delving into SqlSearch, it's a real maze. Like many things, it started as a much simpler class and kept growing as demands for functionality came in. I don't think grafting a stemmer onto it as-is is a great idea. Reworking the code to be readable, well structured and fully localizable (it is none of that currently), even if it uses the same index table structure, would be helpful. I'm not sure if it's something you're interested in working on though. I would be interested in working on it now as a collaboration, but not as a solo project.

The table structure for the index could be improved as well, but the need for a SQL-based search engine is real. Requiring a zero-configuration option for search is one of the handful of assumptions we made when we started back in 2003 that still holds today. ElasticSearch is a great option, but so many of our users cannot practically run it or support it.

miceno commented 4 years ago

I realized the LanguageDetection code is there but it is dead code, since no live code refers to it. The same is true for the StemmingFilter.

I would like to give a try to adding locale support to SqlSearch, but I am a little bit scared about the size of the code.

I will need some guidance, and I am not sure what you mean by "collaboration". I would appreciate it if you could guide me on how to approach it; I don't feel comfortable rewriting pieces of code I don't properly understand just by reverse engineering them.

I plan to do it step by step: first add locale support minimally, then try to refactor code.

The stages I was planning are:

  1. Add minimal support for indexer localization.

I was planning to add locale support to SqlSearch, and I think I found a way to do it in the getWordID function.

I am not an expert in NL processing, so take this as just an intuition: we need a language-dependent stemmer to make sure the values in the ca_sql_search_words.stem column are properly inserted. To get the language, I will need to use the locale of the attribute or the current global locale.

And I think we also need to store the language in ca_sql_search_words.locale_id, but I am not quite sure about this (see the next stage), because I have not yet reviewed the search code thoroughly.

  2. Add localization support for searching. This point is not clear to me yet, so I would appreciate some guidance here too.

  3. Refactor. I read part of the code and there are at least opportunities to refactor common and repeated code, but beyond that it is not clear to me how to give the code another structure. In any case it requires robust testing to make sure I do not break or remove important parts. I am not sure whether there are tests that cover all the cases; I will run coverage again and review the coverage of that area.

collectiveaccess commented 4 years ago

By collaboration, I mean work together to create a new search engine plugin that supports localization. I don't think it's a good idea to try to expand the existing SqlSearch code any longer. Like many things in CA, it started out simply and grew "organically" as new user requirements were heaped on it. It's a real maze.

As you know, CA already supports engine plugins, so this new engine would just be another plugin. The old engine can remain for as long as people want to use it.

Modifying the indexer is ok, but care will have to be taken to not break ElasticSearch indexing.

Locale detection for indexing is indeed just a matter of looking at the locale_id of the content being indexed, using the correct stemmer and stashing the locale value in the word entry. A unique word is the word text + the locale. Eg. "radio" in en_US is different from "radio" in es_ES. Nothing difficult here.
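Something along these lines would do it (a hypothetical sketch, not the actual SqlSearch getWordID(); it assumes a unique (word, locale_id) key on ca_sql_search_words and a per-locale stemmer, as discussed above):

```php
<?php
// Hypothetical sketch of a locale-aware word lookup; the real SqlSearch code
// differs. Assumes a unique (word, locale_id) key on ca_sql_search_words and
// a $stemmer already chosen for the content's locale.
function getWordIDForLocale(PDO $db, string $word, int $locale_id, $stemmer): int
{
    $stem = $stemmer->stem(mb_strtolower($word));

    // One word entry per (word text, locale) pair: "radio"/en_US != "radio"/es_ES
    $sel = $db->prepare(
        'SELECT word_id FROM ca_sql_search_words WHERE word = ? AND locale_id = ?'
    );
    $sel->execute([$word, $locale_id]);
    if ($word_id = $sel->fetchColumn()) {
        return (int)$word_id;
    }

    $ins = $db->prepare(
        'INSERT INTO ca_sql_search_words (word, stem, locale_id) VALUES (?, ?, ?)'
    );
    $ins->execute([$word, $stem, $locale_id]);
    return (int)$db->lastInsertId();
}
```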

The language detection code was intended for detection of language in searches, where the user would (presumably) not want to specify the language every time they search. You could assume the language based upon the user's locale, of course, but in most multilingual systems the user's locale is not an accurate predictor of the language they're searching in.

The problem with the language fingerprinting code is that it often requires a fair amount of text to accurately determine the language being used, more than what most users would enter for a search. There are other ways to infer the language used, but they all have trade offs. I'd be happy to further discuss them.

Hopefully you now understand why we just put this to the side and never looked back :-) Most of our users (at least the ones we speak to) are working in a single language (not necessarily English), where none of this poses a big problem. Still, I'd welcome the opportunity to implement a more maintainable search engine.

miceno commented 4 years ago

So I had a more thorough look at the code and I think there are these approaches:

  1. write a SqlSearchRevamped plugin, rewriting the code from scratch but keeping the architecture of the SqlSearch plugin. Is there any other source of information about the current implementation, so I can get further in evaluating this approach? I read that the current SqlSearch is an improvement over MySQL full-text search, but the reason is not clear to me (probably multilingual support?), since I was planning to consider simplifying the search by just delegating to MySQL the responsibility for indexing and searching.
  2. write a fully new search plugin from scratch with new technology. I am considering a Lucene plugin using Zend Framework. I am not quite sure why there is no such plugin yet: ElasticSearch is Lucene in the end, and a full PHP Lucene implementation is available in Zend Framework (version 1.12, but not in 2.0 nor in Laminas). My intuition is that maintaining the indexes might not be trivial, and performance will probably also be an issue in this case.

I think you are suggesting the first one, aren't you?

collectiveaccess commented 4 years ago

The old MySQL fulltext search plugin used type=FULLTEXT indices to do general searching, and SQL queries to do field-specific searching. The issue with it was that MySQL fulltext search was inflexible (it pretty much gave you what it wanted to give you, and there weren't many parameters to control) and performance wasn't very good. I don't know how you'd localize it. And I don't know how you'd effectively deal with all of the odd but important cases that make the current SqlSearch so much fun to work with :-/ Maybe that has changed in the past ~8-10 years, but I don't believe it has.

There is/was a PHP Lucene plugin in Zend, and that's what was used for the very first search engine for CA. The idea was that the same syntax could be used for both a zero-config option (PHP Lucene) and an "advanced" option (SOLR at the time). It didn't work out too well, as PHP Lucene provided absolutely terrible, unusable performance, and in the end we had to drop it. We decided to keep the parser though – it wasn't something we wanted to write ourselves if we could avoid it.

I'm suggesting that we keep the current indexing format (ca_sql_search_words and ca_sql_search_word_index) more or less, but do a rewrite from scratch that is better structured, testable and fully localizable.

SOLR was dropped some time later as ElasticSearch saw more uptake.

miceno commented 4 years ago

Thank you for the details!

I fully agree that keeping Lucene query syntax is a must, it is very expressive.

So I will try to write the new plugin from scratch; I will create a dev branch for it and ask you for advice on it.

Before starting with it, I will make one last attempt at adding another stemmer and storing the language in the database (I think it requires only a little effort), and then I will move on to your suggested approach.

collectiveaccess commented 4 years ago

If you want to mess with the existing plugin for a bit, go for it. Perhaps I'm wrong, and there's a way to do it without too much effort.

If you would prefer that I start a new plugin, I can work on it in the next 1-2 weeks. I know this is all a lot of old, new-to-you code to digest.

miceno commented 4 years ago

Thank you!

I am working on adding locale support on top of the current code, and I have managed to store word_ids with their locale during indexing. I am now on the search stage and expect to have it working soon. I am also trying to add some tests to make sure the code works as expected with the new changes.

I would appreciate it if you would start the new plugin; then I can take it from there. As you said, it is a lot of code and I don't understand most of it :(

I would like to ask you about the N-gram command line tool. I had a look at it and it is not using the language detector's N-gram method, LanguageDetection::createNGrams, but a much simpler helper, caNgrams() in utilityHelpers.php. What is the rationale for preferring one over the other?

collectiveaccess commented 4 years ago

N-grams were meant to be used for a "did you mean" feature and fuzzy matching of words. The utilityHelpers function is just a much simpler way to generate them. None of it is really used in the current implementation (I just stripped out the old "suggest"/"did you mean" functionality because it needs a re-think).

miceno commented 4 years ago

> Locale detection for indexing is indeed just a matter of looking at the locale_id of the content being indexed, using the correct stemmer and stashing the locale value in the word entry. A unique word is the word text + the locale. Eg. "radio" in en_US is different from "radio" in es_ES. Nothing difficult here.

In order to have a unique index on ca_sql_search_words covering word and locale, as you suggested, we have two options:

  1. concatenate word + locale and store it in word, and in addition store the locale in the locale field.
  2. store word and locale in separate fields (as it is now), and change the unique index on word to a unique index on word + locale.

I suggest going with the second. So I need to modify the schema, and it is not clear to me how to do that, I mean, which files to update. I think I need to create a migration file, but I am not quite sure how to go about it.

Regards.

collectiveaccess commented 4 years ago

Separate fields for word and locale is the way to go, for sure.

Creating a migration is a three-step manual process:

  1. Determine the migration number for the new migration. This will be the current migration number defined in app/version.php + 1. Change the value in app/version.php to the new number.
  2. Modify install/inc/schema_mysql.sql, taking care to increment the migration number at the bottom of the file.
  3. Create a file in support/sql/migrations named with the migration number plus .sql, e.g. 164.sql. Any SQL statements put in this file will be run for the migration. For adding/modifying/deleting fields from existing tables these will be ALTER statements, of course. Make sure that the last line in the file sets the current migration number in ca_schema_updates. The easiest way to do this is to copy and modify the previous migration.

miceno commented 4 years ago

Thank you!

I am on it now, but I also need a Catalan language stemmer, so I am delayed until they accept my PR to the library I selected (wamania/php-stemmer).

In the meantime, I managed to make the indexer aware of the locale of the bundle.

On the search side, I took a not-so-elegant approach: I stem the terms in every available cataloguing language and then query on the resulting stems, since language detection is not an option when searches might be just one word.
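Roughly like this (an illustrative sketch; getStemmerForLocale() is the same hypothetical helper as in my earlier sketch, not existing CA code):

```php
<?php
// Illustrative sketch: expand a search term into stems for every cataloguing
// locale, since the language of a one-word query cannot be detected reliably.
// getStemmerForLocale() is the hypothetical helper from the earlier sketch.
function stemsForAllLocales(string $term, array $catalogueLocales): array
{
    $stems = [];
    foreach ($catalogueLocales as $locale) {
        $stemmer = getStemmerForLocale($locale);
        $stems[] = $stemmer->stem(mb_strtolower($term));
    }
    return array_unique($stems);
}

// The resulting stems would then be OR'ed together when querying
// ca_sql_search_words, e.g. WHERE stem IN (...)
$stems = stemsForAllLocales('bibliotecas', ['en_US', 'es_ES', 'ca_ES']);
```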

In the meantime I also tested the LanguageDetector, and it does a pretty good job; I benchmarked it against a couple of other detectors and CA's won :)

I realized that LanguageDetection works better if you narrow it to a limited set of languages (of course), so if we decide to activate it again, we should probably consider using only the cataloguing languages.

collectiveaccess commented 4 years ago

Stemming in all locales that are available for cataloguing isn't terrible. Most systems will have only 1-3 cataloguing locales (I don't think I've ever seen one with more than 3). Are you creating a separate word entry for each stem? Or creating multiple stems for a single word entry? Using multiple word entries would make the index much bigger, and it's already too big in many cases :-/

I agree that language detection, if used, would be only for cataloguing languages. There's really no reason I can see to not limit it in that way.

miceno commented 4 years ago

A new PR is available for this: #441

miceno commented 4 years ago

Giving up