Remove _analyzer - Githubissues

rjernst commented 9 years ago

Background

The most important thing about specifying analyzers is the analyzer used at index time needs to be basically the same as the analyzer used at query time. If completely different analyzers were to be used, you would either produce terms that could never be found at query time, or query for terms that could never exist in the index. The API for specifying analyzers on fields does allow to set index and query analyzers separately. This is to allow things like synonyms where you may want to only add synonyms at index time (yields cheaper queries later), or just add them within the query (more flexible since synonyms can be changed dynamically).

`_analyzer` today

Today we have multiple ways to specify which analyzer will be used on text fields. At indexing time, the order to check is as follows:

analyzer for the field
_analyzer (proposed to remove here)
type level default analyzer (will be removed in #8874)
index level default analyzer

_analyzer is a special field in the document which specifies the name of an analyzer to use as the default for that document. This means that the same field for one document can use a completely different analyzer than another document. The typical use case for this is working with documents in many languages, where each document contains a field specifying which language its main data is in (e.g. subject and body fields). Then at query time, either a single query is used with a “magic” analyzer over that field, or a conjunction of queries that use every analyzer the data may have been indexed with.

Problems with `_analyzer`

The typical use case for _analyzer has many problems:

If using the single analyzer at query time approach, the “magic” analyzer is never good enough. It cannot possibly cover all the terms that may have been produced by each languages' analyzer, so some terms are never matchable. For example, “die” in German would be a stop word, while the same word in English is simply a regular word. Either the magic analyzer removes “die” (in which case English documents about dying can not be found) or it includes it, and German documents that contained it can not match (since the indexing process removed that term).
Scoring will be skewed. Text relevance models rely on term statistics to weigh the importance of a term for a given document versus the importance of that term for the entire index. Because some words may analyze to the same term in different languages, the frequencies of terms can be skewed, which will distort where documents matching these words appear in query results.
Mappings code is already complicated, and this feature further complicates following the logic of where Analyzers are set in code.
Having multiple ways to set the analyzer used at index or query time is also confusing on users, as they have to decide which way is “better”.
Proposal

I propose to remove _analyzer, and the associated “analyzer” setting of the match query. Removing this, along with the type level default in #8874 will simplify specifying analyzers considerably. There would be no loss of end functionality, since better results can be achieved with multiple fields.

Alternatives to `_analyzer`

One alternative to dealing with multiple languages is to use n-grams. While this can itself be tricky to deal with, it is worth mentioning.
An alternative that is more directly in line with the current uses of _analyzer is having one field per language. This requires slight modifications to client code for indexing (copy data into the field for the appropriate language, instead of specifying the language in a field) and querying (query the appropriate language field, instead of selecting an analyzer for that language. This will produce better results since documents from other languages cannot accidentally appear in the results, and relevance results should be as expected for that language.

mikemccand commented 9 years ago

+1, I think it's dangerous to allow per-document analyzers.

jpountz commented 9 years ago

+1

clintongormley commented 9 years ago

+1 - the _analyzer field is trappy and produces poor results

dakrone commented 9 years ago

+1, there are sufficient workarounds and _analyzer is not a good way to handle things

bleskes commented 9 years ago

IMHO it's tricky to just remove (as opposed to adding) a feature that enables people to do advanced things (albeit complex and trappy). I agree we should remove this long term (it is trappy and confuses people) but I think we need to do this gradually and allow people to object and educate us if they are using them.

My suggestion here would be - instead of removing the code, deprecate it and make throw an exception when it's being used and allow people to set a settings to re-enable it again. The exception should clearly communicate that we're going to remove this feature and point people to a place to comment (here?). After a period (6m?) we can evaluate what we heard and remove the code (or be surprised and do something else).

jpountz commented 9 years ago

On the contrary I think we should take advantage of this major release to remove it.

Mappings today are very complex, in particular because of corner-cases that we support (per-doc analyzers, optional document type in paths, custom index names, per-type mappings, ...). If we do not take advantage of this major release to clean things up, we will never be able to.

clintongormley commented 9 years ago

@jpountz that's not completely true. We should try to provide a deprecation period where possible, so that upgrading is not a huge obstacle. If we deprecate now, we can remove in the next major version. It's not like we'll never have the chance again.

rmuir commented 9 years ago

I dont see it: if we don't make cleanups necessary to improve things, then why should they upgrade? And in order for them to fix it, they need to rethink how they are indexing their content anyway.

So I don't think its useful to let backwards compatibility drive things for this reason. Because it prevents actual improvements from happening, and while allowing those users to upgrade to the next release without changing things, they don't get any improvements either, so there is little point.

Its just encourages stagnation, prevents change of any significance from ever happening (or delays it so long, that people give up).

rmuir commented 9 years ago

By the way, if the whole point is just to give the user advance warning, then deprecate this crap in 1.5. Problem solved.

mikemccand commented 9 years ago

+1 to deprecate in 1.5 and remove in 2.0.

uboness commented 9 years ago

+1 on deprecation in 1.5 and remove in 2.0

Uri Boness | Founder | Elasticsearch | www.elas (file:///Applications/Sparrow.app/Contents/Resources/www.elasticsearch.com)ticsearch.com (file:///Applications/Sparrow.app/Contents/Resources/www.elasticsearch.com) | +31 20 794 7300

On Friday 16 January 2015 at 18:05, Michael McCandless wrote:

+1 to deprecate in 1.5 and remove in 2.0.

— Reply to this email directly or view it on GitHub (https://github.com/elasticsearch/elasticsearch/issues/9279#issuecomment-70285721).

imotov commented 9 years ago

While I agree that the _analyzer feature in mapping is dangerous and we might be better off without it, I feel kind of iffy about totally removing an ability to specify an alternative analyzer during search time. In other words something like

{
    "match" : {
        "message" : {
            "query" : "this is a test",
            "analyzer" : "my_analyzer"
        }
    }
}

is not going to be possible. Isn't it kind of drastic?

imotov commented 9 years ago

I am worried about scenario when users want to search the same field with and without synonyms for example and then score exact match higher. Maybe we can separate this PR into _analyzer mapper part and query part since we seem to agree on the first?

dakrone commented 9 years ago

Yes, I misread this and thought this was only to remove _analyzer, I think we should split up the issues into one for removing _analyzer and the query part.

rjernst commented 9 years ago

Ok, I will remove that part from the PR.

synhershko commented 9 years ago

Huge -1, unfortunately I'm only seeing this now

Using _analyzer has made my life a lot easier when dealing with multi-lingual content. I've explained an example of possible usage and why field per language may not be sufficient in length here: https://skillsmatter.com/skillscasts/4968-approaches-to-multi-lingual-text-search-with-elasticsearch-and-lucene

To me, for full-text search, this is one key differentiator from Solr. The ability to set an analyzer per-document is so helpful deprecating this will break many existing usages I know of.

Most of the comments against _analyzer are of the "don't shoot yourself in the foot" type. Sure it can be misused to influence scoring the wrong way etc, but as long as you know what you are doing, this feature is so precious.

FWIW, ngrams are NOT an alternative for _analyzer under almost no circumstance. Please see the linked talk above for concrete examples.

I'd vote to marking this an "expert API" and discourage usage instead of complete deprecation, to allow advanced users make the most out of it.

cyclomarc commented 9 years ago

-1. We will no longer be able to process multi-lingual content and that was just the reason to use Elastic Search!

Let me give an example:

We have a multi-tenant content management system mainly used in Europe. The content entered in this system is thus always a combination of french, dutch, english, german, etc. - total of 15 languages.

When a user enters content, we determine the language of that content and when storing the content in ES, we use _analyzer pointing to a property specifying the analyzer to be used. So dutch content is indexed in dutch, french in french, etc.

At query time, we apply a similar scenario. We determine the language in which the user is performing the query and then search and specify the analyzer to be used. This works very well, even for german language. This is also the method used by Wordpress blogs.

Removing _analyzer would mean that we need to create for each content property (we have in total some 20 document types) 15 duplicates (e.g. comment_english, comment_french, comment_german, etc.). This would result in content being spread in a lot of fields and at query time we will then have to modify all our queries to search in each of these 15 fields. You might argue that you only need to search in one field, but that is not correct. If I search for a product name for example, then this name is potentially included in the body text in various language versions and we thus need to search each of these. Scoring will fail as this is no longer based on one field, etc.

This is a serious regression and I have the feeling that this decision is taken without considering typical multi-lingual scenarios that apply to Europe. I agree that for certain content it is better to store in 2 fields and not mix languages, but for content of which you do not know in advance what language it is, one field is the only answer.

Looking forward to suggestions on how to work with multi-lingual content as of version 2.0 - I thought multi-langual content was one of the strong points of Elastic search.

The use of _analyzer is described as best practice in most of the official Elastic Search books. I personally find that this is not a feature that should be removed ....

suprememoocow commented 9 years ago

-1, for all the reasons put forward by @cyclomarc and @synhershko above.

We're using a similar model to the one @cyclomarc described above, additionally supporting Asian and Middle-Eastern languages: we determine the language using cld2 at document insertion time and again at query time and apply the correct analyser. Granted, it was fairly tricky to setup, but it works very well for us.

With many languages and multiple fields for analysis, the field-per-language solution would quickly become difficult to manage and query, especially since we would need to query across all languages fields as some terms such as inline code snippets could match multiple languages.

Hope that this decision is reconsidered, and thanks for an incredible product.

rmuir commented 9 years ago

I don't think mixing analyzers in the same field is a good approach here. Instead, if you need some generic field to handle issues like product names where you are unsure about the language, consider adding a generic 'comment' field indexed in a very simple way, like with standardanalyzer.

Using this approach is easy and flexible, users can search 'comment' across all languages with a simple approach that doesn't use any stemming or other language-specific features, or they can search 'comment_de' to just search german content with german-specific analysis.

But don't mix the output of german analyzer and french analyzer in the same field and think that this is "good". This is an antipattern that will result in conflation of many unrelated terms, more than just what occurs in the language naturally, because operations like stemming will just make it much worse.

Furthermore, discarding the language by shoving it all in one field can really hurt search results. One of the worst things you can do is to return a document in a language the user does not understand. This is never a relevant search result.

I don't honestly know why people want to do this. Maybe its a fear that having a field for each language is somehow more expensive than cramming it all in one field. But for an inverted index, that is just not true.

kimchy commented 9 years ago

I agree, as the one who added this feature (it was recommended in Lucene in Action, which is where people asked for this feature in the early days of ES to begin with), we learned through the years that it leads to bad user experience, and the state of lucene has progressed tremendously as well (standard analyzer for example became significantly better). The way to solve it, as @rmuir and @rjernst suggested, is slightly more work, but results in exponentially better user experience.

synhershko commented 9 years ago

May I recommend an inner implementation that is based on @rmuir and @rjernst 's approach but is still supported by ES's API? e.g. _analyzer will be used for a rule to which field to index based on the field mapping, for instance, as opposed to which analyzer to use for one field

My main concern here is with ES's API and ease of use. The _analyzer field is a clear winner in many installations I've worked with, and I can see a way it can still be around but with different implementation under the hood.

cyclomarc commented 9 years ago

@kimchy At first site this might indeed just look like "add an extra property for each language" and when you search, simply "search in each field". But in reality if you have 20 document types with multilingual content and 15 languages, the extension of the queries is exponential. And thus also the risk on errors and maintenance. Consider also the day that you need to add language variant 16. All indexes and documents have to be rebuild. The same when you want to migrate from 1.4 to 2.0; all data to be reindexed, etc. This is not "slightly" more work ... this is a project on its own requiring a lot of analysis, testing and migration work.

kimchy commented 9 years ago

@cyclomarc here is how I think you can use it relatively simply, you can create an index template where fields matching _de for example use a specific analyzer, and _jp the same, and *_std where standard is used. Then, you only defined it in one place, you control the analysis chain for each language (as sometimes you want stemming, and sometimes you don't).

Then, when you query, you can decide how to query, you can do content_*, *_de, *_std, *_de, to search across multiple fields if you want.

A new language introduced just means updating the index template, and then starting to index the data with the new language, and so on.

@synhershko the suggestion above I think is not too complicated. Doing things internally is hard, like what qualifies as the best language analyzer chain, what happens with languages you don't know, ... . The above is a one time setup of an index template, and updating the client code to put the data in the relevant fields, instead of settings the _analyzer field.

gibrown commented 9 years ago

@kimchy just starting to wrap my head around this, but it seems like the proposed solution requires knowing every language your content is in. Feels like there are some real difficulties in how you manage indices when there are hundreds of languages and tens of analyzed fields. The search quality questions are certainly valid, but in practice we haven't seen problems with them.

We also do a lot of language detection of our content, and I am not sure how failures to correctly detect the language would interact with content being in different fields. So all our queries will probably need to query every field even if we are fairly certain the query is in one language because we may have detected the language wrong for one of the documents we are trying to search. Missing documents completely are much worse than suboptimal ranking. For really great search you need meta data beyond just the text anyways.

Are there examples of real mappings, indexing, and querying that work on tens or a hundred languages? Any performance tests of what happens when your query has to run through 400 analyzers for each separate field rather than just a single analyzer that gets applied to 4 fields?

I agree that the current API is confusing, but not sure just removing features is enough to make things less confusing when it comes to multi-lingual.

I need to go off and try building a different mapping and querying to understand the implications. (probably won't happen very soon)

rmuir commented 9 years ago

As far as tons of languages, there have been some studies on the different approaches you can take (e.g. http://ceur-ws.org/Vol-1171/CLEF2005wn-WebCLEF-MacdonaldEt2005.pdf). Note that there, the best overall approach is to basically do nothing fancy at all, like what StandardAnalyzer is doing :)

I think this is a valid approach for a lot of use cases where many languages are involved. Its one thing that motivated me to work on the icu analysis in lucene: use everything you know from unicode to do the best you can in a consistent way. And since lucene 3.1 StandardAnalyzer has a lot of the smarts that ICU has anyway[1], so I argue the "naive default" in many cases is a good consistent approach. I know for example, hathitrust.org uses this approach with over 200 languages and they have blogs about it you can find.

I am just mentioning this again, before considering super-complicated stuff to handle tons of fields * tons of languages, think about the result quality, sometimes keeping it simple is best.

[1]. Yes, icu still has more choices, like normalization and folding and word segmentation options for asian languages and lots of corner cases like greek/german casing, but in general Standard is pretty good. You can always use the icu plugin if you want those extras.

lvernaillen commented 9 years ago

@kimchy I am trying the alternative you proposed using an index template with separate fields for each language instead of the deprecated _analyzer. You mentioned index template, but do you actually mean using dynamic templates instead of index templates? Can you confirm this?

If you mean dynamic templates, then the elasticsearch guide contains an example of just that: http://www.elastic.co/guide/en/elasticsearch/guide/current/custom-dynamic-mapping.html#dynamic-templates In that example field names ending in _es will use the spanish analyzer. All other supported languages can be added in the same way. That dynamic template can be added to the _default_ mapping so you don't have to repeat it for all types.

You mentioned "A new language introduced just means updating the index template". But since an index template is only applied when an index is created, changing the template after that will not affect already created indexes. You would also need to update the already created indexes. Using dynamic mapping that is possible since they are named to allow for simple merge behavior. A new mapping, just with a new template can be "put" and that template will be added, or if it has the same name, the template will be replaced.

lvernaillen commented 9 years ago

@kimchy Some other issues come up while trying the alternative you proposed. Using wildcards in fieldnames is not always supported...

a simple match query or a more_like_this query don't seem to support wildcards for fieldnames (although it does work in multi_match), so a search for content_* (all languages, as shown below) doesn't return any results while content_en does return a result. This would mean we need to specify the content field for each language which needs updating when a language is added.

"query":{
  "more_like_this": {
    "like_text": "foo",
    "fields": [
      "content_*"
    ],
    "min_term_freq" : 1,
    "min_doc_freq": 1
  }
}

using wildcards in the fields section, specifying which fields to return, is also not posssible. So when searching in all languages you cannot simply ask to return all languages as shown below, you need to specify the content field for each language which needs updating when a language is added.

"fields":[
  "id",
  "content_*"
]

I am aware that you can prevent using content.* to search in all languages by additionally having a content field that uses a simple non-language-specific analyzer (content.std for example). However when querying that field you lose the power stemming and stop-word filters.

malaki12003 commented 8 years ago

-1. We will no longer be able to process multi-lingual content. if I try to replace with having one field per language. This requires slight modifications to client code for indexing (copy data into the field for the appropriate language, instead of specifying the language in a field) and querying (query the appropriate language field, instead of selecting an analyzer for that language. In addition, another cost is that I need to develop a custom serialize/deserialize to have proper java object in my code. Please go back and let to have this feature again.

SimonSteinberger commented 7 years ago

-1 Sorry, removal of the _analyzer option was, IMHO, a very poor decision by the Elastic team. The main reasons basically was "because people sometimes didn't use it right" or "too confusing". Well, how about removing about 90% of all other options well, because many of them can be wrongly used, which causes unexpected search results. Really, this was a severe step backwards.

When we introduce a new language to any of our system, we now need to create a new mapping and re-index the whole thing. That is bad.

jpountz commented 7 years ago

Why do you need to reindex? You could have one index or one field per language?

gibrown commented 7 years ago

@SimonSteinberger in case it helps, here are the analyzers we use for multilingual and how we configure our mappings to use them:

We have 30+ analyzers configured for different languages. All content also goes into a default analyzed field (called default in that code). So for searching in languages that don't have a custom analyzer we just search against that field. We haven't found a reason to adjust the language analyzers (or add more) in years, so just implementing all of these may make things easier for you to expand.

The main downside to look out for is that having so many fields can really explode the total number of fields. We had about 30-50k generated fields, when we added multilingual fields on top of that we started seeing cluster problems when we hit 200k fields in the index and have had to make further adjustments to our indexing (still ongoing work actually). Something to look out for.

elastic / elasticsearch

Remove _analyzer #9279

Background

`_analyzer` today

Problems with `_analyzer`

Proposal

Alternatives to `_analyzer`

elastic / elasticsearch

Remove _analyzer #9279

Background

_analyzer today

Problems with _analyzer

Proposal

Alternatives to _analyzer

`_analyzer` today

Problems with `_analyzer`

Alternatives to `_analyzer`