elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Support DelimitedTermFrequencyTokenFilter #27552

Closed gkinsman closed 6 years ago

gkinsman commented 6 years ago

Feature Request

I'd like to use the DelimitedTermFrequencyTokenFilter that was added in Lucene 7, but it doesn't appear to be available through Elasticsearch's API yet.

AFAIK it just needs to be registered in CommonAnalysisPlugin.java, as the factory and token filter already exist within Lucene. I'd prefer not to have to use a custom plugin for this.
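As a rough sketch (assuming the standard Lucene 7 analysis-common API; the class and variable names below are just for illustration), the filter strips a trailing "|<freq>" off each whitespace token and exposes the number via TermFrequencyAttribute:

import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.DelimitedTermFrequencyTokenFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TermFrequencyAttribute;

public class DelimitedTermFreqDemo {
    public static void main(String[] args) throws Exception {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("tag1|3 tag2|4 tag3|1"));
        // '|' is the default delimiter; an alternate one can be passed as a second argument.
        TokenStream stream = new DelimitedTermFrequencyTokenFilter(tokenizer);

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        TermFrequencyAttribute freq = stream.addAttribute(TermFrequencyAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            // Prints: tag1 -> 3, tag2 -> 4, tag3 -> 1
            System.out.println(term.toString() + " -> " + freq.getTermFrequency());
        }
        stream.end();
        stream.close();
    }
}

IndexWriter then uses the attribute's value as the term frequency for the token, which is exactly the behaviour I'd like to reach through the Elasticsearch analysis API.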

Thanks!

jimczi commented 6 years ago

We discussed this internally during FixIt Friday. This is an expert token filter that needs to be used under specific conditions (a tokenizer that does not split on punctuation, a similarity that takes term frequency into account, ...). For this reason we'd like to understand the use cases this new filter could help with, to see whether we can provide a better integration than a direct mapping. Can you share the details of how you plan to use this new filter?

rpedela commented 6 years ago

I have a use case. Given a document, I want to detect keywords and keyphrases using ML and then use them to improve relevance. I could store them as an array in _source, but then I lose term frequency information. I could index them normally, but that isn't space efficient. I believe, though I could be wrong, that DelimitedTermFrequencyTokenFilter would give the best of both worlds.

gkinsman commented 6 years ago

The use case I'm attempting is a slightly unorthodox use of Elasticsearch, but I believe it's a valid one.

I have tags on a document that are tokenized and indexed, but that also have vote counts. I'm providing a search facility over these tags that uses the vote counts to rank documents with higher counts above other documents carrying the same tag.

I achieved this with Lucene locally by writing my own duplicating token filter and feeding each term through n times, where n is the vote count. I can do something similar with Elasticsearch by pre-filtering the token stream, i.e. duplicating the tokens in the request I send. This obviously won't work well for high values of n, so it's a temporary solution.

In fact, I just used the whitespace tokenizer with Lucene and fed it into the aforementioned token filter - there's no real text analysis required for my usage, as I'm storing and indexing the tags verbatim; I just want the tf-idf ranking.
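A rough sketch of that stuffing step (illustrative only, not my actual code):

class TermStuffer {
    /** Expand "tag|count" pairs into repeated tokens so that the ordinary term
     *  frequency stands in for the vote count. Hypothetical helper, not from this issue. */
    static String stuffTerms(String delimited) {
        StringBuilder out = new StringBuilder();
        for (String pair : delimited.split("\\s+")) {
            String[] parts = pair.split("\\|");   // e.g. "tag1|3" -> ["tag1", "3"]
            int count = Integer.parseInt(parts[1]);
            for (int i = 0; i < count; i++) {
                out.append(parts[0]).append(' '); // "tag1 tag1 tag1 "
            }
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // Prints: tag1 tag1 tag1 tag2 tag2 tag2 tag2 tag3
        System.out.println(stuffTerms("tag1|3 tag2|4 tag3|1"));
    }
}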

Thanks for listening!

nadre commented 6 years ago

Another use case would be adding related terms to a document when the document is short, also known as document expansion.

I created a plugin for now: https://github.com/nadre/elasticsearch-delimited-tf-token-filter

romseygeek commented 6 years ago

cc @elastic/es-search-aggs

softwaredoug commented 6 years ago

One problem this solves is allowing users to store distributed representations of text in a field and use them in ranking. By overwriting the term frequency, you can create arbitrary vector representations and have them used efficiently in ranking.

For example, if you have a 5 dimensional embedding out of word2vec/doc2vec, you might store that in a field:

{
   "text_doc2vec": "0|75 1|1 2|14 3|52 4|12"
}

Here the vector output of doc2vec has been scaled to 0-100.

Now the similarity between this doc and query terms passed through the same word2vec model becomes rather efficient, as it uses the search engine's built-in TF*IDF scoring.

I think Lucene & ES need to do a better job supporting arbitrary vector math like this. I'm finding in my work that clients regularly need this kind of functionality.
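A rough sketch of the encoding step (illustrative only; it assumes the components lie in [-1.0, 1.0] and that the target field runs a whitespace tokenizer plus a delimited term frequency filter):

class EmbeddingEncoder {
    /** Scale each embedding component from [-1, 1] to [0, 100] and serialize it as
     *  "dimension|frequency" pairs, e.g. "0|75 1|1 ...". Hypothetical helper. */
    static String encode(float[] vector) {
        StringBuilder out = new StringBuilder();
        for (int dim = 0; dim < vector.length; dim++) {
            int scaled = Math.round((vector[dim] + 1f) / 2f * 100f); // [-1,1] -> [0,100]
            if (dim > 0) out.append(' ');
            out.append(dim).append('|').append(scaled);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: 0|75 1|1 2|14
        System.out.println(encode(new float[] {0.5f, -0.98f, -0.72f}));
    }
}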

Granted, I don't know if this is the absolute best way to implement such functionality. Perhaps a new field type would be better, that could store multidimensional data using syntactic sugar around this approach. But this is one way I see this used.

gkinsman commented 6 years ago

@softwaredoug that's a great use case, and another example of where this feature could be used to reduce the work done at index/query time while increasing the flexibility of the API.

Imagine if you could write code in your own app to index a field to produce a term frequency map, and then pass that to ES. Would be a game changer!

I'm currently still hacking around the fact that I can't customise term freq by doing term stuffing.

peterdm commented 6 years ago

I have @gkinsman's use case as well (vote counts or impressions).

jpountz commented 6 years ago

For the record, we already merged #30618 which is one way to leverage custom frequencies and is abstracted in its own field type.

@gkinsman What kind of field do you perform term stuffing on? Something like tags?

gkinsman commented 6 years ago

Ah interesting @jpountz, I haven't seen those before - might have to take a look.

I have two fields per doc for the tags. One is the original - a stored text field containing the tag and count, e.g. "tag1|3 tag2|4 tag3|1". The other is a stuffed field - a non-stored text field with IndexOptions.Freqs enabled - which would look like "tag1 tag1 tag1 tag2 tag2 tag2 tag2 tag3". When indexed, the frequency statistics it generates represent the term score.

I should also mention that both fields use just a whitespace analyzer, as the terms are verbatim.
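Roughly, the Lucene-side setup looks like this (a simplified sketch; field names are illustrative, and it assumes a whitespace analyzer on the IndexWriter):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexOptions;

public class TagFields {
    static Document buildTagDocument() {
        FieldType stuffedType = new FieldType();
        stuffedType.setIndexOptions(IndexOptions.DOCS_AND_FREQS); // keep term frequencies, no positions needed
        stuffedType.setTokenized(true);                           // analyzed; the writer's whitespace analyzer splits it
        stuffedType.setStored(false);                             // not stored
        // Norms are left enabled here (the default); setOmitNorms(true) would save some disk.
        stuffedType.freeze();

        Document doc = new Document();
        // Original value, stored for retrieval/display (shown stored-only for brevity).
        doc.add(new StoredField("tags_original", "tag1|3 tag2|4 tag3|1"));
        // Stuffed value: the term frequency of each tag carries its vote count.
        doc.add(new Field("tags_stuffed", "tag1 tag1 tag1 tag2 tag2 tag2 tag2 tag3", stuffedType));
        return doc;
    }
}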

jpountz commented 6 years ago

Out of curiosity, what is the number of unique tags that you have in your index? Is it low-cardinality and conceptually closer to a field (e.g. rating: 5 or views: 12345), or high-cardinality and closer to a value (e.g. tags: { politics: 12, economics: 20, germany: 50, renewable_energies: 35 })?

jpountz commented 6 years ago

Sorry, one more question: do you disable norms on this term-stuffing field or not?

gkinsman commented 6 years ago

The latter - I have high cardinality tags :).

I don't disable norms on the field currently, but it looks like I should, which would save disk.

softwaredoug commented 6 years ago

@jpountz thanks, that looks promising! I'm wondering, though, why I couldn't feed the value I index directly into scoring? It seems I have to apply saturation, log, or sigmoid. I don't think I could implement a dot product with that feature very easily.

To do a dot product, I would need to take the term freq directly and multiply it at query time by a query-time boost.
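In other words, something like this (purely illustrative; not an existing Elasticsearch feature):

import java.util.Map;

class DotProductSketch {
    /** The indexed term frequencies act as one vector and the query-time boosts as the
     *  other, so the score is sum(tf(dim) * boost(dim)). */
    static double dot(Map<String, Integer> termFrequencies, Map<String, Double> queryBoosts) {
        double score = 0;
        for (Map.Entry<String, Double> boost : queryBoosts.entrySet()) {
            // Dimensions missing from the document contribute nothing.
            score += termFrequencies.getOrDefault(boost.getKey(), 0) * boost.getValue();
        }
        return score;
    }

    public static void main(String[] args) {
        // Document vector from the earlier doc2vec example, query vector expressed as boosts.
        Map<String, Integer> doc = Map.of("0", 75, "1", 1, "2", 14, "3", 52, "4", 12);
        Map<String, Double> query = Map.of("0", 0.2, "2", 0.5, "4", 0.3);
        System.out.println(dot(doc, query)); // 75*0.2 + 14*0.5 + 12*0.3 ≈ 25.6
    }
}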

jpountz commented 6 years ago

It was designed this way because such functionality is easier to evolve over time when it has few, constrained parameters and its use-case is well understood. I do agree that it doesn't address all use-cases. @gkinsman needs a way to handle a feature vector with arbitrary (and potentially many) dimensions (the tags), and I have heard about similar use-cases in the past, so I think we should do something there as well.

For instance, @jimczi mentioned to me that in order to boost transactions in e-commerce, it is possible to attach the search keywords that led to a transaction to documents and to store the likelihood of a transaction in a custom term frequency. Searching this field in addition to the usual designation/description fields is then expected to help relevance and conversion rate.

I am hopeful that adding something like the new feature field, but accepting a hash, e.g. features: { dim0: 20, dim1: 50, dim2: 12 }, rather than a single value would address both use-cases.

Once you have your dot product, what would you do with it? Just sum it with the BM25 score?

softwaredoug commented 6 years ago

@jpountz

Some of this is a tangent, probably good for another issue related to working with embeddings or other vector representations in the search engine:

On what I'd do with it: certainly use it as a feature for learning to rank. Embeddings are often fed as input to models. I'm sort of inspired by what Vespa does, giving a very math-looking interface that takes arbitrary tensors or other values as input and implements a direct scoring function. In part they do this because they can't anticipate whatever exotic vector-based model you'll want to use. I'm not sure I'd advocate Elasticsearch go that far in that direction.

For more manual cases, certainly a summation would be useful. Honestly, it could be hard to anticipate completely. In a way it's a different kind of text similarity, comparable to a TF*IDF similarity. In spirit, it's a bit like analyzing the text/query with a 'vector' representation coming out as the output, and a 'similarity' that does a dot product. Lucene has added OpenNLP analyzers that work with models to do POS tagging, etc. in the analysis chain. Perhaps a token filter that loads a word2vec model and outputs a vector representation of the terms/documents somehow?

The problem is that word2vec models are basically a giant matrix mapping every word/document to its corresponding vector representation - very memory-intensive to keep in the search engine. Perhaps there are alternative embedding methods that are more memory efficient?

Alternatively, or in addition, I wonder if an explicit mapping type that took a vector would be useful. In your example you use object notation, but could you also do just a list:

"features": [20, 50, 12]

Or, if you truly are looking to support vectors ranging over [-1.0, 1.0], then hide the details in Elasticsearch (whether it's term freq or a multi-valued numerical field):

"features": [-0.2, 0.8, -0.6]

Then a "feature" query (or whatever other 'dot product' query) could do the rest

Finally, it's important to recognize that these vectors could represent more than text; they could be:

So a general solution that lets you store and compute vector similarities could be useful for a lot of interesting 'search' applications.

gibrown commented 6 years ago

There are two end use cases that immediately come to mind.

1) Popularity ranking of content

We are often trying to boost content based on some metric of popularity. Often it is a list of users who have liked or commented on something. By adding the user ids as a list of terms we get some weighting, and IDF somewhat corrects for users who like tons of content. Controlling the TF, though, would allow us to give extra weight to particular users who tend to do a good job of recommending content within a specific category or on certain sites. We use this sort of ranking for search, related posts, and content recommendations.

We've also looked at classifying sites by topic - something like the Chinese Restaurant Process to group similar sites together. Each site could then be weighted by how much it belongs to a particular topic.

2) Applying additional NLP processing to search content, in particular searching for domain names

I recently convinced our team to move from developing on Neo4J to ES, but it was not a simple mapping because of the NLP they were applying. In particular they were doing:

We had to make a number of compromises to get it to work in ES. Got it working, but it was not easy to line it up with how data scientists really think about such problems.

Also, if you want my very open source, opinionated reason for wanting this for the long term:

While we can and should talk about filter bubbles and the impact that these algorithms have on the world, a world where only monopolistic tech giants can deploy these algorithms is not one where publishing is democratic.

-- Me

Ultimately, Lucene is tuned to process term frequencies very quickly and efficiently. Controlling those term frequencies provides a lot of flexibility and speed compared to the other workarounds.

jpountz commented 6 years ago

FYI I submitted a proposal for a new field type called feature_vector at #31102, which should address @gkinsman's use-case and one of @gibrown's: the ability to weight by tag/topic.

rpedela commented 6 years ago

@jpountz Is there an advantage or difference between using feature_vector instead of function_score?

jpountz commented 6 years ago

Yes, feature_vector is expected to perform better at computing the top-matches of a query than the same thing implemented via function_score.

Scoring functions are black boxes to function_score, so it has no choice but to look at all matches and compute their scores in order to find the top-k. On the other hand, these new feature and feature_vector fields are designed to integrate well with Lucene 8.0's implementation of block-max WAND, an algorithm that skips matches which cannot produce competitive scores.

The basic idea is that if you search for a OR b, the score is the sum of the score contributions of a and b. So if the maximum score contribution of a is, say, 2.5, and the k-th best hit that you have collected so far already scores 2.5 or more, then a document that matches only a can no longer be competitive. You won't miss any competitive hits if you dynamically require b, which is much more efficient since you can skip documents that don't match b.

What these new feature/feature_vector fields do is index features into the term frequency of some terms, so that boosting by a feature can be implemented under the hood by adding a term query for that feature as a SHOULD clause, just with a different similarity (the thing that computes scores). Block-max WAND can then leverage it as it would any other term query.
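As a toy illustration of that pruning test (not Lucene's actual implementation; names are made up):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class WandSketch {
    /** maxContribution: the maximum possible score contribution ("impact") of each term.
     *  minCompetitiveScore: the score of the k-th best hit collected so far.
     *  Returns the terms that any still-competitive document must match. */
    static Set<String> requiredTerms(Map<String, Float> maxContribution, float minCompetitiveScore) {
        float total = 0;
        for (float m : maxContribution.values()) {
            total += m; // a document matching every term scores at most `total`
        }
        Set<String> required = new HashSet<>();
        for (Map.Entry<String, Float> e : maxContribution.entrySet()) {
            // A document that skips this term is bounded by the other terms' maxima.
            float boundWithoutTerm = total - e.getValue();
            // If that bound cannot beat the current threshold, every competitive document
            // must match this term, so non-matching documents can be skipped entirely.
            if (boundWithoutTerm <= minCompetitiveScore) {
                required.add(e.getKey());
            }
        }
        return required;
    }

    public static void main(String[] args) {
        // Example from above: a's maximum contribution is 2.5; once the k-th best hit
        // scores 2.5 or more, b becomes required.
        Map<String, Float> impacts = new HashMap<>();
        impacts.put("a", 2.5f);
        impacts.put("b", 4.0f);
        System.out.println(requiredTerms(impacts, 2.5f)); // prints [b]
    }
}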

jpountz commented 6 years ago

Closing in favor of the upcoming feature_vector (#31102) and feature (#30618) fields.