elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Remove ability to gather index statistics in scripts (aka IndexLookup) #19359

Closed brwe closed 7 years ago

brwe commented 8 years ago

IndexLookup was added to have the ability to access term statistics from scripts. However, it has some limitations and it is not clear whether the effort to maintain the code is worth it (see also https://github.com/elastic/elasticsearch/pull/19334/files/7df2d06eb62dbbc5cc365153088993d70801c286#r70103016). I'll add the pros and cons as far as I know them here:

pros:

I'd be a little sad to see it go because it was useful for some of my experiments. But if I am the only one then so be it. I'd also be willing to work on a replacement.

clintongormley commented 8 years ago

Discussed in FixItFriday. Given that:

... we're in favour of removing this functionality.

brwe commented 8 years ago

> it could be replaced by a similarity (or possibly a native Java script)

When I wrote scripts I meant native scripts as well. I agree we can use a custom similarity instead of a script for scoring, but in all the other places where we can use scripts I do not see any way to, for example, access term frequencies, even within native scripts. I used IndexLookup there as well: for example, to get the term frequencies for text classification with naive Bayes in a native script, and also to extract the term frequencies for a given list of terms to feed that information to an external tool.
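To make the naive Bayes use case concrete, here is a minimal self-contained sketch of the combination step. All names and numbers are hypothetical and no Elasticsearch or Lucene APIs are involved; it only shows how per-document term frequencies (the data IndexLookup supplied) combine with trained per-class parameters:

```java
import java.util.Map;

// Hypothetical sketch: scoring one document under a trained multinomial
// naive Bayes model, given the per-document term frequencies that a
// lookup like IndexLookup would supply to a native script.
public class NaiveBayesSketch {
    // log P(class) plus, for each term in the document,
    // tf * log P(term | class) from the trained model
    static double logScore(double classLogPrior,
                           Map<String, Integer> termFreqs,
                           Map<String, Double> termLogLikelihoods) {
        double score = classLogPrior;
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Double logP = termLogLikelihoods.get(e.getKey());
            if (logP != null) {
                score += e.getValue() * logP;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Integer> tf = Map.of("spam", 3, "hello", 1);
        Map<String, Double> logLik = Map.of("spam", -0.5, "hello", -2.0);
        // log(0.4) + 3*(-0.5) + 1*(-2.0)
        System.out.println(logScore(Math.log(0.4), tf, logLik));
    }
}
```

Classification then just picks the class with the highest `logScore` over the same term frequencies.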

> when used within a script field and scroll it will be terribly slow because all the caching mechanisms to get for example document frequency are currently broken, see #17110

To clarify, #17110 applies to all lookups we have right now and is not an inherent problem of IndexLookup alone, although I think it hurts IndexLookup most. There is one problem specific to IndexLookup that makes it slower than it was in 1.x even when not used within scan and scroll: IndexLookup currently computes document frequencies over the whole shard once per segment, rather than once per shard as in 1.x. While this decreases performance on 2.x, it is not horrible and it is not impossible to fix.

Again, I'd be willing to work on fixing the current problems or on an alternative implementation to IndexLookup but I would first like to hear why people think it is so important to remove it. Also, if there is a way to replace this functionality right now with a native script please point me to it, I might just be missing it.

rmuir commented 8 years ago

If we really want to use scripting to implement similarity, IMO the simplest approach would be:

```java
public class ExpressionSimilarity extends SimilarityBase {
   public ExpressionSimilarity(String expression) { ... }
   ...
}
```

It could just expose the simplified statistics of SimilarityBase via the ValueSource api, and should really be as fast as e.g. DFRSimilarity.

IMO it doesn't necessarily make Similarity any easier, because the true challenges are elsewhere: it's really hard to get this stuff right. Many published algorithms struggle with it: especially with ES not removing stopwords by default, actual term distributions are very unlike the expected ones and can cause problems in some cases. Even the classic BM25 IDF had an issue with this, and it's not the only instance.

The code comments in DFR basic models give examples of problems:

```java
 * WARNING: for terms that do not meet the expected random distribution
 * (e.g. stopwords), this model may give poor performance, such as
 * abnormally high or NaN scores for low tf values.
```

When possible we applied simple adjustments (e.g. BM25, DFR geometric model), but in other cases it is not obvious how or if things can even be fixed mathematically :) It is very expert stuff.

s1monw commented 8 years ago

++ to a scripted similarity (must be a fast script ie. expressions)

clintongormley commented 8 years ago

@rmuir that works for the scoring aspect. @brwe is also using these scripts to export term freq statistics for ML purposes. That was the original purpose of this feature. Any ideas how this part could be better implemented?

rmuir commented 8 years ago

The only thing i can suggest is to see if we can avoid mixing all these concerns in one api. It sounds like we have our current scripting api trying to tackle 3 different concerns:

"exporting frequencies for ML purposes" isn't concrete enough for my understanding to know what would be needed (e.g. would it just be a proper per-term api or something more complicated? would it go out to all shards and really return the correct numbers?), but it would be great to separate this too.

I am suspicious of generic apis that claim to solve all ML problems. Let's at least separate this concern from complex issues like term weighting and script-based ranking and fields, because it's nothing like those two things, and those two things are nothing like each other :)

brwe commented 8 years ago

> "exporting frequencies for ML purposes" isn't concrete enough for my understanding to know what would be needed (e.g. would it just be a proper per-term api or something more complicated?

Here is what I did:

In the export I used IndexLookup to get the frequencies for training documents. In the script execution I used the term frequencies for new documents to execute the classification. I hope that makes a little more sense. Not sure in how much detail I have to get here.

BTW, this was a little bit of a hassle, because in a way this is an intersection of the terms in a document with the list of terms I was interested in, so I tried a mixture of fielddata, term vectors and IndexLookup. Conceptually it seemed to me a problem similar to search, just that the number of terms is huge and the "score" that comes out can be several numbers as well. But I never got to convert that vague feeling into code.

rmuir commented 8 years ago

I don't understand how this really scales. If the term is common it may need to transfer millions or billions of things across the wire.

brwe commented 8 years ago

It would not be slower than scroll in general, I think. Or what do you mean?

rmuir commented 8 years ago

It doesn't lend itself to a top-N approach in any way, hence i think it makes sense to try to divorce it from the search API.

s1monw commented 7 years ago

so I think we are on the right path here, IMO what we are missing is:

lemme explain the latter: from my perspective, the last remaining purpose of IndexLookup was to have full control over the linear combination of the individual term scores. If you think about it, it's really a way to change what BooleanScorer does, but with added flexibility. I think if we add a dedicated query that calls a script to calculate the linear combination of the subqueries' scores, we are at the point where @brwe has everything she needs. The input of the script can simply be a double array holding all scores of the subqueries, where the index is the ordinal in which they are specified.
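A minimal sketch of what such a combination script could look like (all names here are hypothetical; this only illustrates the "double array of subquery scores in, one combined score out" contract being proposed):

```java
// Hypothetical "combine script": given the subquery scores in the order
// the clauses were specified, return the combined document score.
public class ScoreCombinerSketch {
    static double combine(double[] weights, double[] subQueryScores) {
        double combined = 0.0;
        for (int i = 0; i < subQueryScores.length; i++) {
            combined += weights[i] * subQueryScores[i];
        }
        return combined;
    }

    public static void main(String[] args) {
        // a weighted linear combination: one thing such a script could
        // express beyond the plain sum that BooleanScorer effectively does
        double score = combine(new double[] {0.7, 0.2, 0.1},
                               new double[] {1.0, 2.0, 3.0});
        System.out.println(score); // ~1.4
    }
}
```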

@brwe @rmuir does this make sense to you?

rmuir commented 7 years ago

Why does a user need to change how scores are combined? Mathematically, what else makes sense? Users often ask for multiply because they think they want that, but that is supported via a sum of logs.
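The "sum of logs" point can be checked numerically: since log is strictly increasing and log(a·b·c) = log a + log b + log c, ranking by a product of positive per-clause scores is identical to ranking by the sum of their logs. A small self-contained check:

```java
public class SumOfLogs {
    // true when log(product of scores) equals the sum of their logs
    // within floating point tolerance
    static boolean productMatchesSumOfLogs(double... scores) {
        double product = 1.0;
        double sumOfLogs = 0.0;
        for (double s : scores) {
            product *= s;
            sumOfLogs += Math.log(s);
        }
        return Math.abs(Math.log(product) - sumOfLogs) < 1e-9;
    }

    public static void main(String[] args) {
        // log is monotone, so "multiply the scores" and "sum the logs"
        // produce the same document ordering
        System.out.println(productMatchesSumOfLogs(2.5, 0.8, 1.7)); // true
    }
}
```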

s1monw commented 7 years ago

I think, and please correct me if I am wrong @brwe, that the context here is for instance something like logistic regression, where your score is the result of some sigmoid function (I know negative scores might be a problem here, but there is a way to model that). The result of this could be used in aggregations to see how many docs fall into a certain class. I am really just making things up, but I can totally see the value for experiments like this.
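For illustration, a minimal self-contained sketch of the logistic regression scoring being described (weights, features and bias are all made-up numbers; the point is that the sigmoid keeps the "score" in (0, 1), sidestepping the negative-score problem):

```java
public class LogisticRegressionSketch {
    // sigmoid squashes the linear model's raw output into (0, 1)
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // score = sigmoid(bias + w . x), usable as a non-negative score
    static double predict(double[] weights, double[] features, double bias) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * features[i];
        }
        return sigmoid(z);
    }

    public static void main(String[] args) {
        // z = 0.1 + 0.8*2.0 - 1.2*1.0 = 0.5
        double p = predict(new double[] {0.8, -1.2}, new double[] {2.0, 1.0}, 0.1);
        System.out.println(p); // probability in (0, 1); > 0.5 means class "1"
    }
}
```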

rmuir commented 7 years ago

As I stated above, I don't think we should piggyback such things on our script search api... sorry, it's not search to me at all.

s1monw commented 7 years ago

> As I stated above, I don't think we should piggyback such things on our script search api... sorry, it's not search to me at all.

I think you misunderstood my comments. I am suggesting we remove IndexLookup and extend ScriptQuery to also score. Since we already have a scorer on LeafSearchScript, the script can just access it, call Scorer#getChildren and do whatever it wants. All we need to guarantee is that the iteration order of this method is consistent.

rmuir commented 7 years ago

I sense my concerns may not have been communicated very well either.

Scoring APIs (in both lucene and elasticsearch) are a sensitive part of the system, a real hotspot. Can we please not try to wedge things like document classification into the scoring system? It has a difficult job already: serious challenges especially in ES (e.g. very slow scripting language as the default).

If we want to support document classification, let's do that with some other API. I don't think it makes sense to use the search datastructures, algorithms, and apis for this problem; but if we do want to do that, let's please make it an implementation detail and expose a very simple api to the user.

As far as things being "flexible", it scares me to hear that term. Please, none of this is personal to any developers, but this is like a raging infection in the elasticsearch codebase. There are far too many overcomplicated abstractions in this codebase, and despite that fact, a significant portion of our developers' time is spent on refactoring. In my opinion, those abstractions are not helping; instead they just make things complicated.

I really like how the linux kernel documentation explains the problem [1] (you kinda gotta translate the gist of this from C to Java or REST api or whatever is relevant):

* Abstraction layers

Computer Science professors teach students to make extensive use of
abstraction layers in the name of flexibility and information hiding.
Certainly the kernel makes extensive use of abstraction; no project
involving several million lines of code could do otherwise and survive.
But experience has shown that excessive or premature abstraction can be
just as harmful as premature optimization.  Abstraction should be used to
the level required and no further.

At a simple level, consider a function which has an argument which is
always passed as zero by all callers.  One could retain that argument just
in case somebody eventually needs to use the extra flexibility that it
provides.  By that time, though, chances are good that the code which
implements this extra argument has been broken in some subtle way which was
never noticed - because it has never been used.  Or, when the need for
extra flexibility arises, it does not do so in a way which matches the
programmer's early expectation.  Kernel developers will routinely submit
patches to remove unused arguments; they should, in general, not be added
in the first place.

[1] https://www.kernel.org/doc/Documentation/development-process/4.Coding

So instead of making the search api "flexible" enough to also do document classification, and instead of trying to make it "flexible" enough to support vague, ill-defined concepts like "ML purposes", I would prefer that we added smaller, simpler, concrete apis that address those needs. For example, an API around document classification could be added that is very simple and easy to use, and has minimal parameters. Because it's been separated from scoring, it's no longer stuck with a priority queue + DAAT approach; it can use whatever datastructures it wants.

The search api already has enough problems, it's already doing enough "dark magic", it's already overwhelmed. Can we please separate this stuff?

s1monw commented 7 years ago

I am 100% behind that statement. I think we should try to simplify the entire scripting API. Let's detach this from vague statements and look at issues like https://github.com/elastic/elasticsearch/issues/17116: it's pretty much what I described above, and if we say we go with #17116 to allow this flexibility, then we can just move on. But apparently the ability to combine scores is something users would like to have. I am more than in favor of simplifying the script API to a minimum and using higher-level APIs, like the one in that issue, to add flexibility that doesn't cost much performance.

rmuir commented 7 years ago

I don't think we should do #17116 either. Instead of multiplying users can use the sum of logs.

brwe commented 7 years ago

> For example an API around document classification could be added that is very simple and easy to use, and has minimal parameters. Because its been separated from scoring, its no longer stuck with a priority queue + DAAT approach, it can use whatever datastructures it wants.

Just to be sure: You mean an actual api like

GET index/type/doc_id/_classify

?

rmuir commented 7 years ago

Sorry, I don't really have a concrete design for how classification should work: which algorithms are best or how we should expose it in our APIs.

Just hypothetically, I would think of it as a document enrichment type of thing, so maybe we start from, say, an ingest processor or whatever, that adds a new field (the assigned "classification") to the document, and then figure out from there how it should work. That would imply classifying documents before they are indexed at all, based on some model, whatever that is.

Maybe that model is something complicated, based on a bunch of index statistics from a separate index ("the training set"). Maybe it's just a decision tree based on a big pile of regular expressions. Or maybe it's best to leave such tasks outside of ES, to tools like nltk. Honestly, I have no idea :)

brwe commented 7 years ago

> Maybe that model is something complicated, and based on a bunch of index statistics from a separate index ("the training set"). Maybe its just a decision tree based on a big pile of regular expressions. Or maybe its best to leave tasks outside of ES to tools like nltk.

Ok, so here is another approach: say we had a "scripting language" that can do nothing but read model parameters for trained models (trees, logistic regression, whatever) that come from somewhere external, and then execute the read model on a document. It would not really be a scripting language, because the "scripts" are really just parameters and the execution is hard coded, but it would use the scripting mechanism so it could be used in aggregations, ingest etc. Would that be far enough away from search? Could we make it so that at least this "scripting language" would have access to term statistics?

rmuir commented 7 years ago

I don't understand the desire to use the scripting mechanism as the way to do this. Why wouldn't we just have the proper hooks in those apis (e.g. add an aggregation component, add an ingest component) to do this instead? e.g. bundled as a plugin.

Doing it with the scripting api (and plumbing it via scoring) adds unnecessary complexity and pressure to those components.

s1monw commented 7 years ago

ok, given that we should use scoring for scoring ;) and that we have some really strong arguments towards not mixing concerns, while only rather minor arguments towards opening things up further to satisfy the use case of ML or classification, we should remove this API without a full replacement. I would still like us to explore the way of a scripted similarity; I think this could help quite a bit for experimenting with scoring. @rmuir WDYT?

rmuir commented 7 years ago

> ok, given that we should use scoring for scoring ;) and that we have some really strong arguments towards not mixing concerns, while only rather minor arguments towards opening things up further to satisfy the use case of ML or classification, we should remove this API without a full replacement. I would still like us to explore the way of a scripted similarity; I think this could help quite a bit for experimenting with scoring. @rmuir WDYT?

Are users of ES typically information retrieval researchers? I can't think of who else would need that.

Like I said: if we really truly need it, then ExpressionSimilarity is the way to do it. However, I feel like all the time I would invest in such a feature may be wasted: nobody would ever use it :(

It really should not be so hard to remove things in elasticsearch. This issue is a classic example! That's why elasticsearch has half a million lines of code.

s1monw commented 7 years ago

> Like I said: if we really truly need it, then ExpressionSimilarity is the way to do it. However, I feel like all the time I would invest in such a feature may be wasted: nobody would ever use it :(

Can it be a lucene feature? I could see some folks using it...

s1monw commented 7 years ago

Well, to begin with, let's remove IndexLookup; then we can have individual issues about improvements, and we can still decide if it makes sense to add such a similarity.

softwaredoug commented 7 years ago

Just want to chime in on this issue as a practitioner. I have used term statistics in scripts and find them very handy. The use case that comes to mind is a small collection, hosted somewhere that doesn't allow plugins, that cares about relevance.

Sure, in a larger-scale system scripting isn't a great answer. But many folks do search at a smaller scale. These orgs look at you sideways when you start talking about maintaining custom scorers and similarities for them. They are terrified of Java and Lucene. And they don't want to keep having to hire someone to maintain even a simple similarity.

Second, Elasticsearch seems to be used as a general purpose matching/ranking system in analytics solutions, not strictly a search engine. I'm not just speaking from my own experience at OSC, but from Elastic's own marketing. Often, when doing weird things like searching DNA or predicting the weather or building recommenders, what a "term" is becomes a data modeling exercise.

In these cases, erring towards a flexible, "frameworky" approach IMO works best. Allowing several options to perform arbitrary scoring computations is really important for these users. I don't think the flexibility argument is meritless, based on the very cool things I see folks doing in Lucene.

So my bias as a practitioner, not as someone steeped in the ES code base, would be to caution you against any change that doesn't have an alternative for the folks who don't want to, or can't, maintain Java plugins. And something that lets the non-text use cases continue to do well with Elasticsearch.

I really like the idea of scripting scorers and similarities. That removes a lot of Java code and adds a ton for relevance use cases.

softwaredoug commented 7 years ago

I've seen uncertainty expressed a few times here about why users should need to control how scores are combined. I think it's vital. Lucene-based search has been successful compared to commercial offerings precisely because of how much you can customize its behavior. And the more of that I can do outside of Java, the easier it is for me to wield that power. (Otherwise, these days, I could just go plug in Algolia or something and be done with it.)

For example a "term" in Lucene world as I'm sure everyone has seen may have nothing to do with text. It might have something to do with product affinities in a recommendation system. Or ngrams of DNA. Or identifiers in some taxonomy. Or something even more exotic. And who knows then what the term stats around it are supposed to mean and how they might be used? In this context, the more control one can have before having to deal with custom plugins the more successful ES can be at solving the meaty search probs out there.

A high quality ES solution is still only as free as your time is (or as powerful as your time investment is). So while I do see more "IR researchers" in pure Lucene and Solr land, there's still quite a few using Elasticsearch precisely because it hits a nice sweet spot between control and ease of use compared to Solr or pure Lucene.

So again :+1: to scripted scorers and similarities. I see a lot of power there.

(PS I'm just trying to offer an "outside" practitioner's devil's advocate view of your efforts. Totally get that you guys have a large, complex code base to maintain, and I can empathize with the need to tidy things up and avoid bugs. Keep up the great work! :) )

rmuir commented 7 years ago

> For example a "term" in Lucene world as I'm sure everyone has seen may have nothing to do with text. It might have something to do with product affinities in a recommendation system. Or ngrams of DNA. Or identifiers in some taxonomy. Or something even more exotic. And who knows then what the term stats around it are supposed to mean and how they might be used? In this context, the more control one can have before having to deal with custom plugins the more successful ES can be at solving the meaty search probs out there.

Sorry this argument just does not hold. For example BM25 tells us exactly how to handle such "non-textual relevance features". Please read it.

http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf

softwaredoug commented 7 years ago

Great article Rob, I believe I've read it in the past. BM25 is pretty awesome for what it's built for (document search).

The area you cite provides a very general framework for a linear combination of scalars. It says

> Here are some Vi functions that we have used with success in the past for different features (lists sigmoid, log, etc)

The linear combination looks like the cosine similarity summation under the Bool query. But most of Elasticsearch's documentation (and my experience) points to multiplying text relevance by these scalar factors, rather than treating them as another "boolean clause" to optimize for, as the first tool to reach for. Indeed, this is what Elasticsearch's function_score query does by default. It's likely both approaches have their applications.

Additionally, like I said, BM25 is proven at general document search. This is only one application of Elasticsearch. I still believe Lucene-based search's sweet spot is as a framework, in that (1) its focus is on deep configurability and (2) it offers a variety of ways of thinking about and configuring "relevance". People want to use Elasticsearch to build recommendation systems, perform image search, and search chemical structures. They have something unique to what they're doing; that's why they didn't grab an off-the-shelf product that did what they wanted.

I don't want to derail this thread. I'm just trying to offer my support and perspective on where I see Elasticsearch in the marketplace when people are making the "buying decision." Little features like tf in scripts seem like a small thing, but in some sense it's these little bits of configurability that sell me on Lucene-based search, as opposed to a commercial "plug and play" model where all the decisions are made for me.

rjernst commented 7 years ago

I've been thinking about how to allow looking at index stats for "advanced scripts", but without needing IndexLookup, and I think the answer falls in line with #19966. Without IndexLookup, the script engine api will still take the LeafReaderContext (which is currently used to construct the LeafIndexLookup). Replacing native scripts with advanced users simply writing script engines means they would still have access to statistics by using the LeafReader from the LeafReaderContext. IMO this is better anyway, as LeafReader is well documented and does not have the overhead that IndexLookup incurs even when it is not used (setDocument is always called, advancing various iterators, even when they are not used by a script).
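As a hedged illustration of this point (assuming Lucene on the classpath; the helper names `docFreq` and `termFreq` are hypothetical, not an Elasticsearch API), the statistics IndexLookup exposed can be read straight off the LeafReader:

```java
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class StatsFromLeafReader {
    // segment-local document frequency of `term` in `field`
    static int docFreq(LeafReaderContext ctx, String field, String term) throws IOException {
        Terms terms = ctx.reader().terms(field);
        if (terms == null) return 0;
        TermsEnum te = terms.iterator();
        return te.seekExact(new BytesRef(term)) ? te.docFreq() : 0;
    }

    // term frequency of `term` in `field` of segment-local document `docId`
    static int termFreq(LeafReaderContext ctx, String field, String term, int docId) throws IOException {
        Terms terms = ctx.reader().terms(field);
        if (terms == null) return 0;
        TermsEnum te = terms.iterator();
        if (te.seekExact(new BytesRef(term)) == false) return 0;
        PostingsEnum postings = te.postings(null, PostingsEnum.FREQS);
        return postings.advance(docId) == docId ? postings.freq() : 0;
    }
}
```

Nothing here advances any iterators unless the script actually asks for a statistic, which is the overhead argument above.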

MLnick commented 6 years ago

Hey folks

Late coming to this but I used indexLookup in a native script in my vector-scoring plugin: https://github.com/MLnick/elasticsearch-vector-scoring.

This use case doesn't access TF/DF, but needs to access term vector (specifically PAYLOADS).

I see there might be a way to use ScriptEngine to implement this now. But it does seem that, other than a native script, there is no way to do something similar using Painless. (Well, other than using the source, and I think that would be extremely slow? It was very slow before when I tried a version using that.)
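For reference, a ScriptEngine with access to the LeafReader could read payloads from a stored term vector roughly like this (a hedged sketch assuming Lucene on the classpath and a field indexed with term vectors, positions and payloads; `ctx`, `docId` and `"vector_field"` are placeholders):

```java
// iterate the term vector of one document and visit each payload
Terms vector = ctx.reader().getTermVector(docId, "vector_field");
if (vector != null) {
    TermsEnum te = vector.iterator();
    BytesRef term;
    while ((term = te.next()) != null) {
        PostingsEnum postings = te.postings(null, PostingsEnum.PAYLOADS);
        postings.nextDoc(); // a term vector holds exactly one "document"
        for (int i = 0; i < postings.freq(); i++) {
            postings.nextPosition();
            BytesRef payload = postings.getPayload();
            // decode the payload bytes here, e.g. as a float vector component
        }
    }
}
```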

But users on cloud deployments typically have locked-down ES clusters where they aren't able to install custom plugins. So under the new ScriptEngine this use case is impossible in those deployment scenarios (e.g. see https://discuss.elastic.co/t/how-do-i-access-term-vector-in-painless-scripting/84579).

rjernst commented 6 years ago

@MLnick There is a tradeoff we must balance between ease of use and customizability. We tend to favor ease of use for the common cases, and are happy to add customizability with less ease of use. In this case, the complexity of maintaining access to these very low level and advanced structures within lucene was not worthwhile. As you said, one can still access these with a custom ScriptEngine. And the Elastic Cloud does support custom plugins.

mickeysjm commented 6 years ago

Hi all,

I want to make sure my understanding is correct: in ES 5.4.0, the only way to access index statistics in scripts is to write a custom ScriptEngine, which can only be written in Java. If that is the case, I want to further ask whether there is a way to force the use of Groovy in ES 5.4.0, assuming it has not been completely removed.

Thanks.

rjernst commented 6 years ago

@mickeystroller Please ask questions on our forum. We use github for feature requests and confirmed bug reports. Commenting on closed issues often results in missed messages. It is best to open a new discussion on the forum for a question such as this.