elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Significant text - new aggregation type #23674

Closed markharwood closed 7 years ago

markharwood commented 7 years ago

There are a number of concerns with using the existing significant_terms aggregation on text fields:

1) Memory costs: access to the text in documents relies on setting fielddata:true, which is prohibitively expensive for most indexes.

2) Unintelligible results: techniques such as stemming make raw index terms hard for users to interpret - terms may first need "de-stemming" for presentation back to end users.

3) Weak signals: individual terms aren't always useful - identifying significant phrases, however, can be much more interesting.

4) Misleading suggestions: statistical analysis of word use in free-text is often hampered by duplications found in noisy data. Retweets, email reply chains or boiler-plate copyright notices tend to amplify the significance of the same paragraphs or any typos they contain. Removal of near-duplicate text in samples is required; fortunately, similar documents rank similarly, so duplicate text often appears close together in search result samples.

The core logic in the existing significant_terms aggregation originally came from a different project that did all of the above on free text; most of the above concerns were introduced when porting/trimming this logic to work with the data model imposed by the elasticsearch aggregations framework (which is designed for access to structured field data and doc values).

For this new significant_text aggregation I propose we examine token streams from a sample of high-ranking documents, where we have access to position information and raw text. Raw token streams can be obtained using the existing TermVectors API, and by feeding them through a sequence of special TokenFilters we can further process the stream to remove duplicate paragraphs, spot phrase candidates and otherwise gather the information required to find useful suggestions from free text.

As a first step we could just look to introduce a significant_text aggregation that gathers term statistics using the TermVectors API as an alternative to fielddata, addressing issue 1). This is the major pain point for most users. The other issues 2), 3) and 4) are nice-to-haves and can be tackled at a later stage, once we have laid the foundations for analysing streams of text processed through TokenFilters.
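For illustration, a TermVectors request of the sort referred to above might look something like the following minimal sketch (the news index, document id and body field are hypothetical, and the endpoint shown is the current single-document form):

```
GET /news/_termvectors/1
{
  "fields": ["body"],
  "positions": true,
  "offsets": true,
  "term_statistics": true
}
```

The response lists every analyzed token in the field along with its positions and character offsets (plus term statistics if requested), which is the raw material a token-stream-based significance pass would consume.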

markharwood commented 7 years ago

POC results showing the need for sampling/duplicate text removal

I've got a proof of concept running on the Signal news media dataset (Creative Commons 3.0 license, 1m news articles).

In the POC I can use significant terms without accessing fielddata - I re-analyze content of the top-matching docs to gather term stats. The results are a good example of the challenges of free-text analytics. Consider this search for news articles mentioning "elasticsearch":

[Kibana Console screenshot]

I have to use a sampler agg to avoid re-tokenizing too many docs. The diversified sampler is used to try to eliminate copies of the same press release based on a hash of the doc title. However, this exact-match de-duping is inadequate. Consider this "significant" term in the results: currensee. If we drill down with the highlighter we can see why "currensee" is statistically significant, and it comes down to near-duplicate docs:

[Kibana Console screenshot]

This is a very typical scenario working with free-text where the same information is often copied and remixed and re-shared. Think press releases, news quotes from individuals, copyright notices/boilerplate, email reply chains, retweets etc. The DeDuplicatingTokenFilter first submitted in this PR is an effective means of trimming this sort of noise (at least on one shard).

We should consider whether near-duplicate removal is a necessary part of a first release of a significant_text agg. I have made the mistake previously of trying to lump too much functionality into a single PR.
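For reference, the exact requests behind the screenshots aren't preserved here, but the shape of the query described above - a diversified sampler de-duplicating on the title, wrapped around the significance aggregation - would be roughly as follows. This is only a sketch: the news index, title.keyword and body fields are assumptions, and significant_text stands in for the POC's fielddata-free significant terms:

```
GET /news/_search
{
  "size": 0,
  "query": { "match": { "body": "elasticsearch" } },
  "aggs": {
    "dedup_sample": {
      "diversified_sampler": {
        "shard_size": 200,
        "field": "title.keyword",
        "max_docs_per_value": 1
      },
      "aggs": {
        "keywords": {
          "significant_text": { "field": "body" }
        }
      }
    }
  }
}
```

The diversified sampler only caps how many sampled docs share the same title value - exact-match de-duping - which is why near-duplicates like the "currensee" press releases can still slip through.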

markharwood commented 7 years ago

Some comments for future dev following an internal discussion:

pkphlam commented 7 years ago

@markharwood Apologies for possibly hijacking a closed thread, but I just read the blog post on this feature and it is really interesting because it is quite similar to something we have been trying to hack together with Elasticsearch. The de-stemming issue (2) is definitely a big one for us.

I wanted to flag two other potential issues/features that we have been thinking about that would possibly be good additions to the list you already have for future implementations.

  1. Using only terms within a certain window from query string matches. Essentially, if our query is a query string match, we've been thinking about ways to only consider terms that occur "near" a matched query term, where "near" can be user-defined. Think about really really long documents such as memos or patents, where there are many different sections, some of which are irrelevant. You get at this a little with the filtering of duplicate text, but there are times where the "noise" is not necessarily duplicate text. We would consider terms that are near one of our query terms to be more important contextually than terms that are very far away from our query terms, so some way to either de-emphasize or filter out terms that do not occur within a selected window of our query terms would be helpful. Think of it as filtering our document down to "snippets" and then drawing terms only from the snippets. Since you leverage the position information for each term, I would assume that this is possible.

  2. Exposing term frequencies as a possible parameter for scripted scoring in addition to the document frequencies.

markharwood commented 7 years ago

Hi and thanks for the input!

Using only terms within a certain window from query string matches.

We could reuse the highlighters to help isolate the interesting sections of docs.

Exposing term frequencies as a possible parameter

I have an inherent distrust of individual documents - I find a more balanced view of a topic comes from looking at stats across many example docs. If a term is repeated legitimately (i.e. not a one-off webpage with a spammy example of keyword stuffing), then this increased TF manifests itself as a higher percentage of docs containing the term. The same argument might apply to only looking at "interesting" sections of pages. If pages consist of on-topic and neutral sections then we do not need to trim the neutral sections using a highlighter to identify keywords - the significance algo can separate the signal from the noise given a healthy sample of docs.

Do I detect from these suggestions that you have scenarios where the "foreground sample" being analyzed is perhaps a result set of only one document? For low-doc-count examples, one approach I've used before is to internally treat a large doc as multiple fake docs segmented on sentence or n-word boundaries. You can then use the same significance heuristic algos, which test the percentage of foreground/background docs containing a term - you just happen to lie to them about what a "doc" is in the foreground set.

pkphlam commented 7 years ago

Thanks for the response!

We don't really use foreground sets of one document either. The term frequency suggestion is less thought out, but we were thinking about ways to create new scoring metrics, and one possible idea was comparing foreground and background TF-IDF-type metrics in some way; it seemed interesting that TF was not exposed given that it is core to some of the other ES functions. But again, we were just starting to think about that.

On the suggestion for highlighters/windows, we currently do a version of your multiple-fake-docs example. We pre-segment documents before ingest and then store each segment as a child in a parent-child setup, where the main document metadata and text fields are indexed as the parent and the tokens are indexed as children, with each child document being one segment. We then run significant terms on the child doc segments that match the query. As you can imagine, while this sort of works, it is inflexible in that the segments are pre-set and thus cannot be tuned to an evenly spaced window around the matching terms. Also, matching the segments gets a little tricky because you often have multi-word queries where the full document itself matches the query but no specific child segment fully matches it.

The reason why we try to do this segmenting on longer documents is because we try to differentiate between "substantively meaningful" significant terms and "discriminant but not substantively meaningful" significant terms. I do think that in many cases, you are correct in that the significance algorithm should be able to separate out interesting signal terms even with on-topic and neutral sections all combined. However, in practice we found that that is not always the case across all different document types.

For example, hypothetically, one might imagine a case where you have a large set of academic articles across different disciplines. You start with a query that is a small subfield or topic of one specific discipline. Now imagine that all papers have a "methodology" section and different disciplines talk about different methodologies in different ways. But the methodology terms themselves are not necessarily that interesting. What is interesting is the substantive argument of the paper and the terms used in the arguments. What might end up happening is that some of the methodology terms end up on your list (because they separate different disciplines well) and are sometimes even placed ahead of potentially interesting substantive terms, depending on how the query was constructed. However, if you are able to cull the sections so that only the on-topic non-methodology sections are considered (here it's assumed that the interesting substantive parts of a paper have little overlap in terminology with the methodology parts), then you might get a better result. In general though, I think it is often true that more interesting words lie closer to the query matches, and so having the ability to isolate sections can increase the precision and quality of terms.

pkphlam commented 7 years ago

Also, I know I mentioned this already, but I wanted to re-emphasize the de-stemming (2nd) suggestion that you had, both in terms of how much of a pain point it is for us and how much our customers desire it. Our product uses significant terms heavily (along with some machine learning) as a keyword recommender, and time and again the most impressive feature our clients mention is the ability to present words grouped by their stems. It seems like such a simple thing, but there is very little out there that provides a stem -> original word(s) presentation.

It's also a huge pain point for us because, to support such a feature, what we currently have to do on data ingest is hit the mtermvectors API to get the word -> stem mapping, store all the mappings with counts in a separate SQL database, and then call the database every time we suggest words. It adds an incredible amount of ingest time and makes our software stack that much more complicated. If there were a way to do on-the-fly de-stemming, where ALL words in the document set that correspond to a specific stem can be reproduced on the fly, I think it would be really helpful and unique.
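For context, the per-document term vector lookup described here is roughly the following (a sketch with a hypothetical docs index and body field); each returned token carries its character offsets, which is what allows a stemmed term to be mapped back to the original word in _source:

```
POST /docs/_mtermvectors
{
  "ids": ["1", "2", "3"],
  "parameters": {
    "fields": ["body"],
    "positions": true,
    "offsets": true
  }
}
```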

markharwood commented 7 years ago

there is very little out there that provides a stem -> original word(s) presentation.

That's another thing highlighter logic can help with. It already identifies the sections of docs that produced stemmed forms, and de-stemming is then a matter of finding the most popular examples of original text associated with each stemmed form.

yoav2 commented 6 years ago

So currently there is no "de-stemming" in significant_text? Are there any plans to implement it? Are you suggesting to search for the returned results and take the marked highlights?

markharwood commented 6 years ago

So currently there is no "de-stemming" in significant_text? Are there any plans to implement it?

Correct - no immediate plans to look at this.

Are you suggesting to search for the returned results and take the marked highlights?

Currently we already retrieve and re-analyze the matching documents' source to find the significant words. Once determined, we could use something like a highlighter to find where the top-scoring terms lie in the source text and what the most popular form is for each stem.
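As a rough sketch of that idea (hypothetical news index and body text field, with "disrupt" standing in for a stemmed suggestion returned by the aggregation): running the suggestion back through the same analyzed field as a query and asking for highlights surfaces the original word forms - disrupted, disruption, etc. - that produced it, and counting those highlighted forms gives the most popular surface form per stem:

```
GET /news/_search
{
  "size": 20,
  "query": { "match": { "body": "disrupt" } },
  "highlight": { "fields": { "body": {} } }
}
```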

yoav2 commented 6 years ago

Thank you. Another small question - are there any performance differences between significant_text and significant_terms?

markharwood commented 6 years ago

When it comes to working with text, the differences are more around memory cost and results quality rather than performance. significant_terms relies on loading all of your docs' text fields into RAM using fielddata - this is prohibitively expensive for systems with lots of docs. Any repetition of text (of which there's typically lots in real-world data) will skew the stats and you start showing oddities.

significant_text does not rely on fielddata caching all of your docs' text; it re-analyzes doc source on the fly, which means it can also remove duplicate sections of text, greatly improving the quality of suggestions.
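A minimal Console sketch of the shipped aggregation, assuming a news index with a body text field: filter_duplicate_text switches on the near-duplicate trimming described above, and the surrounding sampler keeps the re-analysis cost bounded to the top-matching docs.

```
GET /news/_search
{
  "size": 0,
  "query": { "match": { "body": "elasticsearch" } },
  "aggs": {
    "sample": {
      "sampler": { "shard_size": 100 },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "body",
            "filter_duplicate_text": true
          }
        }
      }
    }
  }
}
```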

eranhirs commented 5 years ago

@markharwood would love to hear your thoughts regarding significant_text on issue #42780