elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Support for highlighting extracted entities. #29467

Closed markharwood closed 5 years ago

markharwood commented 6 years ago

Background

Highlighting and entity extraction are cornerstones of search systems and yet they currently do not play well with each other in elasticsearch/Lucene.

The two techniques are often used together in systems with large amounts of free-text such as news reports. Consider this example search which combines free-text and a structured field derived from the free-text:

[image: example search combining a free-text query with a structured person field]

In this particular example highlighting works for the entity Natalia Veselnitskaya but would not work for the entity Donald Trump Jr.

Issue - brittle highlighting

Sadly, the terms deposited in structured keyword fields like "person" by entity extraction tools rarely exist as tokens in the free-text fields where they were originally discovered. The traceability of this discovery is lost. In the example above the natalia veselnitskaya entity only highlights because I carefully constructed the scenario:

1) I lowercase-normalized the person keyword field's contents
2) I applied lowercase and 2-word shingles to the unstructured text field

This approach was a good suggestion from @mbarretta but one which many casual users would overlook, and it is still far from a complete solution. Donald Trump Jr. would require a 3-word shingle analyzer on my text field, and one which knew to preserve the full-stop in Jr. - but I don't want to apply 3-word shingles to all text or retain all full-stops. This is clearly a brittle strategy.
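
For reference, a minimal sketch of the kind of index setup described above - the index, analyzer and filter names are illustrative, and the mapping is written in the type-less form:

PUT news
{
  "settings": {
    "analysis": {
      "filter": {
        "two_word_shingles": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2
        }
      },
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": [ "lowercase" ]
        }
      },
      "analyzer": {
        "shingled_text": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "two_word_shingles" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "person": { "type": "keyword", "normalizer": "lowercase_normalizer" },
      "article_text": { "type": "text", "analyzer": "shingled_text" }
    }
  }
}

With this in place a value like natalia veselnitskaya in the keyword field happens to line up with a 2-word shingle emitted from the text field - which is exactly the fragile coupling being complained about.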

The irony is that entity extractors such as OpenNLP, Rosette or even custom regex have the information required to support highlighting (extracted entity term and offset into original text) but no place to keep this data. Entity extraction is really concerned with 2 or more fields - an unstructured source of data and one or more structured fields (person/organisation/location?) to deposit findings. Because Analysis is focused on single fields, we are left with no means for entity extractors to store the offsets that provide the traceability of their discoveries and which standard highlighters could use.

Possible Solutions

1) "Internal analysis" - entity extraction performed as Analyzers

(I offer this option only to show how bad this route is...) If the smarts in entity extractors were performed as part of the Lucene text analysis phase they could potentially emit token streams for both structured and unstructured output fields.

2) "External analysis" - entity extraction performed prior to indexing

In this approach any entity extraction logic is performed outside of core elasticsearch, e.g. using python's nltk or perhaps reusing human-annotated content like Wikipedia. The details of discoveries are included in the JSON sent to elasticsearch. We would need to allow detailed text offset information of the type produced by analysis to be passed in from outside - akin to Solr's pre-analyzed field. This information could act as an "overlay" to the tokens normally produced by the analysis of a text field. Maybe text strings in JSON could, like geo fields, be presented in more complex object forms to pass the additional metadata, e.g. instead of:

"article_text": "Donald Trump Jr. met with russian attorney"

we could also support this more detailed form:

"article_text": {
    "text": "Donald Trump Jr. met with russian attorney",
    "inject_tokens" : [
          {
                "token": "Donald Trump Jr",
                "offset": 0,
                "length": 16
          }
    ]

}

A custom Analyzer could fuse the token streams produced by standard analysis of the text and those provided in the inject_tokens array.

elasticmachine commented 6 years ago

Pinging @elastic/es-search-aggs

spinscale commented 6 years ago

+1 on the second approach, if we decide to do this - it allows us either to use ingest processors or to have completely external NER. Using Lucene analyzers would also mean that we have to keep the models on all Elasticsearch data nodes.

markharwood commented 6 years ago

I've experimented with the example of object-based representations of text strings in JSON (with text and inject_tokens properties) and can see that there are some potential issues. I had to modify TextFieldMapper to accept object-based strings for indexing, and for highlighting to work I had to hack the SourceLookup.extractRawValues method to look for body.text in the source map if the user asked for the body field to be highlighted. It's not clear to me how users should refer to the body field when making requests for highlighting or source filtering etc. - should they refer to body string fields or body.text? Also, the majority of the time in things like hits or top_hits aggs the users are unlikely to be interested in seeing all the inject_tokens echoed back as part of the source - these are more of an indexing detail for the search engine than useful information for the end user.

mayya-sharipova commented 6 years ago

@markharwood very interesting issue and use case.

Also +1 for the second approach.

It would be cool if we could also add the type of the token:

"inject_tokens" : [
          {
                "token": "Donald Trump Jr",
                "offset": 0,
                "length": 16,
                "type":  "person"
          }
    ]

Have you thought about designing a special type of query for this, something like:

"query": {
    "entity" : { "person" : "Donald Trump Jr" } 
  }

About the implementation details: it would be cool if we could add an additional tokenStream to a field besides the one traditionally analyzed. Custom similarity modules could then combine both tokenStreams in a custom way. I have heard similar requests about possible ways to index vector embeddings with a field's content. Some people are using payloads for this, but it doesn't scale well.

markharwood commented 6 years ago

It would be cool if we can add even the type of token:

Yep, anything a regular Token produced by internal analysis can contain should be on the table. I don't think we do anything with Token type info at query time but that's probably for another issue.

it would be cool if we could add an additional tokenStream to a field besides the one traditionally analyzed.

My assumption is that we'd have to emit one fused TokenStream at index time containing a combo of the internally-generated tokens and the externally-provided ones.

Also, anything that re-analyzes _source text strings at query time will need to repeat this stream-fusing logic. Candidates may include highlighters, the MoreLikeThis query, the significant_text aggregation and ML's CategorizationAnalyzer.

I'm not sure how best to handle this - either rely on persisted tokenisation (e.g. TermVectors) for these classes to work properly, or try to abstract the way TokenStreams are obtained from a fusion of on-the-fly Lucene text analysis and the externally-provided tokens.

markharwood commented 6 years ago

I discussed this with @jpountz @jimczi @romseygeek and others and the suggestion was that we should ideally store the externally-provided entity tokens in a separate "annotations" field rather than splicing them directly into the tokens of the text field. These annotations should be thought of as similar to the fields sub-properties in a field mapping - alternative ways of indexing the original JSON text. Let's call both of these concepts "indexed variants" or IV fields for the moment.

There are a number of existing challenges with support for IV fields which we might choose to address in spin-off issues:

1) Highlighters need to be able to highlight the original text using one or more selected IV fields that contain the tokens used by the user in the query
2) Positional queries (Spans or the new interval queries) need to support finding "X near Y" where tokens X and Y may be stored in different IV fields.
3) Other forms of query-time analysis such as the significant_text aggregation, the MoreLikeThis query or ML's CategorizationAnalyzer may need to support the selection of one or more IV fields.

In 1) and 3) above, the new "annotations" IV field would break all their existing assumptions about how to get hold of a token stream for original JSON strings. It is not enough for them to pass the original text to the field's Analyzer for it to re-tokenize. The "annotations" IV field would require callers to pass a different context to retrieve the externally-provided list of tokens. This context might be a Map of the source data, some Lucene object like Document - it's unclear to me how we would abstract this more cleanly - especially when we consider the "TermVectors" alternatives for providing any pre-stored tokens. It is perhaps also worth formalising the connection between the annotations field and the source text to which it relates. Maybe with a special definition in the mapping. This would help us validate that when highlighting the foo_text JSON field, the "foo_annotations" field is an appropriate choice of IV field whose positions and offsets still relate to the foo_text content.

markharwood commented 6 years ago

Damn. If we adopt the "separate field" approach to storing entity annotations (as suggested in my previous comment) then we can't use positional queries like span/interval. These queries measure token proximity using position increments, not character offsets. Using a separate field to store entity annotations would mean that it would be impossible to record position increment values that tie up with the tokens recorded in the text field.

For the record - the highlighting and positional query capabilities I would hope to enable are demonstrated here: https://youtu.be/kbK3D_pULd4
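
For illustration, this is roughly the kind of positional query that becomes possible once an entity token and the surrounding text tokens share a single field and position space (the field name and the exact form of the indexed entity token are assumptions here, since the token encoding was still undecided at this point):

GET news/_search
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "article_text": "Donald Trump Jr" } },
        { "span_term": { "article_text": "attorney" } }
      ],
      "slop": 10,
      "in_order": false
    }
  }
}

This only works if the entity token records position increments compatible with the surrounding text tokens, which is the point being made above.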

jpountz commented 6 years ago

If we adopt the "separate field" approach to storing entity annotations (as suggested in my previous comment) then we can't use positional queries like span/interval.

I'm not sure this is a blocker. For instance I could imagine that we could merge-sort two token streams in order to reconstitute one that has both the raw tokens and extracted entities. I vaguely remember @romseygeek talking about something like that but could be wrong.

markharwood commented 6 years ago

Inline vs external tagging styles

NLP tools such as Apache UIMA and GATE can export a rich-text format of annotated text where annotations are in-lined around selected text by introducing special markup (traditionally XML) to identify items of interest. This is similar to how HTML uses <a href="foo.com"> tags to introduce hyperlinks around selected text. The advantage is that no position-offset information is required, which can be brittle when different character encodings are used between systems. The disadvantage is that XML-like structures may be hard to express in JSON. Perhaps the {{...}} style of escaping annotation text popularised by HTML templating engines may be another approach. Whichever escaping format is used, this approach would rely on elasticsearch supporting a new rich-text format which keeps both text and annotations together.

The alternative format is offset-and-length based annotations such as those provided by OpenCalais where any annotations are listed separately from the text using tags, positions and offsets to reference areas of the original text where entities were discovered. This approach would rely on elasticsearch supporting a new "annotations" field type and defining the text field to which it relates in the mapping.
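
To make the contrast concrete, an offset-and-length based document might look something like this (a purely hypothetical shape - the article_annotations field name and its properties are illustrative, echoing the earlier inject_tokens idea):

{
  "article_text": "Donald Trump Jr. met with russian attorney",
  "article_annotations": [
    {
      "type": "person",
      "value": "Donald Trump Jr.",
      "offset": 0,
      "length": 16
    }
  ]
}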

mbarretta commented 6 years ago

Another approach I've seen in the past was a graph-style annotation schema that took the raw text as a single field and attached various low-level analytics (tokens, mainly) as nodes pointing to offsets of that text and higher-level analytics (SBs, POS, ultimately entities) pointing to the lower-level nodes or to "beginning" and "ending" nodes.

The spec for it is here: https://github.com/pagi-org/spec/blob/master/pagi.md

markharwood commented 6 years ago

Not heard of pagi before, thanks Mike. My gut feel is that we should aim for simplicity here and not try to support:

a) overlapping annotations in the text
b) annotation type hierarchies (e.g. person -> organisation)
c) annotation relationships (e.g. text declares annotation 1 has an "employed_by" relationship with nearby annotation 2)
d) multiple annotation properties, e.g. person annotations possibly having "name", "age", "gender" attributes etc.

The simplest option is to support an annotation as a single Token (string + pos + offset + len) whose string value serves as both a (hopefully) unique ID and a human-readable label. The dandelion NER plugin, for example, uses Wikipedia URLs, which are both unique IDs and human-readable strings containing the entity name.

markharwood commented 6 years ago

@jpountz @jimczi We had a meeting on this and came up with the following decisions:

1) Annotated text should be presented inline using tags

A new field type is required ("annotated_text"?) which accepts strings that are interspersed with markup e.g.

"text" : "They met with <a type=`person` value=`Donald Trump Jr.`>Don junior</a> and ..."

The exact tag-escaping mechanism was not decided (thoughts, @clintongormley?) but it should allow users to express a type and a value which would be indexed as a single token at the same offset and position as the text it surrounds. The value would be indexed as-is, so not lower-cased or otherwise formatted. The type information may appear as a payload or possibly as a prefix on the value (as yet undecided). The advantage of the in-line tagging format is that external clients would not have to pass any offsets and lengths for annotations, which can be problematic when translating between client and server character encodings. In-line tags also mean we will not support overlapping tokens.

2) Annotations are additional tokens indexed into the same "text" index field

Entity annotations can be thought of as synonyms that expand on the text tokens recorded during analysis. Advantages of putting them in the same text field rather than a separate "text.annotations" indexed field are:

a) existing highlighters work with a single token-stream and not a fusion of multiple indexed fields
b) positional interval queries also only work with a single indexed field

The disadvantage is that traditional text-based queries may unexpectedly highlight/match annotation-introduced tokens. We felt this could be mitigated if the annotation tokens adopted a convention like type-prefixing that would ensure there weren't unintentional matches.

3) May only work with one choice of Highlighter impl

The challenge of working on the text of a field that contains both text and annotation markup may mean that we have to "special case" the new annotated_text fields so that they only work with one highlighter impl - possibly a variant of the unified highlighter.

4) Punted for later releases:

a) We won't be allowing in-text annotations to declare "copy_to" commands for any structured keyword fields. This is a convenience for clients that would be hard for us to implement and they do have a work-around. Clients could instead pass JSON source that included identical person names in the text annotations and the structured person keyword field.

b) We won't be offering a means to ask for hits in results that return the text field without the annotation markup. I can see people might want it but we'd need to work out a way to make this clean.

c) MoreLikeThis and Significant Text agg both try to identify statistically significant items in text, doing so using re-analysis. Whether this should include any annotations is perhaps open to interpretation so we did not decide on a policy or any additional user-facing controls that might be needed here.

markharwood commented 6 years ago

I've opted for encoding annotations in text using a markdown-style syntax. The original text appears between [ ], followed by a url-like syntax between ( ) used to describe the entity value and type, e.g.

 "text" : "They met with [Don junior](type=person&value=Donald%20Trump%20Jr) and ..."

Note that the type and value of the entity are url-encoded parameters. Note also that markdown is a permissive syntax, meaning that regular uses of [ in the text don't have to be escaped; it is only when the use of these characters matches the pattern [...](...) that it is interpreted as a URL or, in our case, an entity reference.

markharwood commented 6 years ago

I'd like to add the option of injecting multiple annotation tokens for a given piece of text.

[image: example of highlighted text that should carry both a person token and a role token]

e.g. in the highlighted text above I want a token to identify both the person and the role.

The question is how to encode multiple tokens in the annotation's markdown-like syntax. I can think of 3 approaches:

1) Simple key/value pairs

In this syntax the token type and value are collapsed into simple key/value pairs:

 `he paid [John](person=John+Smith&role=payee)`

2) Multiple numbered token properties

In this syntax the type and value (and potentially other attributes) for each token are associated using numbers:

`he paid [John](type1=person&value1=John+Smith&type2=role&value2=payee)`

3) More complex encoding

We could introduce extra escaping into the url-like syntax to have a comma-delimited list of annotation attributes, or perhaps use JSON curly braces instead of the (url) syntax, e.g. paid [John]{...}

I like the simplicity of 1) but it does preclude having any token attributes other than type and value - we couldn't for instance introduce anything that added payload information in future. Currently we only use value in the search index - the type part of a token only has potential use in clients rendering this text in type-specific ways.

Proposal

We should reserve [](...) syntax for the simple key/value syntax and use []{...} for any advanced JSON-like syntax we may come up with in future.

markharwood commented 6 years ago

"Copy_to" from annotations to structured keyword fields looks like it may be tricky.

Copy_to impl works by passing the same JSON property text to multiple fields (see DocumentParser.parseCopy(field, parseContext)). Each target field currently reparses the original JSON string (text-plus-markup). Ideally we'd pass only the parsed annotation token values to keyword fields, based on the token type, e.g.

{
    "my_annotated_field": {
        "type": "annotated_text",
        "copy_annotation_types_to": {
            "person": "my_entities_keyword_field",
            "role": "my_roles_keyword_field"
        }
    }
}

Adding this would require a change to DocumentParser to allow annotated fields to pass back "virtual" document properties that are just the annotation token values presented to target fields as if they were there in the original JSON.

Proposal

Copying annotations to structured fields looks too messy to attempt inside elasticsearch. Maybe an ingest pipeline processor that understands the annotation markup syntax is a better way to copy these field values around. Certainly NLP tools like OpenNLP already perform this kind of copy-to logic when extracting entities from the raw text. In future that tool and others like it will likely automate the process of both marking up the annotated text and copying discoveries to structured fields in the JSON. Any human-authored docs may still get it wrong (e.g. forgetting to add an annotation value to a related structured field) but I expect the majority of value-copying will be done automatically by upstream tools in practice.
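
As a rough illustration of that idea, an ingest pipeline along these lines might do the copying (entirely hypothetical: the pipeline and field names are made up, the Painless regex literal requires script.painless.regex.enabled to be set, and URL-decoding plus splitting of multi-value annotations on & is omitted for brevity):

PUT _ingest/pipeline/copy_annotations
{
  "description": "Hypothetical sketch: copy annotation values out of annotated_text markup into a keyword field",
  "processors": [
    {
      "script": {
        "source": "def m = /\\[[^\\]]+\\]\\(([^)]+)\\)/.matcher(ctx.my_annotated_field); def values = []; while (m.find()) { values.add(m.group(1)); } ctx.my_entities_keyword_field = values;"
      }
    }
  ]
}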

jimczi commented 6 years ago

We could also create extra fields directly in my_annotated_field - it could be per annotation type (my_annotated_field.person, my_annotated_field.role) or a single field my_annotated_field.annotations, and add the type as a prefix in the annotation, e.g. person#madonna? I think this field should be able to handle the indexation of the text and the annotations in a doc_values field automatically, otherwise you'll need to handle the format in a lot of places.

markharwood commented 6 years ago

and add the type as a prefix in annotation

It certainly would be nice to exploit the type information in the annotation tokens. This is something I think is generally missing in our existing mapping definitions - client tools such as Kibana don't appreciate that the values found in, say, the from keyword field can also be used in the to keyword field because the tokens both represent the same entity type (email address).

Nothing in the mappings declares these fields' tokens are interchangeable. I'd like to see this entity-type information in the mapping alongside the existing choice of storage-type (eg keyword). Generic clients like Kibana could then understand how discoveries in one field were exploitable in other fields for search or highlighting purposes. Annotated tokens are perhaps the first place in elasticsearch where an idea of entity type (person, organisation etc) is introduced independently of the type associated with the field that contains it. It would be nice to carry this "entity type" info further into our mappings.

markharwood commented 6 years ago

I think when highlighting an annotated_text field it will be useful to return hit information using the annotation syntax.

Benefits

1) Rather than plain <em></em> tags we can pass extra "hit" information in the url-parameter-like syntax, such as the actual search term that matched and possibly scoring weights. An example hit for a search for "tesla" might be marked up as follows: brand new [Tesla](_hit.term=tesla&_hit.score=3.32) launched.

2) It would also be useful to return any other non-matching annotations from the original text, e.g. in the Wikipedia example below the only thing highlighted in yellow is the searched text, but the original JSON contains multiple people annotations which would be useful to have marked up in the client too:

[image: Wikipedia search result where only the matched text is highlighted despite several person annotations in the source]

A sophisticated client (Kibana?) could make good use of all this extra metadata embedded in the text, rendering results with hyperlinks, different colours, font-weights etc.

Downsides

Approach

The implementation would be a special PassageFormatter for the existing UnifiedHighlighter. When mixing search terms and pre-existing annotations in the final markup the rules would have to be as follows:

markharwood commented 6 years ago

Type-less annotations are now possible

To make life simple it is now possible to have annotations of this form:

[Cook](Tim+Cook&CEO) announced the new iphone

In this case the annotation values (Tim Cook and CEO) are just separated using & characters and injected directly into the token stream. We no longer require this type=value syntax:

[Cook](person=Tim+Cook&role=CEO) announced the new iphone

The above syntax is still supported because, even though the types (person and role) are not used at all by the elasticsearch server, a client may have use for these when rendering the text and may want to use person icons etc.

The docs now promote the "type-less" syntax, which should help remove some of the confusion around how any type information may or may not be used in the indexing process.
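
For anyone following along, a minimal sketch of what indexing with this field type looks like (index and field names are illustrative, and the annotated_text mapper ships as the separate mapper-annotated-text plugin):

PUT news
{
  "mappings": {
    "properties": {
      "article_text": { "type": "annotated_text" }
    }
  }
}

PUT news/_doc/1
{
  "article_text": "[Cook](Tim+Cook&CEO) announced the new iphone"
}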

markharwood commented 6 years ago

(Following up on https://github.com/elastic/elasticsearch/pull/30364#issuecomment-411266742 )

It seems like the main use case for the simple format ([the company](Apple+Inc.)) is to provide a resolved entity. Here it makes sense to index the value without a special prefix

There's always potential for annotations (resolved or otherwise) to clash with text tokens eg our own [the company](elastic)

How people search

it’d be most natural to just search for location=Beirut

One thing to note is I expect people don't naturally type these tokens into searches. The query parsers we have assume free-text and would mangle the input. Rather, I see people selecting tokens offered up in structured drop-downs, histograms etc and wanting to use those in queries, highlighting the sections of text that provide the context/evidence of where these locations were mentioned in results. If you have a Kibana bar chart of "top location.keyword values" it's frankly ugly if the bars are labelled location=X and location=Y. The "location" part is already implicit in your choice of structured field for the bar chart.
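
To make that concrete, the kind of query a client might issue after a user picks a value from a drop-down could look roughly like this (index and field names are assumptions, and it presumes a type-less annotation value like Donald%20Trump%20Jr was indexed verbatim, so an exact term query sidesteps the query-parser mangling mentioned above):

GET news/_search
{
  "query": {
    "term": { "article_text": "Donald Trump Jr" }
  },
  "highlight": {
    "fields": { "article_text": {} }
  }
}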

Typed tokens are a broader issue than annotated_text

Speaking more generally, we have work to do in adding more type info to fields. Even with structured data there is nothing that tells a generic tool like Kibana that tokens used in the fromEmail field have currency in the toEmail field and can be used interchangeably when exploring data, drawing graphs etc. We know they are both keyword fields, just like the user-agent field, but have no clue that the type of token they store is emailAddress and that they are compatible. It may be worth branching an issue for this.

Proposal

I suggest we drop the support for typed annotations and go with the type-less syntax in the first cut. Because user values are url-encoded we shouldn't see any "=" characters in text markup meaning we can always safely introduce a typed key=value syntax later once we've figured out how best to handle typed annotations and the idea of token-typing more generally. The challenge of escaping annotation values to avoid clashes with text tokens (or not, if that's desirable) is then a user responsibility. They don't have to second-guess how we might mangle any type info into indexed tokens (they'd need to recreate that scheme when they include the values in a structured field like people that sits alongside the annotated_text field and is used for aggregations).

jtibshirani commented 6 years ago

Rather, I see people selecting tokens offered up in structured drop-downs, histograms etc and wanting to use those in queries...

Got it, I don't think I had the whole workflow straight in my head. To clarify my original comment, I wasn't imagining that the user would input the token location=Beirut directly, but rather that there would be some query processing step that would produce these tokens (maybe the same annotation pipeline is even run over the text of the query).

I suggest we drop the support for typed annotations and go with the type-less syntax in the first cut.

This makes sense to me -- it seems nice to keep the first version simple and focused. Users won't get ideas like I did about adding all other sorts of annotations :)

markharwood commented 6 years ago

maybe the same annotation pipeline is even run over the text of the query

That could be tricky. The analyzer expects annotation values in the [Foo bar](Bar&Baz) format and the characters []()& are already reserved characters in parsers such as the Lucene query syntax. Relying on arcane query syntax/parsers doesn't seem like the way forward to me. Here is an example where the query is expressed by dragging the annotation hyperlink directly out of a document's text into a visual query builder. I think there's more work to be done in visually emphasizing the difference between "thing" and "string" clauses, but the idea of using visual query builders makes more sense to me.

Users won't get ideas

Ideas are great! Feels right to see where they lead us next if we start with just the simple approach.

jpountz commented 6 years ago

I suggest we drop the support for typed annotations and go with the type-less syntax in the first cut.

+1

The challenge of escaping annotation values to avoid clashes with text tokens (or not, if that's desirable) is then a user responsibility.

Agreed, but then let's make the docs give examples with a format that reduces clashes, eg. [the company]{company:Elastic} rather than [the company]{Elastic}.

Because user values are url-encoded we shouldn't see any "=" characters in text markup meaning we can always safely introduce a typed key=value syntax later

That feels like a dangerous assumption to me. Let's reject equals signs explicitly?

markharwood commented 6 years ago

Let's reject equals signs explicitly?

The issue here is we don't tend to reject noisily - the syntax is somewhat permissive in that we don't throw errors if we find [..] and don't have a corresponding (...). Do we decide to get picky about use of [...](x=y) markup?

I guess the options are:

1) Reject annotation with error
2) Reject annotation with no error
3) Ignore any key part of a key=value pair

markharwood commented 6 years ago

Re-think: clients want types

I had a chat with @colings86 and we agreed annotation type is useful to client apps. More generally, structured fields could also use some entity type metadata - hopefully the annotation's idea of token type would align with entity types assigned to structured fields. Knowing an annotation is of type "movieID" would, for example, invite drill-downs on a structured field known to hold entities of the type "movieID".

How to handle types in annotations?

Colin and I assumed that type=value would still be the syntax to use in the annotation markup, e.g. person=Donald+Trump, but the question remains what, if anything, elasticsearch does with this type info. We need to pick one of these possible options:

| Option No. | Indexed value | Indexing notes | Example client searches |
| --- | --- | --- | --- |
| 1 | person=Donald Trump | The type information is ignored by elasticsearch and considered part of the value | new TermQuery("text", "person=Donald Trump")<br>new SpanTermQuery("text", "person=Donald Trump") |
| 2 | Donald Trump | The "person=" type information is stripped by elasticsearch | new TermQuery("text", "Donald Trump")<br>new SpanTermQuery("text", "Donald Trump") |
| 3 | ??? (hidden, subject to change) | Elasticsearch encodes type info with the value but clients must use new TermQuery and SpanTermQuery variants designed to hide the implementation details of how type is encoded with the value in the index | new AnnotatedTermQuery("text", "person", "Donald Trump")<br>new AnnotatedSpanTermQuery("text", "person", "Donald Trump") |

Option 1 is pretty transparent. However, clients would need to remember, when drilling down from structured fields, to prefix the selected values (e.g. Donald Trump) with the field's entity type (e.g. person). Note also that the idea of a field having an "entity type" is not something Kibana knows about currently.

Option 2 is less work for clients (no prefixing of values is required) but suffers in that searches for annotation terms may over-match with text tokens or annotations of different types. Much depends on the global uniqueness of the choice of annotation values. In the case of Wikipedia, article IDs are always unique across types anyway, e.g. Mastodon vs Mastodon (band). Clashes between Wikipedia article IDs and regular indexed text tokens were minimal - only one of the 500k+ person article IDs clashed with indexed plain-text values, and that was a producer called 1.8.7 who clashed with an article on Ruby version 1.8.7.

Option 3 is similar to option 1 but buys some flexibility in the choice of type-encoding strategy at the cost of needing specialised query classes that encapsulate this decision.

Can we pick one of these strategies @jpountz @jimczi @colings86 so we can put this to bed?

markharwood commented 6 years ago

My vote for the above is option 2 - stripping types. The other options use type in the index and rely on a client that holds and understands entity types for each field - I don't think Kibana is going to be that smart in the short term. I propose the docs still offer "untyped" examples of markup and the code stays as-is, stripping any type info from values (which in my view is the least-worst of the validation choices outlined in https://github.com/elastic/elasticsearch/issues/29467#issuecomment-412901293)

If elasticsearch decides to do something funky with types in future (in terms of index-encoding), then we have already reserved the right to do that in a non-BWC way with the use of the experimental marker on this feature.

markharwood commented 6 years ago

Had a chat with @jpountz and we agreed that a type system for annotations should come later (hopefully sharing the notion of entity types from metadata also used in keyword fields).

In the interim we said we should reject as malformed any documents that have annotations using the [foo](key=value) syntax rather than the [foo](value) syntax.

OK with this, @colings86 ?

colings86 commented 6 years ago

Sounds good to me 👍