elastic / kibana


[Obs AI Assistant] Make content from Search connectors fully searchable #175434

Open miltonhultgren opened 7 months ago

miltonhultgren commented 7 months ago

Today, if we ingest a large piece of text into a Knowledge base entry, only the first 512 word pieces are used for creating the embeddings that ELSER uses to match on during semantic search.

This means that if the relevant parts for the query are not at the "start" of this big text, they won't match, even though there may be critical information at the end of the text.

We should attempt to apply chunking to all documents ingested into the Knowledge base so that the recall search has a better chance of finding relevant hits, regardless of their size.

As a stretch goal, it would also be valuable to extract only the relevant chunk (512 word pieces?) from the matched document, in order to send less (and only relevant) text to the LLM.

AC

More resources on chunking: https://github.com/elastic/elasticsearch-labs/tree/main/notebooks/document-chunking

elasticmachine commented 7 months ago

Pinging @elastic/obs-knowledge-team (Team:obs-knowledge)

miltonhultgren commented 7 months ago

If we want to retrieve multiple passages from the same text document, we need to split them before ingesting them and store 1 document per passage. The recommended chunk size for ELSER is 512, but to make the search more coherent it's also recommended to overlap the chunks by 256 tokens.
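
A minimal sketch of such an overlapping chunker (assuming a plain whitespace split as a stand-in for ELSER's word-piece tokenizer, so the token counts are only approximate):

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 256) -> list[str]:
    # Split on whitespace as an approximation; ELSER counts word pieces,
    # so real chunk boundaries would differ slightly.
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks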

dgieselaar commented 7 months ago

If we want to retrieve multiple passages from the same text document, we need to split them before ingesting them and store 1 document per passage.

Do you mean that we can only select a subset of passages if we split them up into separate documents?

miltonhultgren commented 7 months ago

Yes, at least that is my understanding after talking to the AI Search folks.

Assume you have a large document, you create nested fields for each passage, and you create embeddings for each passage. You'll be able to use knn with inner_hits to search across all passages, but it will still give back the whole document (and perhaps some information about which passage caused the match); you can't pull out more than one passage this way (even setting the k value of the knn query higher will just give you more whole-document hits, each with a single passage).

So to get multiple passage hits we need to store multiple documents in ES, which would then let us turn up the k value in our search to possibly find multiple hits from the same original large text. Not sure if semantic_text would change this.
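
A minimal sketch of that ingestion path (index and field names are made up for illustration; chunk_text is the helper sketched earlier, and the ELSER embeddings are assumed to be generated by an ingest pipeline on the target index):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def index_passages(doc_id: str, title: str, text: str) -> None:
    # One Elasticsearch document per passage, each pointing back to the
    # original document, so several passages from the same source can be
    # returned as separate hits and k can simply be turned up.
    actions = (
        {
            "_index": "kb-passages",  # hypothetical index name
            "_source": {
                "parent_id": doc_id,
                "title": title,
                "passage_number": i,
                "text": passage,
            },
        }
        for i, passage in enumerate(chunk_text(text))
    )
    helpers.bulk(es, actions)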

miltonhultgren commented 6 months ago

Do you mean that we can only select a subset of passages if we split them up into separate documents?

@dgieselaar The thing I said above is true for using knn (I've asked if this will change at some point), but if you're using ELSER you cannot use knn (dense vector vs sparse vector), so you need to stick to text_expansion queries, which also support inner_hits but in this case can return more than one hit.

So as long as we use ELSER (or rather some model that produces sparse_vector) for the chunking, we can search across a large document and return X number of passages in that document that matched.

Example query:

GET wiki-dual_semantic*/_search
{
  "query": {
    "nested": {
      "path": "passages",
      "query": {
        "text_expansion": {
          "passages.sparse": {
            "model_id": ".elser_model_2_linux-x86_64",
            "model_text": "Where is the Eiffel Tower?"
          }
        }
      },
      "inner_hits": {
        "_source": false,
        "size": 5,
        "fields": [
          "passages.text"
        ]
      }
    }
  },
  "_source": false,
  "fields": [
    "title"
  ]
}
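
This also serves the earlier stretch goal of sending only the relevant text to the LLM. A minimal sketch of running that query through the Python client and collecting just the matched passages (the exact shape of the fields section in the response is an assumption based on how the fields API groups nested values):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="wiki-dual_semantic*",
    query={
        "nested": {
            "path": "passages",
            "query": {
                "text_expansion": {
                    "passages.sparse": {
                        "model_id": ".elser_model_2_linux-x86_64",
                        "model_text": "Where is the Eiffel Tower?",
                    }
                }
            },
            "inner_hits": {"_source": False, "size": 5, "fields": ["passages.text"]},
        }
    },
    source=False,
    fields=["title"],
)

# Collect only the matched passages, not the whole documents.
passages = []
for hit in response["hits"]["hits"]:
    for inner in hit["inner_hits"]["passages"]["hits"]["hits"]:
        # Nested values come back grouped under the nested path.
        passages.append(inner["fields"]["passages"][0]["text"][0])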

Pseudo query for multi model hybrid search:

GET my-index/_search
{
  "query": {
    "bool": {
      "should": [
        { text_expansion },  // on nested field1, with inner_hits
        { text_expansion },  // on nested field2, with inner_hits
        { match_phrase }     // on nested field3
      ]
    }
  },
  "knn": [
    {
      "field": "image-vector",
      "query_vector": [-5, 9, -12],
      "k": 10,
      "num_candidates": 100
      // with inner_hits
    },
    {
      "field": "image-vector",
      "query_vector": [-5, 9, -12],
      "k": 10,
      "num_candidates": 100
      // with inner_hits
    }
  ],
  "rank": {
    "rrf": {
      "window_size": 50,
      "rank_constant": 20
    }
  }
}
dgieselaar commented 6 months ago

@miltonhultgren that sounds good AFAICT, do you see any concerns?

miltonhultgren commented 6 months ago

KNN supports multiple inner hits in 8.13 🚀

I haven't gotten around to really trying these things out yet. It seems the path is being paved for us here (and semantic_text will only make it easier). A lot of the things I've looked at are out of scope for this issue and can be planned as future iterations.

For this issue I will stick to using ELSER, chunking into a nested object, using a nested query with text_expansion and inner_hits to grab multiple relevant passages.

I have two small concerns for this ticket:

  1. Should we aim to support keyword/hybrid search (using a normal text match BM25 query, with or without RRF)?
  2. I'm not sure I fully understand how to apply the chunking yet, in particular the "512 size, 256 overlap"

Number 1 would be in case, for example, there aren't any embeddings in a search-* index, or there are only dense_vector embeddings; we could still fall back on keyword search and maybe find good matches that way. That could also allow users to use our Knowledge base without ELSER installed. I'm leaning towards deferring that until later though (together with multi-model support), do you agree @dgieselaar ?
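
For reference, a minimal sketch of what that keyword fallback could look like (index and field names are hypothetical):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Plain BM25 match query: needs no ELSER model and no embeddings in
# the index, as a lowest-common-denominator fallback.
response = es.search(
    index="search-*",  # hypothetical connector indices
    query={"match": {"body": "who is batman"}},  # "body" is a placeholder field
    size=5,
)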

I'm going to research number 2 next.

miltonhultgren commented 6 months ago

Sample query combining a nested match query with inner_hits and a knn query with inner_hits, ranked with RRF:

GET wikipedia_*/_search
{
  "size": 5,
  "_source": false,
  "fields": [
    "title",
    "passages.text"
  ], 
  "query": {
    "nested": {
      "path": "passages",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "passages.text": "who is batman"
              }
            }
          ]
        }
      },
      "inner_hits": {
        "name": "query",
        "_source": false,
        "fields": [
          "passages.text"
        ]
      }
    }
  },
  "knn": {
    "inner_hits": {
      "name": "knn",
      "_source": false,
      "fields": [
        "passages.text"
      ]
    },
    "field": "passages.embeddings",
    "k": 5,
    "num_candidates": 100,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-distilroberta-v1",
        "model_text": "who is batman"
      }
    }
  },
  "rank": {
    "rrf": {}
  }
}
miltonhultgren commented 6 months ago

Would it be desired/ideal to perform a single ranked search across text, dense, and sparse vectors, but also across all indices at once, rather than per source (Knowledge base, Search connectors in different indices)? What are the trade-offs of that?

How would one combine that with "API search", meaning searches that hit an API rather than Elasticsearch? Just thinking out loud here for the future.

dgieselaar commented 6 months ago

@miltonhultgren yes it would be preferable (a single search), but we have different privilege models for the knowledge base versus search-* - the former uses the internal user, and the latter uses the current user, so we cannot (at least to my understanding) execute it as a single search request.
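
Since the two searches have to run as separate requests, the results would need to be merged client-side. A minimal sketch of a manual reciprocal rank fusion over the two hit lists (the rank constant of 60 is the commonly cited default, not something decided in this thread):

from collections import defaultdict

def rrf_merge(result_lists: list[list[dict]], rank_constant: int = 60) -> list[dict]:
    # Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank).
    scores: dict[tuple[str, str], float] = defaultdict(float)
    hits_by_key: dict[tuple[str, str], dict] = {}
    for hits in result_lists:
        for rank, hit in enumerate(hits, start=1):
            key = (hit["_index"], hit["_id"])  # _id alone may collide across indices
            scores[key] += 1.0 / (rank_constant + rank)
            hits_by_key[key] = hit
    return [hits_by_key[key] for key in sorted(scores, key=scores.get, reverse=True)]

# merged = rrf_merge([kb_response["hits"]["hits"], connectors_response["hits"]["hits"]])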

miltonhultgren commented 4 months ago

We're waiting for semantic_text to be available, since it will handle chunking for us; at that point this ticket can be rewritten to reflect the work needed to migrate the Knowledge base to semantic_text instead.
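
For reference, a rough sketch of what the migrated mapping might look like (this assumes the semantic_text design as publicly proposed at the time; the index name and inference endpoint id are illustrative, not decided):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# semantic_text is expected to handle chunking and embedding generation
# automatically via the referenced inference endpoint.
es.indices.create(
    index="kb-entries",  # hypothetical index name
    mappings={
        "properties": {
            "text": {
                "type": "semantic_text",
                "inference_id": "my-elser-endpoint",  # hypothetical endpoint id
            }
        }
    },
)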

sorenlouv commented 3 months ago

Update: This is still blocked by semantic_text