[Change Proposal] Support Knowledge Base Packages

elastic / package-spec

EPR package specifications

Other

18 stars 72 forks source link

[Change Proposal] Support Knowledge Base Packages #693

Closed spong closed 2 days ago

spong commented 8 months ago

Summary

This is a proposal much like #346 or #351 for enabling the ability to bundle static data stream data within a package, such that when the package is installed, the data stream is created, and the bundled data is ingested into the data stream.

The specific use case here is for shipping 'Knowledge Base' content for use by the Elastic Assistants. For example, both Security and Observability Assistants are currently bundling our ES|QL docs with the Kibana distribution for each release. We then take this data, optionally chunk it, and then embed/ingest it using ELSER into a 'knowledge base' data stream so the assistants can query it for their ES|QL query generation features. Each release we'll need to update this content, and ship it as part of the Kibana distribution, with no ability to ship intermediate content updates outside of the Kibana release cycle.

Additionally, as mentioned in https://github.com/elastic/package-spec/issues/346#issuecomment-1890186256, this essentially provides us the ability to ship 'Custom GPTs' that can integrate with our assistants, and so opens up a world of possibilities for users to configure and expand the capabilities of the Security and Observability Assistants.

Requirement Details

Configuration

The core requirement here is for the ability to include the following when creating a package:

Any number of data streams to create, though realistically one is probably sufficient
An arbitrary number of documents, perhaps in json format, or zipped as detailed in #346
- This generally won't be large amounts of data as detailed in #346 (our ES|QL docs are 196 documents and ~125KB), however I would expect some users would push this to enable RAG over larger data sets
Some configuration for the destination data stream of the bundled documents. If we include a raw dump of the documents from ES, perhaps we can use just the _index fields to route them accordingly?

Behavior

Upon installation, the package should install the included data streams, then ingest the bundled documents into their destination data stream. This initial data should stick around for as long as the package is installed. If the package is removed, the data stream + initial data should be removed as well. When the package is updated, it would be fine to wipe the data stream/initial data and treat it as a fresh install. Whatever is easiest/most resilient would be fine for the first iteration here. No need to worry about appending new data on upgrade, or dealing with mapping changes, just delete the data streams and re-install/re-ingest the initial data.

The above would be sufficient enough for us to start bundling knowledge base documents in packages, at which point we could install as needed in support of specific assistant features.

spong commented 8 months ago

I know it's only been a week, but would there be any way to expedite the assessment of this change proposal? We're extremely motivated for this effort on the @elastic/security-generative-ai team, and are happy to help provide resources in putting an MVP together, just let us know -- thanks! 🙂

jen-huang commented 8 months ago

@jsoriano @kpollich The rationale behind this type of package seems sound to me. Anything the GenAI team should consider as part of MVP?

jsoriano commented 8 months ago

An exercise we can do is to create a potential example package, and from it see what we would need to do to support this in our packages ecosystem, we may find that we can add this as a normal feature for packages in the package-spec and Fleet, without needing support for big contents or "DLC" packages.

Later if we find that the size of the data set is a blocker, then we would also need https://github.com/elastic/package-spec/issues/346. And we find that for the same package we may want to have additional knowledge bases, then we may need https://github.com/elastic/package-spec/issues/351.

jsoriano commented 8 months ago

@spong could you provide a potential example package for this?

spong commented 8 months ago

Absolutely @jsoriano! I'll get an example package put together today 🙂

spong commented 8 months ago

Got an initial pass up as a PR to the integrations repo here: https://github.com/elastic/integrations/pull/9007. Still some more I need to read through/update, but this includes the data stream and the raw sample documents to be ingested at least.

I see inference ingest pipelines are supported in the spec, which could be nice for performing the embedding on initial ingest (and so enabling different embedding/chunking strategies), however that would add overhead in dealing with the trained_model dependency (is there an ML node, enough memory, correct model installed, etc). Perhaps there's already support for these scenarios since ML modules are a supported asset type?

jsoriano commented 8 months ago

@spong and I met today over the example package in https://github.com/elastic/integrations/pull/9007, and we have a proposal to move forward:

Package Spec:
- We will add a new folder to the root of integration packages, called knowledge_base.
- This knowledge_base directory can contain multiple knowledge bases.
- Each knowledge base will be defined by a set of documents, and an optional custom mapping for them. The documents can be ready to be ingested, or in a raw form, that needs to be "built", more on this later.
- This new feature will be experimental, intended to iterate.
Fleet:
- When installing the package, it will create a new index or data stream, with the provided documents for each knowledge base. We need to define the convention for the names of these documents.
- Fleet will have a base component template with mappings for these indexes. Each knowledge base can provide additional mappings.
- Each knowledge base is tied to a model (elser_1, elser_2...), ~~Fleet should be able to discover and install only the compatible knowledge bases with the models available in the deployment.~~ Fleet will just install them and assistants will take care of discovering the ones that work with the models they have.
- Permissions should be limited as much as possible so nothing can write to these indexes.
- Fleet will somehow let the assistant know about the knowledge bases installed, explicitly via callbacks, or by convention by known index patterns.
Elastic-Package:
- At some point it would be nice to make it capable to build knowledge bases. For this, it will require a stack with ML capabilities. It will install a pipeline with inference capabilities and will ingest the raw documents with them. It will then export the resulting knowledge base ready to be installed. The input for this process could be included in knowledge_base/_dev.
Use cases:
- Standalone knowledge bases. Installed just to increase the capabilities of the model, for example to teach it ES|QL. These will be distributed in integration packages with knowledge bases but without data streams, similar to some packages we have now with ML models.
- Knowledge bases related to integrations. Installed to teach the model about specific solutions or services. For example to teach the model about apache when installing the apache integration. These will be distributed in existing integration packages, adding the knowledge bases apart of the existing integrations.

spong commented 8 months ago

That looks great @jsoriano! Thanks for meeting with me and distilling the above proposal.

Just want to make a couple notes/clarifications:

Each knowledge base is tied to a model (elser_1, elser_2...), Fleet should be able to discover and install only the compatible knowledge bases with the models available in the deployment.

While true, I'm not sure we need this validation/gating within fleet itself. Models can be uninstalled, upgraded, or re-installed after a KB package has been installed, so the assistants will already need to handle any missing or mis-matched model scenarios when it checks for available/compatible knowledge bases.

Fleet will somehow let the assistant know about the knowledge bases installed.

I don't think a callback on install is needed at this time since assistants will need to query for 'compatible' knowledge bases on their own (as above), but if it works with the existing registerExternalCallback interface then all the better 🙂

jsoriano commented 8 months ago

Each knowledge base is tied to a model (elser_1, elser_2...), Fleet should be able to discover and install only the compatible knowledge bases with the models available in the deployment.

While true, I'm not sure we need this validation/gating within fleet itself. Models can be uninstalled, upgraded, or re-installed after a KB package has been installed, so the assistants will already need to handle any missing or mis-matched model scenarios when it checks for available/compatible knowledge bases.

Ok, I guess this makes things easier for Fleet :slightly_smiling_face: It just installs the knowledge bases and assistants get the ones they can use.

Fleet will somehow let the assistant know about the knowledge bases installed.

I don't think a callback on install is needed at this time since assistants will need to query for 'compatible' knowledge bases on their own (as above), but if it works with the existing registerExternalCallback interface then all the better 🙂

So knowledge bases would be discovered by convention on some index pattern? I am ok with that, in any case this is something we need to think about.

spong commented 8 months ago

So knowledge bases would be discovered by convention on some index pattern? I am ok with that, in any case this is something we need to think about.

Yeah, I'm thinking for this first pass the assistants can do self discovery based on index naming convention, or hitting the fleet API for packages with the tag "Knowledge Base". Then either read further metadata from the fleet manifest like we discussed, or later push that metadata/descriptor state to a document in the knowledge base itself.

spong commented 7 months ago

Slight update here. Didn't have much bandwidth this past week, but I was able to put together a pretty rough POC following the above proposal.

If you're okay with it, I'm happy to round out this POC and push the fleet, package-spec, integrations and elastic-package changes as draft PR's for feedback/collaboration and then we can go from there? If you prefer to manage any of these changes though let me know!

Quick demo video below. User installs package assets, fleet code sets up index/ingests KB docs, and then functionality immediately becomes available within the Assistant

https://github.com/elastic/package-spec/assets/2946766/16769133-257f-45fb-8e52-7b4cc4493443

jsoriano commented 7 months ago

@spong wow, this looks great! Yes, please open PRs and we can continue the discussion there. Thanks!

pgayvallet commented 3 weeks ago

With all the work around the "Unified AI Assistant", and the corresponding initiative for the unification of the knowledge base, the responsibility of maintaining and distributing knowledge base "bits" is somewhat moving to the newly formed @elastic/appex-ai-infra team.

I think @spong did a great job here with his requierements. The "revisited" version from our side is very similar, but just for clarity, I will write it down:

Context

For the Kibana AI assistant(s), what we call "Knowledge base" (or "KB") is, to simplify, a set of sources the assistant can use to retrieve documents related to their current context. For example if the user asks the assistant a question about Kibana's configuration settings, the assistant can search and retrieve from its knowledge base articles/documents in relation with this question / context.

What we want to do

We want to be able to ship KB sources as packages (more specifically, index sources, as there can be different kind of KB sources, but I won't elaborate on that point given that only index sources are relevant here).

A KB source is composed of:

an index or data stream (and its mapping definition / index settings)
- indices would be sufficient for our usecase, so if data stream support is more complex (e.g. for uninstall), we can ditch them from the requierement
an arbitrary number of documents present in / bound to this index or data stream.
(indirectly, but coupled to) a model/inference endpoint ID that will be referenced by the semantic_search type of fields in the mapping (see next section for that point)

Installation behavior is straightforward:

create the index/datastream
then ingest/index the associated document to the index

Uninstall behavior is too:

delete all the assets that were installed by this package

For updates, we would simply follow a uninstall old =>install new workflow (it is ok to purge the indices)

Additional questions

System indices

For KB sources, we're (ideally) planning on using system indices. Would that be an issue with the way package installation works? Which user is being used under the hood, the Kibana internal user?

semantic_text referencing model ids

Just one technical detail worth mentioning - knowledge base ("KB") retrieval is based on semantic search. In practice, it means there is a strong coupling between the index (mapping) and a specific model / inference endpoint.

Which raises the question of how to manage that coupling for packages. We are planning on pre-installing the model that will be used for all our KB sources, but I think we still need a way for the package installer to "check" some install condition - e.g. only allow installation if the model with this specific ID is present, or something similar (as semantic search requires an ML node, meaning that not all cluster will be able to support it - and we should not allow to install the package in that case). I have no idea if that kind of programmatical or scripted checks are possible today, but we will likely need to find a solution

indexing documents in indices not "maintained" by the package

For our specific needs, we would ideally be able to create a document in another index (our KB source listing index) during package installation, to flag the source as being available. It means that during uninstall, we would need to delete this specific document from the said index, without purging it (as it wasn't installed by the package).

That's not a strict requirement though, we should be able to work around if we don't have that.

Which approach should we take

from https://github.com/elastic/package-spec/issues/693#issuecomment-1919620422:

We will add a new folder to the root of integration packages, called knowledge_base

That's only my 2cps, I'm not sure to agree in the "specialized" approach that was discussed in that issue.

I really feel like the right approach would be to be as generic as possible and "simply" make the spec evolve to be able to add ES documents bound to an index to packages. Not to create "content only" packages (https://github.com/elastic/package-spec/issues/351), or to do something absolutely specific such as this knowledge_base folder that was discussed in that issue. "Just" to add ES documents as a supported type for packages.

Now, if the generic approach is significantly more work, I would be very fine with something more specific to our exact need here. I just feel like having content in package could be something that could benefit more than just this exact use case?

spong commented 3 weeks ago

That's a good summary of where we're at and what's needed here, thanks @pgayvallet!

I really feel like the right approach would be to be as generic as possible and "simply" make the spec evolve to be able to add ES documents bound to an index to packages. Not to create "content only" packages (https://github.com/elastic/package-spec/issues/351), or to do something absolutely specific such as this knowledge_base folder that was discussed in that issue. "Just" to add ES documents as a supported type for packages.

I'm also in agreement for going with a more generalized solution. I started with that thought by trying to work with the existing 'sample data' issue, but ended up being directed to a more specialized initial implementation. So if we can make this work with the Content Packages RFC (internal), or something else more generic, all the better.

At the end of the day these packages are just data, with a pre-install requirement for a model/inference endpoint ID (though technically not if the data is already embedded and we're able to target the default deployed model). We don't even need an ingest pipeline anymore either with semantic_text. So an MVP is pretty straightforward. That said, I think there are interesting questions to explore around managing chunking strategies, including 'serialized tools' as assets, and so forth, but wouldn't let those get in the way of delivering on a clean MVP so we can make progress and start getting feedback.

jsoriano commented 3 weeks ago

I really feel like the right approach would be to be as generic as possible and "simply" make the spec evolve to be able to add ES documents bound to an index to packages.

The generic approach sounds good to me, but we still need a way to indicate Fleet to run the pre-install steps, that may be different depending on the type of data. So maybe the approach could be something like having index_data/{name} directories, each one with the data to ingest, the field mappings definitions and some metadata file indicating the type of data, if it should use ES indexes or data streams and so on.

Not to create "content only" packages (#351), or to do something absolutely specific such as this knowledge_base folder that was discussed in that issue.

"content only" packages are not specific to this use case, they are useful in other use cases. I think data distribution will fit better in this kind of package than in "integration" or "input" packages.

pgayvallet commented 1 week ago

I started taking a look at the package-spec and integration repositories, and given what the spec currently supports, I'm more and more leaning toward doing something specific for knowledge base, as @jsoriano proposed initially, rather than something fully generic to allow indexing and removing documents from any arbitrary index, as I suggested in my previous reply. I doubt we will really be able to do something generic to suit everybody's need in term of adding arbitrary documents, so it's probably better to stay humble and focus on our specific need here.

I have a few questions:

1. Storage format for knowledge base documents in the package

I see that Kibana entities (such as dashboards) are stored each in their own individual file, following a kibana/{entityType}/{id}.json filepath pattern.

For KB, we will have large amounts of documents (100's to 1000's) per KB "source", so I'm not sure what the best option would be here:

doing the same than for Kibana entities, and follow a "1 file per document" approach, or
having a single file containing all the documents for a given index (=kb source) instead, either in a ndjson or json array format.

One file per document would result in very large folder contents, but we're still far below the volume where it becomes a problem. One single file containing everything is ihmo more elegant, but then it may lead to other issues (parsing/loading the whole file in memory during installation could be problematic).

I know packages are zipped in the registry, but I'm starting to wonder if using an internal archive for such large amount of documents wouldn't be a smart move. Compressed format have an index, allowing to load entries individually, which would get rid of the memory problems. The downside is that it fully kills diffs by introducing data in a binary format within the package's sources...

So yeah, really not sure what the best approach would be here, insights or opinions are very welcome.

2. Spec changes structure

I see we now have a spec/content folder, with the content type spec relying massively on references toward the integration type spec. Do we assume knowledge bases will ever only be used by content packages (so should I directly add what I need under spec/content, or should I instead allow integration packages to also support the feature (and do as it's done for kibana atm, with spec/content/kibana referencing spec/integration/kibana? Any preferences?

3. Package size

I did a quick test, and the KB source for Kibana 8.15 documentation is around 600 documents, for a total of 45mb (uncompressed) and 12mb (compressed) - yeah, embeddings take a lot of space. And Kibana is one of the smallest source (ES is twice that, security almost 10 times that size - in terms of number of documents at least).

So the question is simple: are we fine adding such large packages to the integrations repository? If not, what would be our alternatives?

jsoriano commented 1 week ago

Storage format for knowledge base documents in the package

We can have a mix of both, with a directory with files, each file potentially containing many documents in ndjson format, and all documents in all files are ingested. For simple use cases a single file will be enough, for more complex use cases the multiple files might help to organize the data and help with maintenance. Multiple files can also be useful to workaround size limits in repositories.

Regarding memory usage, it doesn't have to be necessarily an issue, as the file could be downloaded to disk and the files streamed to ES as needed, avoiding to have the package or the data in memory. This needs some work in Kibana/Fleet though, but I think this is an effort we should to do in any case to optimize usage of resources.

I wouldn't go in any case with the approach of a single document per file, I don't see any advantage on this. We don't need to follow the approach used for kibana assets.

Spec changes structure

Don't worry too much about this. I would start by adding this only to content packages, and if needed in other packages in the future we can reorganize the files. We also plan to use content packages to test a new installation code path for big packages.

Package size

We have options here depending on how these packages are going to be managed. For example if they have big files, but they don't change a lot, I think they are fine in the integrations repository. If they have really big files, we might try Git LFS.

are we fine adding such large packages to the integrations repository? If not, what would be our alternatives?

Our tooling and infra supports having packages on different repositories, this is something we have been doing mainly for organisational reasons. If we feel that these packages are going to have special needings, we could have a different repo for them, or even one repository per package.

pgayvallet commented 1 week ago

I opened https://github.com/elastic/package-spec/pull/807 with my spec update proposal