[KB] Add documentation packaging workflow

pgayvallet commented 1 week ago

For https://github.com/elastic/kibana/issues/192031, we need a CI task or workflow that would:

- Retrieve the subset of documentation articles we are interested in from the innovation team's cluster
- Generate embeddings for them (unless we decide to re-use the innovation team's embeddings)
- Re-export the documents with their embeddings
- Build the Fleet package containing those documents and the corresponding index creation instructions

Embedding generation could be done by indexing the documents into a cluster where the fields we want embeddings for are mapped as semantic_text, waiting for the embedding generation to complete, and then re-exporting the documents for the next steps.
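To make the semantic_text idea concrete, here is a minimal sketch of how the index mappings could be assembled. The field names (`content`, `title`, `url`, `version`) and the inference endpoint ID `kb-docs-elser` are assumptions for illustration only; nothing in this thread fixes them, and `buildMappings` is a hypothetical helper, not something from the POC.

```typescript
// Hypothetical inference endpoint ID; the thread does not name one.
const INFERENCE_ID = 'kb-docs-elser';

type FieldMapping =
  | { type: 'keyword' }
  | { type: 'semantic_text'; inference_id: string };

interface IndexMappings {
  properties: Record<string, FieldMapping>;
}

// Build index mappings where the listed fields are declared as semantic_text,
// so that embeddings are generated at index time, and the remaining metadata
// fields are plain keywords.
function buildMappings(
  semanticFields: string[],
  keywordFields: string[],
  inferenceId: string = INFERENCE_ID
): IndexMappings {
  const properties: Record<string, FieldMapping> = {};
  for (const field of keywordFields) {
    properties[field] = { type: 'keyword' };
  }
  for (const field of semanticFields) {
    properties[field] = { type: 'semantic_text', inference_id: inferenceId };
  }
  return { properties };
}

// Example (assumed field names): embeddings only for the article body.
const kbMappings = buildMappings(['content'], ['title', 'url', 'version']);
```

The resulting object could be passed as the `mappings` of an index creation request; which fields actually deserve embeddings is still an open question here.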

The last step is the one that is unclear to me: I'm not sure at the moment how exactly Fleet packages are built and added to the package registry / images.

pgayvallet commented 3 days ago

I created a POC (https://github.com/elastic/kibana/pull/193847) to show what the documentation extraction script would be in charge of doing.

What the script does:

- Connect to the source cluster containing the documentation and extract the subset that we are interested in
- Set up an index with the right mappings on a local cluster and index the documentation there, generating the embeddings
- Store the documents with their embeddings on disk, in JSON format
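As a rough illustration of the first and last steps (the indexing step in between needs a live cluster, so it is left out), the extraction filter and the on-disk serialization could look like the sketch below. The `DocArticle` shape, the `productName` / `version` fields, and the choice of newline-delimited JSON are all assumptions; the POC may structure its documents and output differently.

```typescript
// Assumed shape of a documentation article; not taken from the POC.
interface DocArticle {
  title: string;
  url: string;
  productName: string;
  version: string;
  content: string;
}

// Keep only the articles for one product and version,
// e.g. the Kibana 8.15 documentation.
function selectSubset(
  docs: DocArticle[],
  productName: string,
  version: string
): DocArticle[] {
  return docs.filter(
    (doc) => doc.productName === productName && doc.version === version
  );
}

// Serialize documents as newline-delimited JSON, one document per line.
// The caller decides where the string is written (file, zip entry, ...).
function toNdjson(docs: object[]): string {
  if (docs.length === 0) {
    return '';
  }
  return docs.map((doc) => JSON.stringify(doc)).join('\n') + '\n';
}

// Example with made-up articles.
const sampleArticles: DocArticle[] = [
  { title: 'A', url: '/a', productName: 'kibana', version: '8.15', content: 'a' },
  { title: 'B', url: '/b', productName: 'kibana', version: '8.14', content: 'b' },
];
const kibana815 = toNdjson(selectSubset(sampleArticles, 'kibana', '8.15'));
```

One document per line keeps the export streamable, which may matter once the embeddings make each document large.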

I tried it with the Kibana 8.15 documentation, which is ~600 files, and the zipped output is around 12 MB. I'd say that most of that comes from the embeddings.

I also tested semantic-search-based documentation retrieval, which seems to be doing okay, e.g.:

search term: 'How to enable TLS for Kibana?'

top 3 results:
- Encrypt TLS communications in Kibana | Kibana Guide [8.15] | Elastic
- Security production considerations | Kibana Guide [8.15] | Elastic
- Mutual TLS authentication between Kibana and Elasticsearch | Kibana Guide [8.15] | Elastic

See the performSemanticSearch function of the PR for details.
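For readers without the PR at hand, a search like the one above can be expressed with the Elasticsearch `semantic` query against a semantic_text field. The sketch below only builds the request object; the index name `kb-docs-kibana-8.15`, the field name `content`, and the `buildSemanticSearch` helper are assumptions, and this is not necessarily what performSemanticSearch does.

```typescript
interface SemanticSearchRequest {
  index: string;
  size: number;
  query: { semantic: { field: string; query: string } };
}

// Build a search request that runs a `semantic` query against a
// semantic_text field and returns the top `size` hits.
function buildSemanticSearch(
  index: string,
  field: string,
  searchTerm: string,
  size: number = 3
): SemanticSearchRequest {
  return {
    index,
    size,
    query: { semantic: { field, query: searchTerm } },
  };
}

// Example: the TLS question above, asking for the top 3 results.
const tlsSearchRequest = buildSemanticSearch(
  'kb-docs-kibana-8.15', // assumed index name
  'content', // assumed semantic_text field
  'How to enable TLS for Kibana?'
);
```

Keeping the request builder separate from the client call makes it easy to check the query shape without a running cluster.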

pgayvallet commented 3 days ago

I think we will need to make progress on https://github.com/elastic/kibana/issues/193849 before going further on the current issue, as we need more clarity on the exact format of our "KB packages" and their documents.

[KB] Add documentation packaging workflow #193473