elastic / kibana

[ Automatic Import ] Support reading samples from an API endpoint #193514

Open bhapas opened 1 week ago

bhapas commented 1 week ago

Summary

Currently, Automatic Import relies on reading samples from a file uploaded by the end user.

This issue focuses on adding the ability to read samples from a customer-provided API endpoint instead. The feature can serve as an alternative to uploading sample files.

This may also give us the opportunity to read as many samples as we need, helping us avoid mishandling large amounts of sample data.
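
As a purely illustrative sketch of that idea (the endpoint shape, auth, and pagination are all open questions here, and `fetchSamples` is a hypothetical name), pulling a bounded number of samples might look roughly like:

```typescript
// Hypothetical sketch only - not existing Automatic Import code.
// Pull a bounded number of sample events from a user-supplied endpoint
// instead of a file upload.
async function fetchSamples(endpoint: string, maxSamples = 100): Promise<unknown[]> {
  const res = await fetch(endpoint, { headers: { accept: 'application/json' } });
  if (!res.ok) {
    throw new Error(`Sample fetch failed with status ${res.status}`);
  }
  const data = await res.json();
  // Cap how many samples we keep so a large response is not mishandled.
  const events = Array.isArray(data) ? data : [data];
  return events.slice(0, maxSamples);
}
```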

This is currently a discuss issue and can be broken down further with sub-issues.

elasticmachine commented 1 week ago

Pinging @elastic/security-scalability (Team:Security-Scalability)

jamiehynds commented 5 days ago

During UX discussions with @kgeller and @einatjacoby, we brainstormed the workflow to auto-generate the CEL program to ingest events via an API, based on a user-provided OpenAPI spec file.

We originally explored the idea of following the existing Auto Import flow, but adding an additional step after pipeline creation to configure the CEL input. However, we've already seen requests to ingest sample data via an API rather than uploading a sample log file. This is a good opportunity to get the UX right for this flow, as CEL will be part of it.

Totally open to ideas, but one possible flow (roughly sketched in code after the list):

  1. Select the CEL input from the existing Auto Import UI.
  2. Option to upload an OpenAPI spec file and use an LLM to build the CEL program (should users have the ability to select which LLM to use here?)
  3. CEL program is built and users can ingest data to be used as sample data. Should we limit this based on number of events or a timeframe? Can a user validate connection to the API during this step?
  4. We present analysis of the sample data - is there enough sample data and are there enough varying types of events?
  5. User then follows the remaining Automatic Import steps; the final integration will include the CEL program to start ingesting data right away.
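
A rough TypeScript sketch of the state such a wizard might track, plus a naive stand-in for the analysis in step 4 (all names are illustrative, not existing Auto Import code):

```typescript
// Illustrative only - none of these names exist in Automatic Import today.
interface CelFlowState {
  connectorId: string;   // LLM connector chosen at the start of the Auto Import flow
  openApiSpec?: string;  // optional uploaded OpenAPI spec (step 2)
  celProgram?: string;   // program generated by the LLM (step 3)
  samples: unknown[];    // events collected with the program (step 3)
}

// Step 4: a naive stand-in for the sample analysis - is there enough data,
// and are there enough distinct event shapes? The real heuristics (event count
// limits, timeframe limits, connection validation) are still to be decided.
function analyzeSamples(samples: unknown[], minCount = 50) {
  const shapes = new Set(
    samples.map((s) => Object.keys((s ?? {}) as Record<string, unknown>).sort().join(','))
  );
  return { enoughSamples: samples.length >= minCount, distinctEventShapes: shapes.size };
}
```
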
bhapas commented 5 days ago

@jamiehynds

Some follow-up questions on the workflow:

Select the CEL input from existing Auto Import UI.

It is possible to multi-select the inputs here. Do we want to do this workflow if CEL is one of the inputs selected?

Option to upload an OpenAPI spec file and use an LLM to build the CEL program (should users have the ability to select which LLM to use here?)

The LLM connector will be selected in the previous step. Maybe we can reuse that connector, unless the user wants to use a different connector for the CEL program?

CEL program is built and users can ingest data to be used as sample data. Should we limit this based on number of events or a timeframe? Can a user validate connection to the API during this step?

This is something that needs to be thought through. At this point there is neither an integration nor an input configuration defined, so how is the data ingested using just the CEL program? We don't have access to mito in Kibana either.

jamiehynds commented 5 days ago

It is possible to multi-select the inputs here. Do we want to do this workflow if CEL is one of the inputs selected?

Yes, but ensure it's an optional step. There may be cases where a user can't locate an OpenAPI spec and therefore can't rely on an LLM to build the CEL program.

The LLM connector will be selected in the previous step, May be we can reuse the connector , unless user wants to use a different connector for CEL program?

Yes, I was thinking we'd use the connector selected at the start of the Auto Import flow by default, but provide the option to use another LLM. I'm on the fence as to whether this makes sense - probably an unknown until we understand whether some models are better suited to mapping data while others excel at CEL. WDYT? If it doesn't make sense, we just use the connector selected at the start of the process and don't provide an option to select an alternative for CEL.

This is something that needs to be thought through. At this point there is neither an integration nor an input configuration defined, so how is the data ingested using just the CEL program? We don't have access to mito in Kibana either.

We'll rely on some Engineering discussions to figure this one out. We obviously want to avoid a situation where AI builds a CEL program but it's pot luck as to whether it works or not.

bhapas commented 4 days ago

Yes, I was thinking use the connector selected at the start of the Auto Import flow by default, but provide the option to use another LLM. I'm on the fence as to whether this makes sense or not - probably an unknown until we understand whether some models are more suited to mappings data and another excels with CEL. WDYT? If it doesn't make sense, we just use the connector selected at the start of the process, and not provide an option to select an alternative for CEL.

@kgeller Did you observe any differences in how different LLMs generate the CEL program? In any case, it would be straightforward to reuse the connector selected at the start unless we find a definite difference in performance.

Yes, but ensure it's an optional step. There may be cases where a user can't locate an OpenAPI spec and therefore can't rely on an LLM to build the CEL program.

Ok. Does this mean we leave the program part empty and let the user fill in the CEL program when deploying the integration to the agent?

We don't have access to mito in Kibana as well. Will rely on some Engineering discussions to figure this one out. We obviously want to avoid a situation where AI builds a CEL program but it's pot luck as to whether it works or not.

I know there were discussions about having a WebAssembly version of mito. @efd6 or @andrewkroh might have a better opinion here. Otherwise we should build an inbuilt test runner for running CEL programs and verify whether the events can be separated out.

andrewkroh commented 4 days ago

I know there were discussions about having a WebAssembly version of mito. @efd6 or @andrewkroh might have a better opinion here. Otherwise we should build an inbuilt test runner for running CEL programs and verify whether the events can be separated out.

There are limitations to what standard WebAssembly programs can do when run server-side in Node.js. The WASM code runs in a sandbox and is fully isolated, meaning it cannot make web requests. However, WebAssembly would be suitable for tasks that can run in isolation, such as compiling, formatting, or otherwise validating a CEL program.

If we need to execute the CEL program and have it make API requests, then we'll require a different solution.
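
For illustration, a minimal sketch of running such a module server-side in Node.js; the cel_tools.wasm artifact and its exports are assumptions, since no WASM build of mito/celfmt exists today:

```typescript
import { readFileSync } from 'node:fs';

// Hypothetical artifact: a WASM build of celfmt / CEL validation.
const wasmBytes = readFileSync('cel_tools.wasm');
const wasmModule = new WebAssembly.Module(wasmBytes);

// The module can only call the host functions we hand it here; providing no
// networking imports keeps it fully isolated, so it cannot make web requests.
const instance = new WebAssembly.Instance(wasmModule, { env: {} });

// Exported function names would be assumptions, and passing strings in/out
// requires glue code around the module's linear memory, e.g. roughly:
// const formatted = (instance.exports.celfmt as CallableFunction)(ptr, len);
console.log(Object.keys(instance.exports)); // list whatever the module exposes
```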

kgeller commented 4 days ago

Yes, I was thinking use the connector selected at the start of the Auto Import flow by default, but provide the option to use another LLM. I'm on the fence as to whether this makes sense or not - probably an unknown until we understand whether some models are more suited to mappings data and another excels with CEL. WDYT? If it doesn't make sense, we just use the connector selected at the start of the process, and not provide an option to select an alternative for CEL.

@kgeller Did you observe any difference in the way different LLMs output differently for CEL program? In any case , it would be straight forward to reuse the connector selected at the start unless we find a definite difference in the performance.

I have only tested with Sonnet and Opus so far, but I have observed differences in output between the two. Sonnet has been much more consistent in its output and has better accuracy with the smaller details (e.g. including a '?' before the query parameters when building the request URL). I did my best to write the prompt for Opus to close those gaps, but overall I would say they can produce different results.

I don't think it is that difficult to allow users to select a different model prior to running CEL, since I have already split out the flows. It's just a matter of determining the UX for that (screenshot attached).

Yes, but ensure it's an optional step. There may be cases where a user can't locate an OpenAPI spec and therefore can't rely on an LLM to build the CEL program.

Ok. Does this mean we leave the program part empty and let user fill in the CEL program when deploying the integration to the agent?

Yes, that's what we discussed. The user wouldn't upload the file, and we would keep today's existing behavior of including an empty custom CEL input config in the final output.

bhapas commented 3 days ago

If we need to execute the CEL program and have it make API requests, then we'll require a different solution.

@andrewkroh Maybe a combination of both here. We would want the CEL program to be compiled and formatted, and then to make API requests to pull samples / ingest them into a temp index, which would later be picked up by Auto Import to build the integration's ingest pipeline.

I'm not sure anything in Kibana makes an external API call like this today; we need to look at the implications around that.
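
A rough sketch of that combination, under heavy assumptions: executeCelProgram is hypothetical (nothing in Kibana can run a CEL program today, which is the open question above), and the temp index name is illustrative:

```typescript
import { Client } from '@elastic/elasticsearch';

// Hypothetical helper - how and where the CEL program actually runs
// (WASM, an agent action, something else) is exactly what is undecided here.
declare function executeCelProgram(program: string, endpoint: string): Promise<object[]>;

async function collectSamplesToTempIndex(program: string, endpoint: string, client: Client) {
  // One-shot collection from the customer's API endpoint.
  const events = await executeCelProgram(program, endpoint);

  // Illustrative index name; Auto Import would later read these documents back
  // as samples when building the integration's ingest pipeline.
  const tempIndex = 'auto-import-samples-temp';
  const operations = events.flatMap((doc) => [{ index: { _index: tempIndex } }, doc]);
  await client.bulk({ operations, refresh: true });

  return tempIndex;
}
```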

andrewkroh commented 3 days ago

May be a combination of both here

I fully agree. Another benefit beyond validating the LLM-generated CEL is that, if we add WASM to run celfmt, it could also be reused by the Custom CEL input integration to provide validation as the user writes their program.

For executing the collection from an API, we need to list out the possible ways this can be implemented and evaluate their feasibility. For example, one option might be to send an action down to an agent to perform a one-time execution and return the data.