Hmmm, I hope I'm understanding the use case correctly but if I am, I wonder if the recently-merged input groups change to the package spec might be of use here. The PR that implemented it (and related discussion) is https://github.com/elastic/package-spec/pull/137 but you can also quickly see the idea in action in this sample package: https://github.com/elastic/package-spec/blob/714ec42dce47c3dafceba677a48bed6d05c175dd/test/packages/input_groups/manifest.yml#L37-L39. Note that this change hasn't been fully rolled out yet; the rollout is being tracked in https://github.com/elastic/package-spec/issues/144.
Maybe. One thing I'm proposing is that the apache pipeline stays in the apache package (no duplicate in the meta package). In the REST API meta package, the user chooses "apache.access" from a list, the data is then sent to that data stream, and so the apache pipeline runs on that data. There is also no duplication of dashboards. I don't understand the "input group" change well enough to tell whether it gives us that behavior.
I think the two efforts are not related. If I understand @leehinman correctly, the above request applies mainly to "generic" inputs where multiple data sources come in at the same time. I think this is also related to the discussion around specifying the same input multiple times. So instead of having to set the dataset manually, the UI should have a drop-down to select a dataset that already exists?
Yes this is definitely related to adding the same input multiple times and I think #110 will address that part.
The other part is how we allow users to pick the input type without a large amount of duplication. For example, Apache access logs could be coming from the log, httpjson, kafka, or syslog input. We could add each input to every package, but then you get an interface that looks like elastic/integrations#545. For that screenshot, the logic to pull the data from the REST API and populate the message field was duplicated across all 4 packages. It also means that the user has to enter the information to connect to the REST API in each package, which is a lot of duplication and a pain when the password needs to be updated.
I'm hoping we can come up with a solution for inputs like httpjson, kafka, syslog & Windows Event Logs where multiple types of data can live in the data store that the input accesses. From a configuration standpoint it would be nice to configure the basic connection information for the input once; for the REST API that might be hostname, port, username & password. Then, for each kind of data, we would have some way of getting just the data we want from the data store: for the REST API that would be a search, for kafka a topic, etc. And then each kind of data should be mapped to a `data_stream.dataset`. If the `data_stream.dataset` is normally set up (pipeline/dashboards/fields) by another package, we need to track that dependency. The reason for sending to a known `data_stream.dataset` is so we don't have to duplicate the dashboards & pipelines.
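To make that concrete, here is a rough sketch of what the resulting Elastic Agent policy could look like for an httpjson input feeding two datasets. The connection keys and `search` option are illustrative only (they are not the real httpjson configuration names); the point is that the connection settings appear once while each stream selects its own data and routes it to an existing `data_stream.dataset`:

```yaml
inputs:
  - type: httpjson
    # Connection details entered once; key names here are illustrative,
    # not the real httpjson options.
    host: "https://rest.example.com"
    username: "svc-reader"
    password: "${REST_API_PASSWORD}"
    streams:
      # Each stream pulls one kind of data from the same store and routes it
      # to a dataset owned by an existing package, so that package's
      # pipeline and dashboards apply without being duplicated here.
      - data_stream:
          dataset: apache.access
          type: logs
        search: "source:apache AND type:access"
      - data_stream:
          dataset: aws.cloudtrail
          type: logs
        search: "source:aws AND type:cloudtrail"
```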
@sorantis Can you chime in here?
@mukeshelastic any chance you could comment on how you think we should handle packaging for third-party REST API, kafka, syslog, etc. ?
@leehinman the input group provides the ability to combine related data streams together (as can be seen from the many examples in the granularity doc). This way the integration developer can combine all log-related data streams in one group called Logs (or multiple groups, should there be a need to separate Operational Logs from Security Logs).
Following the proposed structure for integration packages, all these different inputs can either be combined under an input group or each represent an integration policy template.
There's an example of this new structure based on the AWS package. In this example a data stream is assigned explicitly to an input.
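For readers who haven't opened the linked examples, the input-group idea roughly amounts to grouping data streams under policy templates in the package `manifest.yml`. The snippet below only approximates the shape used in the input_groups test package and the AWS example (exact keys may differ; the linked files are authoritative):

```yaml
# manifest.yml excerpt (approximation, not the exact spec)
policy_templates:
  - name: logs
    title: AWS logs
    description: Collect log data streams from AWS
    data_streams:
      - cloudtrail
      - elb
  - name: metrics
    title: AWS metrics
    description: Collect metric data streams from AWS
    data_streams:
      - ec2
      - s3
```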
cc @ycombinator @mtojek @kaiyan-sheng
For REST APIs we would really like to have a single meta package where we can define an input multiple times and, in each input, redirect the data to a different `data_stream.dataset`. For example, if the REST API contained both Apache Access and AWS CloudTrail data, two inputs would be defined. In addition to the variables needed to connect to the REST API and collect the data, each input would also have an option to select the `data_stream.dataset` from a list of available `data_stream.dataset`s. This would allow you to send the Apache Access data from the REST API to the `apache.access` dataset and take advantage of the ingest node processing available. It would also allow you to take advantage of the Apache dashboards.

This new field would signal to Kibana to display a list of available `data_stream.dataset`s for the user to pick from. Also, if a dataset is selected, Kibana would need to "install" that package so that the ingest pipelines and dashboards are available.

This could be useful for kafka and syslog as well.
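To illustrate what that could look like in a REST API meta package, here is a purely hypothetical manifest fragment; the `dataset_select` variable type does not exist in the package spec today and is only meant to show where such a signal to Kibana could live:

```yaml
# Hypothetical sketch only.
policy_templates:
  - name: rest_api
    inputs:
      - type: httpjson
        vars:
          # Connection information entered once for the whole input.
          - name: hostname
            type: text
          - name: port
            type: integer
          - name: username
            type: text
          - name: password
            type: password
        streams:
          # One entry per kind of data behind the REST API.
          - vars:
              - name: search
                type: text
              - name: dataset
                type: dataset_select   # would tell Kibana to offer a list of known
                                       # data_stream.dataset values and to install
                                       # the owning package when one is selected
```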