elastic / beats

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Setting host.* in Beats that forward data #13920

Open andrewkroh opened 5 years ago

andrewkroh commented 5 years ago

There are several use cases in Beats where the data reported by a Beat did not originate on the host running that Beat. Some examples are syslog, Windows forwarded events, router netflow data, and CloudWatch logs. In these cases it would be appropriate to set the host.* fields to information about the originating machine.

From ECS:

ECS host.* fields should be populated with details about the host on which the event happened, or from which the measurement was taken.

Some issues related to this:

I think we need a way for inputs and modules to be able to "designate" that host.* should not be set by default. The output pipeline and also the add_host_metadata processor will need to honor this "designation".

andrewkroh commented 5 years ago

@webmat Would it be appropriate to populate observer.* in any of these above use cases? If so which host's data would go there?

webmat commented 5 years ago

Here are cases that should populate observer:

Here are cases that should not populate observer:

In both of these cases, if the data stream contains the source host's details, they should go to host.*.

webmat commented 5 years ago

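The case of monitoring containers may require a chat. I see a few cases, and there may be more:

  • An agent can be installed on the host running Docker directly (not using Docker)
    • It can then monitor containers
    • It can monitor the host itself
  • An agent can be installed as a sidecar container, with the purpose of monitoring other containers local to the host
  • An agent can be installed in a container, with the purpose of monitoring the host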

I'm not sure the current semantics as defined by ECS are clear nor sufficient to fully capture this. Maybe I'm wrong. But I'd be more than happy to have a chat with folks specialized with monitoring Docker, and hash out ideas here. Let me know if that would be useful.

missnebun commented 5 years ago

I just installed filebeat 7.4 and I am using the netflow module. I have the same problem: agent.hostname is the hostname of the netflow input server instead of the sender. I had to update some visualizations in Kibana and replace agent.hostname with observer.ip.

urso commented 5 years ago

It is not only the processors, but also libbeat directly adding some fields. See: https://github.com/elastic/beats/blob/master/libbeat/publisher/processing/default.go#L78

I'm in favor of not automatically modifying any event once it hits libbeat. All modification should be opt-in via processors.

exekias commented 5 years ago

I think we need a way for inputs and modules to be able to "designate" that host.* should not be set by default. The output pipeline and also the add_host_metadata processor will need to honor this "designation".

I totally agree, we have this same problem with cloud metadata and cloud monitoring modules (AWS, Azure, etc). Something we have with add_cloud_metadata is that it won't override any info sent by the input/module itself:

https://github.com/elastic/beats/blob/1fb41ea3d464ba55317fc2201183f7ad80e6429a/libbeat/processors/add_cloud_metadata/add_cloud_metadata.go#L112-L117

In this case, cloud metadata for the agent is not sent, which may not be ideal (it could make sense to have it under observer.cloud?)

Something like this could make sense for add_host_metadata, where it could decide to put the metadata somewhere else (as you said, maybe observer)
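As a rough illustration of that idea, here is a minimal Go sketch of a "do not override, fall back to observer" rule. It does not use the libbeat API: `Event` is a plain map standing in for a Beats event, and `addHostMetadata` is a hypothetical helper, not the real add_host_metadata processor.

```go
package main

import "fmt"

// Event is a simplified stand-in for a Beats event (assumption: a plain
// nested map, not libbeat's actual event type).
type Event map[string]interface{}

// addHostMetadata is a hypothetical helper. It writes the local machine's
// metadata under "host" unless the input already set "host", in which case
// it falls back to "observer". It never overwrites an existing namespace and
// returns the key it wrote to, or "" if nothing was written.
func addHostMetadata(ev Event, meta map[string]interface{}) string {
	target := "host"
	if _, exists := ev["host"]; exists {
		target = "observer"
	}
	if _, exists := ev[target]; exists {
		return ""
	}
	// Copy the metadata so events do not share one underlying map.
	copied := make(map[string]interface{}, len(meta))
	for k, v := range meta {
		copied[k] = v
	}
	ev[target] = copied
	return target
}

func main() {
	local := map[string]interface{}{"name": "collector-01"}

	// Event forwarded from another machine: host.* was set by the input.
	forwarded := Event{"host": map[string]interface{}{"name": "router-7"}}
	fmt.Println(addHostMetadata(forwarded, local)) // observer

	// Event produced locally: no host.* yet.
	own := Event{"message": "local log line"}
	fmt.Println(addHostMetadata(own, local)) // host
}
```

Whether observer is the right destination in every case is exactly the open question in this thread; the sketch only shows the mechanics of not clobbering data that came from the input.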

The case of monitoring containers may require a chat, I see a few cases, there may be more:

  • An agent can be installed on the host running Docker directly (not using Docker)

    • It can then monitor containers
    • It can monitor the host itself
  • An agent can be installed as a sidecar container, with the purpose of monitoring other containers local to the host
  • An agent can be installed in a container, with the purpose of monitoring the host

I'm not sure the current semantics as defined by ECS are clear nor sufficient to fully capture this. Maybe I'm wrong. But I'd be more than happy to have a chat with folks specialized with monitoring Docker, and hash out ideas here. Let me know if that would be useful.

Happy to participate in this conversation @webmat!

urso commented 5 years ago

To me it feels like the overall problem is that we have a limited set of namespaces, but yet there is potentially a trail of subsystems an event might have been passed through. Ultimately one might want an array of system descriptors the event has passed.

For now I want to solve the issue at hand, as this has come up a few times already. The host and agent fields are always overwritten: currently host and other fields are enforced, potentially overwriting existing values no matter where they come from. This is not really an ECS problem itself, but a general Beats one, as we also interfere with users who already use ECS for their own data.

Overall ECS discussions are maybe better handled in the github.com/elastic/ecs repository.

For example in filebeat we introduced a kafka input, to allow architectures like Beats->Kafka->Filebeat->Elasticsearch. The problem becomes even more apparent in this situation. In the simplest case Beats would be the host, and the second Filebeat should be the collector. We also pass some fields via @metadata from the inputs like pipeline name, document id, or index name. Once the event reaches the 'collector' filebeat, it should be treated by default as if Beats->Elasticsearch has been configured directly. And this is currently not the case.

We also do not want to introduce "heavy" breaking changes in 7.x. So for 7.x I'm planning these changes (independent of ECS):

For 8.x I would like to remove setting host, agent or observer from within libbeat. Libbeat should not enforce fields, but allow solutions/users to opt-in. All these fields are already available via processors, and I'd prefer to provide default configs with these being enabled. Moving more functionality to processors and removing default behavior will also simplify the setup in libbeat itself.

I will create issues for individual tasks, if we agree on the plan.

webmat commented 5 years ago

This plan makes a lot of sense. Thanks for putting this together, @urso!

simitt commented 4 years ago

FWIW this also concerns APM, we are setting observer information.

andrewkroh commented 4 years ago

@urso, regarding an overwrite option in the processors, you wrote:

If false, the namespaces are protected. e.g. if host is already available in the event, the processor will not add any host.X fields.

This sounds good, but I don't think it can work as long as libbeat continues to set host.name, because then the add_host_metadata fields would never get added: host would always exist. We could move setting the host.name field into the add_host_metadata processor. I think that would address this problem, but it would probably cause some level of breaking change (I haven't thought through all the consequences yet). WDYT?

For reference this is the code that adds host.name followed by where it runs the global add_host_metadata:

https://github.com/elastic/beats/blob/61fe9fc533bc86af0cb0486a3740bf68d96556d5/libbeat/publisher/processing/default.go#L311-L317

elasticmachine commented 4 years ago

Pinging @elastic/integrations-services (Team:Services)

urso commented 4 years ago

Good point. Maybe it would be better to apply the 'builtin' fields after the local and global processors have been run. When applying builtins, the builtins must not overwrite existing fields, but can add missing fields to namespaces. I think this is a change we can do in 7.x. Although it is a change in behavior, we might consider the current behavior a bug, and the change a fix :)
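The "add missing fields, never overwrite" rule described here could look roughly like the following Go sketch. It works on plain nested maps rather than libbeat types, and `addMissing` is a hypothetical name chosen for illustration.

```go
package main

import "fmt"

// addMissing merges src into dst without overwriting anything already in
// dst. When both sides hold a nested map under the same key (a namespace
// such as "host"), it recurses so that missing leaf fields are still added.
// Note: values absent from dst are assigned as-is, so nested maps are shared
// with src; a real implementation would likely copy them.
func addMissing(dst, src map[string]interface{}) {
	for k, v := range src {
		existing, ok := dst[k]
		if !ok {
			dst[k] = v
			continue
		}
		dstSub, dstIsMap := existing.(map[string]interface{})
		srcSub, srcIsMap := v.(map[string]interface{})
		if dstIsMap && srcIsMap {
			addMissing(dstSub, srcSub)
		}
		// Otherwise the existing value wins and is left untouched.
	}
}

func main() {
	// Event as it looks after local and global processors have run.
	event := map[string]interface{}{
		"host": map[string]interface{}{"name": "router-7"},
	}
	// Builtin fields the shipper would like to contribute.
	builtins := map[string]interface{}{
		"host":  map[string]interface{}{"name": "collector-01", "id": "abc123"},
		"agent": map[string]interface{}{"type": "filebeat"},
	}

	addMissing(event, builtins)

	host := event["host"].(map[string]interface{})
	agent := event["agent"].(map[string]interface{})
	fmt.Println(host["name"])  // router-7 (kept)
	fmt.Println(host["id"])    // abc123 (added)
	fmt.Println(agent["type"]) // filebeat (added)
}
```

Note that this fills gaps inside an existing namespace, which differs from the stricter "protected namespace" option quoted earlier in the thread, where an existing host would block all host.X additions.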

andrewkroh commented 4 years ago

I'm afraid that if we move the builtins after the globals, we will break existing pipelines. There are users with drop_fields in their pipelines to remove builtins.

My proposal is that we implement your plan above in master targeting 8.0. Then in order to avoid breaking changes in 7.x, we use a bit of event metadata to influence the later stages of the pipeline's handling of host metadata. See https://github.com/elastic/beats/pull/17919 which would address the pressing need of omitting host when forwarding.
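To make the "bit of event metadata" idea concrete, here is a small Go sketch of a later pipeline stage that checks a metadata flag before setting host.name. Everything in it is an assumption for illustration: the `Event` struct, the `skip_host` key, and `applyHostName` are invented here and are not taken from the PR or from libbeat.

```go
package main

import "fmt"

// Event is a simplified stand-in with separate metadata and fields
// (assumption: not libbeat's actual event type).
type Event struct {
	Meta   map[string]interface{}
	Fields map[string]interface{}
}

// skipHostKey is a hypothetical metadata key an input could set to signal
// that the event was forwarded and host.* should be left alone.
const skipHostKey = "skip_host"

// applyHostName sets host.name unless the input flagged the event via
// metadata. It reports whether the field was written.
func applyHostName(ev *Event, hostname string) bool {
	// Reading from a nil map is safe in Go, so Meta may be unset.
	if skip, _ := ev.Meta[skipHostKey].(bool); skip {
		return false
	}
	if ev.Fields == nil {
		ev.Fields = map[string]interface{}{}
	}
	ev.Fields["host"] = map[string]interface{}{"name": hostname}
	return true
}

func main() {
	forwarded := &Event{Meta: map[string]interface{}{skipHostKey: true}}
	local := &Event{}

	fmt.Println(applyHostName(forwarded, "collector-01")) // false
	fmt.Println(applyHostName(local, "collector-01"))     // true
}
```

The appeal of this shape for 7.x is that events without the flag behave exactly as before, so existing pipelines that drop builtins with drop_fields keep working.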

andrewkroh commented 4 years ago

7.x Changes

We should implement the changes detailed above by @urso for 8.0. In an effort to minimize breaking changes to 7.x I am making some changes to address the issue without affecting the default behavior users are currently expecting. This is mostly accomplished through updating the example configurations to demonstrate using tags to disable add_host_metadata and by adding config options where necessary to disable host.name in libbeat.

MarcusCaepio commented 3 years ago

Hi all, just want to mention, it is also a problem with all the cisco modules: https://github.com/elastic/beats/issues/14933

elasticmachine commented 2 years ago

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)