elastic / elastic-agent

Elastic Agent - single, unified way to add monitoring for logs, metrics, and other types of data to a host.

V2: Elastic agent monitoring data stream names should not be per process #1814

Open cmacknz opened 1 year ago


The changes in https://github.com/elastic/elastic-agent/pull/1702 changed the naming convention of the data streams used for sending agent process logs to Fleet to match how the processes are modelled in V2.

This has resulted in monitoring logs for spawned processes failing to be indexed in Elasticsearch when monitoring is enabled in Fleet. The root cause is that Fleet hard codes the list of expected monitoring data streams and uses them when adding index permissions into the agent policy. The result is that the agent has no permission to write to data streams using the new naming convention.

Example errors:

action [indices:admin/auto_create] is unauthorized for API key id [...] of user [elastic/fleet-server] on indices [metrics-elastic_agent.log_default-default], this action is granted by the index privileges [auto_configure,create_index,manage,all]

action [indices:admin/auto_create] is unauthorized for API key id [...] of user [elastic/fleet-server] on indices [metrics-elastic_agent.system/metrics_default-default], this action is granted by the index privileges [auto_configure,create_index,manage,all]

The existing monitoring data streams are defined in Fleet at https://github.com/elastic/kibana/blob/c5f20721e1879f1ebe161b0fa3b207f10ed2b6f7/x-pack/plugins/fleet/common/constants/agent_policy.ts#L15-L28

Monitoring permissions are generated here using this list: https://github.com/elastic/kibana/blob/19413b7daae983b95dbb9f5c7b39cb8f3578ebfa/x-pack/plugins/fleet/server/services/agent_policies/monitoring_permissions.ts#L21-L52
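
For illustration, the shape of that hardcoded list and its expansion into index permissions is roughly the following (a paraphrased sketch with made-up names, not the exact Kibana source; see the linked files for the real definitions):

// Paraphrased sketch of Fleet's hardcoded monitoring datasets.
const MONITORING_DATASETS = [
  'elastic_agent',
  'elastic_agent.elastic_agent',
  'elastic_agent.apm_server',
  'elastic_agent.filebeat',
  'elastic_agent.fleet_server',
  'elastic_agent.metricbeat',
  'elastic_agent.osquerybeat',
  'elastic_agent.packetbeat',
  'elastic_agent.endpoint_security',
  'elastic_agent.auditbeat',
  'elastic_agent.heartbeat',
  'elastic_agent.cloudbeat',
];

// Index permissions are expanded per data stream type and namespace,
// producing names like logs-elastic_agent.filebeat-default. Nothing in
// this list matches the new per-process V2 names such as
// elastic_agent.log_default, hence the authorization errors above.
const namespace = 'default';
const indexNames = MONITORING_DATASETS.flatMap((dataset) =>
  ['logs', 'metrics'].map((type) => `${type}-${dataset}-${namespace}`)
);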

We need to either modify Fleet to generate index permissions for the new data stream names, or update the agent to use the index names expected in v1 (effectively, the binary names).

Note that logs for the agent itself can successfully be sent to Fleet.

kpollich commented 1 year ago

Could you confirm the list of new indices we'd need permissions for here? I'm happy to update the hardcoded list in Fleet. What I see so far from the error messages includes:

// Agent v2 datasets
'elastic_agent.filestream_monitoring',
'elastic_agent.log_default', // Why the `_default` here?
'elastic_agent.system',

It's probably helpful to look at Fleet's snapshot tests for these permissions as well. The full possible set of permissions is:

"logs-elastic_agent-testnamespace123",
"logs-elastic_agent.elastic_agent-testnamespace123",
"logs-elastic_agent.apm_server-testnamespace123",
"logs-elastic_agent.filebeat-testnamespace123",
"logs-elastic_agent.fleet_server-testnamespace123",
"logs-elastic_agent.metricbeat-testnamespace123",
"logs-elastic_agent.osquerybeat-testnamespace123",
"logs-elastic_agent.packetbeat-testnamespace123",
"logs-elastic_agent.endpoint_security-testnamespace123",
"logs-elastic_agent.auditbeat-testnamespace123",
"logs-elastic_agent.heartbeat-testnamespace123",
"logs-elastic_agent.cloudbeat-testnamespace123",
"metrics-elastic_agent-testnamespace123",
"metrics-elastic_agent.elastic_agent-testnamespace123",
"metrics-elastic_agent.apm_server-testnamespace123",
"metrics-elastic_agent.filebeat-testnamespace123",
"metrics-elastic_agent.fleet_server-testnamespace123",
"metrics-elastic_agent.metricbeat-testnamespace123",
"metrics-elastic_agent.osquerybeat-testnamespace123",
"metrics-elastic_agent.packetbeat-testnamespace123",
"metrics-elastic_agent.endpoint_security-testnamespace123",
"metrics-elastic_agent.auditbeat-testnamespace123",
"metrics-elastic_agent.heartbeat-testnamespace123",
"metrics-elastic_agent.cloudbeat-testnamespace123",
blakerouse commented 1 year ago

@kpollich Is it possible to use wildcards? There are a lot of possibilities, and the new data streams are based on what is defined in the policy.

It is defined as:

elastic_agent.%{input_type}_%{output_name}.%{monitoring_namespace}
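
For example, plugging in the values from the errors above:

  input_type=log,            output_name=default, monitoring_namespace=default
    -> metrics-elastic_agent.log_default-default
  input_type=system/metrics, output_name=default, monitoring_namespace=default
    -> metrics-elastic_agent.system/metrics_default-default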

Also, how is this handled on upgrade? Will it result in a new policy revision? Existing keys will need to be upgraded, and the older data streams must not be removed, since pre-8.6 agents will still push to them.

kpollich commented 1 year ago

Is it possible to use wildcards? There are a lot of possibilities, and the new data streams are based on what is defined in the policy.

Maybe? I don't really know how these permissions get handled once they're written to the agent policy. The Fleet code we're looking at here generates this chunk of an agent policy:

output_permissions:
  default:
    _elastic_agent_monitoring:
      indices:
        - names:
            - logs-elastic_agent.apm_server-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.apm_server-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent.auditbeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.auditbeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent.cloudbeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.cloudbeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.elastic_agent-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.endpoint_security-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent.endpoint_security-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent.filebeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.filebeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent.fleet_server-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.fleet_server-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent.heartbeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.heartbeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent.metricbeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.metricbeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent.osquerybeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.osquerybeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - logs-elastic_agent.packetbeat-default
          privileges:
            - auto_configure
            - create_doc
        - names:
            - metrics-elastic_agent.packetbeat-default
          privileges:
            - auto_configure
            - create_doc

I think Fleet Server parses these permissions when it generates API keys for the agent? If that's correct, then wildcards should be supported as far as I know: https://www.elastic.co/guide/en/elasticsearch/reference/current/security-api-put-role.html#security-api-put-role-request-body
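
For reference, a wildcarded entry in that role request body would look something like this (an untested sketch; the role name is made up):

PUT /_security/role/example_agent_monitoring
{
  "indices": [
    {
      "names": ["logs-elastic_agent.*", "metrics-elastic_agent.*"],
      "privileges": ["auto_configure", "create_doc"]
    }
  ]
}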

If I understand correctly, the reason we don't use wildcards today is a security precaution to prevent arbitrary access to Elasticsearch by agents. Whether that holds up in this case or not, I'm not sure. @joshdover can probably help out a bit here with some of the gaps in my knowledge.

Also, how is this handled on upgrade? Will it result in a new policy revision? Existing keys will need to be upgraded, and the older data streams must not be removed, since pre-8.6 agents will still push to them.

Yes, we'd need to migrate all policies to the new version that includes any new permissions, which would generate a new revision.

cmacknz commented 1 year ago

Just adding another note here: in the Fleet agent details page there is a view for the logs that includes a way to filter by monitoring data stream. I think these are just prepared queries using the hardcoded list of data streams that exist today. This would need to be adjusted in the UI as well. If you click through to the Kibana Logs view via the "Open in Logs" link you can filter on whatever you want, though.

[Screenshot: Fleet agent logs view with the monitoring data stream filter, 2022-11-28]
cmacknz commented 1 year ago

I am increasingly convinced we should use the existing data streams for monitoring data in 8.6 and deal with migrating to the format we want afterwards in 8.7. The hardcoded data streams will cause problems as we add new input types to monitor; I'm just not convinced there is a way to remove the hardcoded names in the time we have left in 8.6.

kpollich commented 1 year ago

+1 to @cmacknz's suggestion - I think there are enough side effects to track down as a result of these hardcoded Fleet values that it'd be best if we could solve this on the agent side for 8.6, then iterate in 8.7.

joshdover commented 1 year ago

If I understand correctly, the reason we don't use wildcards today is a security precaution to prevent arbitrary access to Elasticsearch by agents. Whether that holds up in this case or not, I'm not sure. @joshdover can probably help out a bit here with some of the gaps in my knowledge.

Yes, in theory ES should accept wildcards in the index privilege list here and it should work. This would need to be tested though.

I think these are just prepared queries using the hardcoded list of data streams that exist today. This would need to be adjusted in the UI as well. If you click through to the Kibana Logs view via the "Open in Logs" link you can filter on whatever you want, though.

I don't think that's the case. IIRC we do a terms enum query on the data_stream.dataset field on logs-elastic_agent* to populate this list. Ingesting the data into a new data stream would be feasible. We would also need to update the elastic_agent integration package for all the new stream names. Ideally, we'd switch to using the dataset_is_prefix option that APM uses, so the index template naming scheme has a wildcard; this way we don't need to add new templates for every new agent process.
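
For context, that lookup is roughly a terms enum request like the following (a sketch, not the exact Kibana code):

POST /logs-elastic_agent*/_terms_enum
{
  "field": "data_stream.dataset"
}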

That said, I'm +1 to using the existing names if at all possible for this release.

blakerouse commented 1 year ago

I am happy to revert to the old names for 8.6 only, but 8.7 needs to use the new names for the data streams.

Wildcarding is the best path forward. I believe the Elastic Agent being able to write (and only write) to logs-elastic_agent.* is all that we need. That allows the monitoring namespace to be changed, and new spawned components and new integrations to be added, without having to generate a new API key just to ingest monitoring logs from the Elastic Agent.

blakerouse commented 1 year ago
[Quoted screenshot: the data stream filter shown above]

This UI element doesn't seem hardcoded, as mine shows the new stream names when my elastic-agent is ingesting with superuser permissions.

blakerouse commented 1 year ago

Wildcard for metrics-elastic_agent.* will also be needed.
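
Mirroring the policy chunk above, the wildcarded monitoring permissions could then collapse to something like this (a sketch, untested):

output_permissions:
  default:
    _elastic_agent_monitoring:
      indices:
        - names:
            - logs-elastic_agent.*
            - metrics-elastic_agent.*
          privileges:
            - auto_configure
            - create_doc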

cmacknz commented 1 year ago

Wildcarding is the best path forward. I believe the Elastic Agent being able to write (and only write) to logs-elastic_agent.* is all that we need.

Agreed, let's fix this for 8.6 and then open a separate issue summarizing the changes needed for 8.7. I have this issue listed as a blocker for the 8.6 release so we'll want to close it once 8.6 is fixed.

blakerouse commented 1 year ago

Since this issue describes what is needed for 8.7, I would rather leave this one open and create a new one just for 8.6.

cmacknz commented 1 year ago

Makes sense, I created https://github.com/elastic/elastic-agent/issues/1847 for 8.6. I'll update it as the tracking issue everywhere.

joshdover commented 1 year ago

While I agree that per-process data streams don't fit the long-term architecture we want to provide (where the processes are implementation details and not something the user should know or care about), I think we should discuss more what the data streams should really be to provide a good experience, especially around applying features like ILM and index authorization.

IMO we should be following the pattern laid down by the data stream naming scheme in general: data stream names correlate to what kind of data is contained in the data stream, rather than where the data came from. Of course we probably still want an elastic_agent.input field for filtering, but I doubt users will want to apply different data retention policies to filestream monitoring data vs. httpjson data.

Here's a quick rough idea of groupings that could make more sense from my perspective:

- logs-elastic_agent-* and metrics-elastic_agent-*: the agent's own logs and its cpu/memory metrics
- logs-elastic_agent.input-* and metrics-elastic_agent.input-*: logs and metrics for input units
- logs-elastic_agent.output-* and metrics-elastic_agent.output-*: logs and metrics for output units

This type of grouping would allow users to retain information like input and output metrics for longer periods of time than the cpu and memory metrics of the agent itself. This may be useful for more general ingest performance tracking over time, to help pinpoint data sources that are causing a lot more noise or have seasonal patterns. cpu and memory don't have that same useful shelf life IMO, as they're more useful for interactively debugging something you're trying to optimize.

joshdover commented 1 year ago

cc @nimarezainia I think you likely have useful info to weigh in here

blakerouse commented 1 year ago

@joshdover At the moment we map the data streams to the component's dataset. I like @joshdover's idea; per-component control (e.g. an ILM policy for a specific component) is probably not something customers will want. We can still offer the filtering in the log viewer, but with ILM mapping to the groupings listed above.

nimarezainia commented 1 year ago

+1 on this approach. CPU and memory metrics don't need longevity, and giving users the ability to apply other Elastic features like ILM to the data they collect is important.

Question: as an example, what grouping would "logs-elastic_agent.apm_server-testnamespace123" map to under this proposal?

joshdover commented 1 year ago

As an example, what grouping would "logs-elastic_agent.apm_server-testnamespace123" map to under this proposal?

Those logs would go to logs-elastic_agent.input-testnamespace123 with a field like unit: apm-server for filtering.
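
For example, a log document in that data stream might look like this (field names illustrative, not a finalized schema):

{
  "data_stream": {
    "type": "logs",
    "dataset": "elastic_agent.input",
    "namespace": "testnamespace123"
  },
  "unit": "apm-server",
  "message": "example apm-server log line"
}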

jlind23 commented 1 year ago

@blakerouse @cmacknz What is the status here? @joshdover will your PR fix all the problems mentioned here?

blakerouse commented 1 year ago

@jlind23 We just need to come to 100% agreement on the index names we want to use. Then we need to make the change in the Elastic Agent, as well as in the permissions that are requested for the API key.

cmacknz commented 1 year ago

I think Josh's suggestions in https://github.com/elastic/elastic-agent/issues/1814#issuecomment-1342413472 are what we should go with; nobody seems opposed.

joshdover commented 1 year ago

@joshdover will your PR fix all the problems mentioned here?

My PR doesn't address this. Agree we should be unblocked to move forward. It will require changes across Agent, Fleet Server, and the elastic_agent package.

blakerouse commented 1 year ago

The more I think about this structure, the more I see an issue.

What goes into the *.output data stream? Only the shipper logs? At the moment the Beats do not place a unit.id in their log messages, so the Elastic Agent cannot tell the difference between a log message for an input versus one for an output.

Do we need to separate the *.input and *.output events? Maybe we could just go with *.component and have all component info go into that index?

cmacknz commented 1 year ago

Do we need to separate the *.input and *.output events?

It seems like a useful separation to me, particularly for allowing output metrics to be separated out with a possibly different ILM policy than input metrics. The output performance metrics can be quite critical to diagnosing performance issues and tuning ingest performance.

I do agree that we are unlikely to use logs-elastic_agent.output initially, since we'd have to modify Beats to write to it. The first use of this will be the shipper in the near-ish future. I don't know that it is a problem for this to be empty initially.

blakerouse commented 1 year ago

Well, given that the Beats logs cannot easily be divided between *.input and *.output, I believe that calling it *.input would be incorrect.

cmacknz commented 1 year ago

The Beats logs could likely be divided with enough effort, but it's probably not worth doing. The input and output division problem is solved by the introduction of the shipper, so we could just not implement this until the shipper is ready.

We could go with components instead of the input and output division, but I wouldn't want to do that if it is just to work around a temporary state of the V2 architecture where the shipper isn't usable yet.

blakerouse commented 1 year ago

I like the idea of just placing them all under components. Do we really need separate ILM between a shipper and a component? To me it seems like they would almost always share the same lifecycle.

cmacknz commented 1 year ago

The output/shipper logs and metrics will have a much stronger correlation with overall ingest performance, so to me they will always be more important, and it seems valuable to separate them out for that reason.

We could combine everything into a components index to start and then split the output index out later, I suppose. That is definitely much easier to start with, and avoids having to deal with routing the Beats' output metrics and logs to the right index initially.