elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html

Configurable metrics-apm datastream pattern. #8182

Open evilezh opened 2 years ago

evilezh commented 2 years ago

Problem in short: I couldn't find a way to tell it not to split metrics per service. The current pattern is odd: metrics-apm.app.<service-name>-<ns>. Now imagine I have 100 services in 5 namespaces. That would create 500 data streams, and each data stream would have ILM ...

None of those indices will fill up properly, and the lifecycle won't work well either. In my case I would prefer to use a common index for all metrics-apm data and recycle it with a single policy.
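
For illustration, the count above can be reproduced with a small sketch. It assumes the naming pattern quoted in this comment; the service and namespace names are hypothetical placeholders, and any sanitization APM Server applies to service names is ignored.

```python
from itertools import product


def app_metric_data_streams(services, namespaces):
    """Return one data stream name per (service, namespace) pair."""
    return [
        f"metrics-apm.app.{service}-{namespace}"
        for service, namespace in product(services, namespaces)
    ]


# Hypothetical inventory: 100 services deployed in 5 namespaces.
services = [f"service{i}" for i in range(100)]
namespaces = ["dev", "test", "qa", "staging", "prod"]

streams = app_metric_data_streams(services, namespaces)
print(len(streams))  # 500
print(streams[0])    # metrics-apm.app.service0-dev
```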

axw commented 2 years ago

@evilezh thanks for opening the issue. We have been thinking about adding something like this, but no concrete plans yet. We might come back with some questions for you in the future, when we've prioritised this.

axw commented 2 years ago

One thing I should have noted earlier: for builtin metrics (i.e. those measured by Elastic APM agents) specifically, we will start sending these to a common data stream per namespace as of 8.3. See https://github.com/elastic/apm-server/issues/7520.

glucaci commented 2 years ago

We also have a problem with this and have started seeing JVM memory pressure across our whole cluster because of it.

The problem is that if you have around 200 services, which we will reach soon, and an index rollover every 10 days, the metrics indices alone reach 600 shards per month. If you want to keep the data for 3-4 months, you then need a lot of RAM, which is not justifiable relative to the disk space used.

I see it as a high-priority issue to be able to merge all the metrics indices into one, as for logs and traces.
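
A rough sketch of that arithmetic, assuming one primary shard per backing index, no replicas, and purely time-based rollover (none of which is stated in this thread); the function name and parameters are hypothetical:

```python
import math


def backing_index_shards(data_streams, rollover_days, retention_days,
                         primary_shards=1, replicas=0):
    """Estimate the shards held by a set of data streams.

    Each data stream gets a new backing index at every rollover, and each
    backing index holds primary_shards * (1 + replicas) shards.
    """
    indices_per_stream = math.ceil(retention_days / rollover_days)
    return data_streams * indices_per_stream * primary_shards * (1 + replicas)


print(backing_index_shards(200, 10, 30))   # 600  (200 services, one month)
print(backing_index_shards(200, 10, 120))  # 2400 (200 services, four months)
print(backing_index_shards(1, 10, 120))    # 12   (one shared stream, four months)
```

Under these assumptions the shard count grows linearly with the number of per-service data streams, regardless of how little data each one holds.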

Thanks!

glucaci commented 1 year ago

@axw any news about this feature?

axw commented 1 year ago

@glucaci since 8.3 we've been sending Elastic APM agent built-in metrics to a common data stream. Custom metrics still go to service-specific data streams, and we don't yet have a solution for splitting them out.

Are you on a recent version of the stack? Are you still observing issues?

glucaci commented 1 year ago

Currently we don't have any memory issues, but the cluster has reached its maximum shard allocation, which means we cannot create any additional shards (e.g. by adding a watcher). 95% of our indices are application metrics: .ds-metrics-apm.app.the-application-name

Are there any plans to do this in a standard way for the Elastic APM agent and OpenTelemetry?

We have a temporary solution from a support ticket, which I haven't tried yet. It involves changing the metrics ingest pipeline and adding the following script:

{
  "script": {
    "source": """
      ctx["data_stream.dataset"] = "apm.app.all";
      ctx["_index"] = "metrics-apm.app.all-" + ctx["data_stream.namespace"];
    """
  }
}
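
To make the effect of the script concrete, here is a dry run of the same two assignments in Python, outside Elasticsearch. It is only a sketch: `ctx` is modeled as a plain dictionary with flat dotted keys, as in the documents shown in this thread, and the sample values are hypothetical.

```python
def reroute_to_shared_stream(ctx):
    """Mimic the two assignments made by the ingest script above."""
    ctx["data_stream.dataset"] = "apm.app.all"
    ctx["_index"] = "metrics-apm.app.all-" + ctx["data_stream.namespace"]
    return ctx


# Hypothetical metric document before the pipeline runs.
doc = {
    "_index": "metrics-apm.app.api_1-default",
    "data_stream.type": "metrics",
    "data_stream.dataset": "apm.app.api_1",
    "data_stream.namespace": "default",
}

rerouted = reroute_to_shared_stream(dict(doc))
print(rerouted["_index"])               # metrics-apm.app.all-default
print(rerouted["data_stream.dataset"])  # apm.app.all
```

Every service in the same namespace ends up with the same target name, so one data stream per namespace replaces one per service.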

If this is a good solution, why isn't it included in the Elastic release?

Thanks!

axw commented 1 year ago

@glucaci which version of the stack are you on? Would you be able to share a document for a few different .ds-metrics-apm.app.the-application-name indices (for different values of "the-application-name")? It may help us identify whether there's a bug with the current metrics-combining code, or whether it's just the custom metrics that we don't yet have a solution for.

> We have a temporary solution from a support ticket, which I haven't tried yet. It involves changing the metrics ingest pipeline and adding the following script ... If this is a good solution, why isn't it included in the Elastic release?

This is a workaround that won't work in all situations. It will work if there is no overlap between the metrics across the different services, or if they overlap but the metric definitions do not conflict. If there are conflicts, then it would prevent ingestion.
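
One way to check for such conflicts before merging is sketched below. It assumes that "conflict" means the same metric field being mapped with different types in different services, which is an interpretation rather than something stated here; the field names, types, and helper function are hypothetical.

```python
def find_mapping_conflicts(mappings_by_service):
    """Return metric fields that are declared with more than one type.

    mappings_by_service maps a service name to {field_name: field_type}.
    The result maps each conflicting field to {field_type: [services]}.
    """
    seen = {}
    for service, fields in mappings_by_service.items():
        for field, field_type in fields.items():
            seen.setdefault(field, {}).setdefault(field_type, []).append(service)
    return {field: types for field, types in seen.items() if len(types) > 1}


# Hypothetical custom metrics reported by three services.
mappings = {
    "api_1": {"queue.size": "long", "cache.hit_ratio": "double"},
    "api_2": {"queue.size": "long"},
    "api_3": {"cache.hit_ratio": "keyword"},
}

conflicts = find_mapping_conflicts(mappings)
print(conflicts)
# {'cache.hit_ratio': {'double': ['api_1'], 'keyword': ['api_3']}}
```

In this example `queue.size` overlaps without conflicting, while `cache.hit_ratio` would be a problem in a shared data stream.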

glucaci commented 1 year ago

Sure, below you can see a document from one of the apps.

{
  "_index": ".ds-metrics-apm.app.api_1",
  "_id": "i4vUnoQBHKZw1ZUhanQN",
  "_version": 1,
  "_score": 0,
  "_source": {
    "agent": {
      "name": "dotnet",
      "version": "1.4.0.599"
    },
    "process.runtime.dotnet.gc.committed": 84,
    "data_stream.namespace": "default",
    "data_stream.type": "metrics",
    "processor": {
      "name": "metric",
      "event": "metric"
    },
    "labels": {
      "service_namespace": "Api_1"
    },
    "metricset.name": "app",
    "observer": {
      "hostname": "55923b643526",
      "id": "06d79c18-d317-400f-b8e2-9a74b8974db4",
      "type": "apm-server",
      "ephemeral_id": "351565c4-2459-49ae-a13b-40a07173d9f7",
      "version": "8.5.1"
    },
    "@timestamp": "2022-11-22T10:13:50.595Z",
    "ecs": {
      "version": "1.12.0"
    },
    "service": {
      "node": {
        "name": "8e2ba94d-f741-4d24-ab4b-d9e58ca1bfbe"
      },
      "environment": "DEV",
      "name": "Api_1 Api",
      "language": {
        "name": "unknown"
      },
      "version": "1.52.0.0"
    },
    "data_stream.dataset": "apm.app.api_1",
    "event": {
      "agent_id_status": "missing",
      "ingested": "2022-11-22T10:13:51Z"
    }
  },
  "fields": {
    "service.environment": ["DEV"],
    "process.runtime.dotnet.gc.committed": [84],
    "service.name": ["Api_1 Api"],
    "data_stream.namespace": ["default"],
    "processor.name": ["metric"],
    "service.node.name": ["8e2ba94d-f741-4d24-ab4b-d9e58ca1bfbe"],
    "service.language.name": ["unknown"],
    "observer.hostname": ["55923b643526"],
    "data_stream.type": ["metrics"],
    "metricset.name": ["app"],
    "event.ingested": ["2022-11-22T10:13:51.000Z"],
    "observer.id": ["06d79c18-d317-400f-b8e2-9a74b8974db4"],
    "@timestamp": ["2022-11-22T10:13:50.595Z"],
    "service.version": ["1.52.0.0"],
    "observer.ephemeral_id": ["351565c4-2459-49ae-a13b-40a07173d9f7"],
    "observer.version": ["8.5.1"],
    "observer.type": ["apm-server"],
    "ecs.version": ["1.12.0"],
    "data_stream.dataset": ["apm.app.api_1"],
    "processor.event": ["metric"],
    "agent.name": ["dotnet"],
    "agent.version": ["1.4.0.599"],
    "event.agent_id_status": ["missing"],
    "labels.service_namespace": ["Api_1"]
  }
}

The document is the same for all the apps, but with different fields, which are exported by the OpenTelemetry instrumentation for the dotnet runtime.

Will the workaround work in this case?

The same metrics format is also used by the Java OpenTelemetry instrumentation. Are there any plans to align the Elastic APM agent with the OpenTelemetry ones and create a "standard" ingestion?

Thanks!

axw commented 1 year ago

Thanks @glucaci!

> Will the workaround work in this case?

The metrics look like they shouldn't collide with any others - I think the workaround is safe in this case.

> The same metrics format is also used by the Java OpenTelemetry instrumentation. Are there any plans to align the Elastic APM agent with the OpenTelemetry ones and create a "standard" ingestion?

We do have some plans to map OpenTelemetry metrics to the ones our agents produce. We already do this for JVM runtime metrics, but we haven't yet done it for .NET/CLR metrics. Although we map the JVM metrics, we also record the original OTel metrics; so this means we're still creating application-specific data streams for OTel-instrumented Java programs. I think we'll need to revisit that decision.

nyp-cgranata commented 10 months ago

We are experiencing the same issue as the folks above.

@axw Have there been any updates?

axw commented 10 months ago

@nyp-cgranata we recently added support for configurable routing via ingest pipelines: https://github.com/elastic/apm-server/issues/10991

In the not too distant future we intend to make at least the following changes: