elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html

APM service map assumes service.environment is null for some services, possibly causing missing links in service map. #8254

Open davemoore- opened 2 years ago

davemoore- commented 2 years ago

Cross-posting from this discussion board thread which I now believe could be a bug.

Elastic version: 8.1.2

Elastic environment: Elastic Cloud on GCP us-east1

APM instrumentation: OpenTelemetry

Client browser: Chrome

Client OS: MacOS 12.1

Describe the bug: The APM Service Map incorrectly sets service.environment to null for two of my services, which might be the reason why they appear orphaned in the service map. I've verified that service.environment is set to development in every span that references those services. This behavior happens consistently for only those two services, even after starting with a fresh dataset and cluster. Each Python service is instrumented using the same code and the same environment variables (OTEL_RESOURCE_ATTRIBUTES set to deployment.environment=development), so they should all behave very similarly for tracing.
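For anyone reproducing this, a quick sanity check of what each service process actually sees in OTEL_RESOURCE_ATTRIBUTES can rule out a configuration slip. The sketch below is only an illustration: the helper names are made up for this thread, it parses the plain comma-separated key=value form without any percent-decoding, and it assumes that deployment.environment is the resource attribute that ends up as service.environment in Elastic APM.

```python
import os


def parse_resource_attributes(raw: str) -> dict:
    """Parse a comma-separated key=value list (the OTEL_RESOURCE_ATTRIBUTES format)."""
    attrs = {}
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        # Skip empty or malformed entries that have no "=" separator.
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def environment_matches(expected: str = "development") -> bool:
    """Return True if this process is configured with the expected deployment.environment."""
    raw = os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")
    return parse_resource_attributes(raw).get("deployment.environment") == expected


if __name__ == "__main__":
    print(parse_resource_attributes(os.environ.get("OTEL_RESOURCE_ATTRIBUTES", "")))
    print("environment ok:", environment_matches())
```

Running this inside each service container would confirm whether the payment and product processes really carry the same attribute as the others.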

Steps to reproduce: This is an instance of the microbs ecommerce application. It might be easier to troubleshoot if I provide direct access to the Elastic Cloud deployment where the APM data resides, because the deployment does not contain sensitive data.

Response from GET https://ELASTICSEARCH_ENDPOINT/.ds-traces-apm/_search?q=(service.name:payment+OR+service.name:product)+AND+NOT+service.environment:development - Observe that there are no spans for the payment or product service in which service.environment is not development.

{
  "took" : 85,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

Response from GET https://KIBANA_ENDPOINT/internal/apm/service-map - Observe that service.environment is null for the payment and product services, which is inconsistent with the results of the prior query.

{
  "elements": [{
    "data": {
      "id": "web-gateway",
      "service.environment": "development",
      "service.name": "web-gateway",
      "agent.name": "opentelemetry/python"
    }
  }, {
    "data": {
      "id": "api-gateway",
      "service.name": "api-gateway",
      "agent.name": "opentelemetry/cpp"
    }
  }, {
    "data": {
      "id": "content",
      "service.environment": "development",
      "service.name": "content",
      "agent.name": "opentelemetry/python"
    }
  }, {
    "data": {
      "span.subtype": "http",
      "span.destination.service.resource": "storage.googleapis.com:443",
      "span.type": "external",
      "id": ">storage.googleapis.com:443",
      "label": "storage.googleapis.com:443"
    }
  }, {
    "data": {
      "id": "checkout",
      "service.environment": "development",
      "service.name": "checkout",
      "agent.name": "opentelemetry/python"
    }
  }, {
    "data": {
      "id": "cart",
      "service.environment": "development",
      "service.name": "cart",
      "agent.name": "opentelemetry/python"
    }
  }, {
    "data": {
      "span.subtype": "redis",
      "span.destination.service.resource": "redis",
      "span.type": "db",
      "id": ">redis",
      "label": "redis"
    }
  }, {
    "data": {
      "service.name": "product",
      "agent.name": "opentelemetry/python",
      "service.environment": null,
      "id": "product"
    }
  }, {
    "data": {
      "service.name": "payment",
      "agent.name": "opentelemetry/python",
      "service.environment": null,
      "id": "payment"
    }
  }, {
    "data": {
      "source": "api-gateway",
      "target": "cart",
      "id": "api-gateway~cart",
      "sourceData": {
        "id": "api-gateway",
        "service.name": "api-gateway",
        "agent.name": "opentelemetry/cpp"
      },
      "targetData": {
        "id": "cart",
        "service.environment": "development",
        "service.name": "cart",
        "agent.name": "opentelemetry/python"
      }
    }
  }, {
    "data": {
      "source": "api-gateway",
      "target": "checkout",
      "id": "api-gateway~checkout",
      "sourceData": {
        "id": "api-gateway",
        "service.name": "api-gateway",
        "agent.name": "opentelemetry/cpp"
      },
      "targetData": {
        "id": "checkout",
        "service.environment": "development",
        "service.name": "checkout",
        "agent.name": "opentelemetry/python"
      },
      "bidirectional": true
    }
  }, {
    "data": {
      "source": "api-gateway",
      "target": "content",
      "id": "api-gateway~content",
      "sourceData": {
        "id": "api-gateway",
        "service.name": "api-gateway",
        "agent.name": "opentelemetry/cpp"
      },
      "targetData": {
        "id": "content",
        "service.environment": "development",
        "service.name": "content",
        "agent.name": "opentelemetry/python"
      }
    }
  }, {
    "data": {
      "source": "cart",
      "target": ">redis",
      "id": "cart~>redis",
      "sourceData": {
        "id": "cart",
        "service.environment": "development",
        "service.name": "cart",
        "agent.name": "opentelemetry/python"
      },
      "targetData": {
        "span.subtype": "redis",
        "span.destination.service.resource": "redis",
        "span.type": "db",
        "id": ">redis",
        "label": "redis"
      }
    }
  }, {
    "data": {
      "source": "checkout",
      "target": "api-gateway",
      "id": "checkout~api-gateway",
      "sourceData": {
        "id": "checkout",
        "service.environment": "development",
        "service.name": "checkout",
        "agent.name": "opentelemetry/python"
      },
      "targetData": {
        "id": "api-gateway",
        "service.name": "api-gateway",
        "agent.name": "opentelemetry/cpp"
      },
      "isInverseEdge": true
    }
  }, {
    "data": {
      "source": "content",
      "target": ">storage.googleapis.com:443",
      "id": "content~>storage.googleapis.com:443",
      "sourceData": {
        "id": "content",
        "service.environment": "development",
        "service.name": "content",
        "agent.name": "opentelemetry/python"
      },
      "targetData": {
        "span.subtype": "http",
        "span.destination.service.resource": "storage.googleapis.com:443",
        "span.type": "external",
        "id": ">storage.googleapis.com:443",
        "label": "storage.googleapis.com:443"
      }
    }
  }, {
    "data": {
      "source": "web-gateway",
      "target": "api-gateway",
      "id": "web-gateway~api-gateway",
      "sourceData": {
        "id": "web-gateway",
        "service.environment": "development",
        "service.name": "web-gateway",
        "agent.name": "opentelemetry/python"
      },
      "targetData": {
        "id": "api-gateway",
        "service.name": "api-gateway",
        "agent.name": "opentelemetry/cpp"
      }
    }
  }]
}
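The response above is long, so a small script can help spot the two symptoms at once: service nodes without an environment, and service nodes that no edge refers to. This is a minimal sketch that assumes only the shape visible in the pasted JSON (an elements list whose data objects are either nodes or edges with source and target); the function name is made up for illustration.

```python
def summarize_service_map(service_map: dict) -> dict:
    """List service nodes lacking service.environment and service nodes with no edges."""
    services = {}
    connected = set()
    for element in service_map.get("elements", []):
        data = element.get("data", {})
        if "source" in data and "target" in data:
            # Edge element: remember both endpoints as connected.
            connected.add(data["source"])
            connected.add(data["target"])
        elif "service.name" in data:
            # Service node (external destinations such as ">redis" have no service.name).
            services[data["id"]] = data
    # .get() returns None both when the key is absent and when it is explicitly null.
    missing_env = sorted(
        node_id for node_id, data in services.items()
        if data.get("service.environment") is None
    )
    orphans = sorted(node_id for node_id in services if node_id not in connected)
    return {"missing_environment": missing_env, "orphans": orphans}
```

Applied to the response above, this should report payment and product as orphans. It should also list api-gateway alongside them as lacking an environment, since that node has no service.environment key at all rather than an explicit null.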

Screenshot 1 of 2 - The service map is missing links among the product and payment services, whose service.environment is set to null in the XHR response with the service map data. Note that service.environment is actually set to development in all of the spans for those two services, which I confirmed by searching in Discover.

service-map

Screenshot 2 of 2 - This trace sample does display links between the services that were unlinked in the service map. This screenshot shows that the payment service is linked to the api-gateway service, but that link doesn't appear in the service map.

trace

davemoore- commented 2 years ago

I was able to fix the symptoms by changing api-gateway from an Nginx service instrumented with opentelemetry/cpp to a Python service instrumented with opentelemetry/python.

I don't think this is ready to be marked as resolved until we can determine why the opentelemetry/cpp instrumentation resulted in an incorrect presentation of data in Elastic APM. Plausibly, the opentelemetry/cpp instrumentation could have omitted data that Elastic APM required, or it could be that Elastic APM is treating the opentelemetry/cpp data differently. I'm inclined to think it's the former, but I'm not certain. I'll need to look for differences in span data produced by opentelemetry/cpp and opentelemetry/python.

Is there guidance on which fields the service map queries to present its graphical view?

elasticmachine commented 2 years ago

Pinging @elastic/apm-ui (Team:apm)

dgieselaar commented 2 years ago

FWICT from the trace waterfall api-gateway connects to payment without an exit span, that is, the parent of the transaction on payment is a transaction on the api-gateway service. This possibly points to an instrumentation gap. We use exit spans (not transactions) to decide what traces should be sampled for discovering connections, which might be why the connection is not showing up.
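To make the "transaction whose parent is a transaction" pattern concrete, here is a rough sketch that scans the events of one trace for cross-service hops with no span in between. The flat event shape (id, parent_id, kind, service) is a simplification invented for this example and is not the actual APM document schema, and the function is not how Kibana performs its sampling.

```python
def find_hops_without_exit_span(events: list) -> list:
    """Return (parent_service, child_service) pairs where a transaction's
    direct parent is a transaction in a different service."""
    by_id = {event["id"]: event for event in events}
    gaps = []
    for event in events:
        if event["kind"] != "transaction":
            continue
        parent = by_id.get(event.get("parent_id"))
        if parent is None:
            # Root transaction, or the parent is not part of this sample.
            continue
        if parent["kind"] == "transaction" and parent["service"] != event["service"]:
            gaps.append((parent["service"], event["service"]))
    return gaps


if __name__ == "__main__":
    trace = [
        {"id": "a1", "parent_id": None, "kind": "transaction", "service": "api-gateway"},
        {"id": "p1", "parent_id": "a1", "kind": "transaction", "service": "payment"},
    ]
    print(find_hops_without_exit_span(trace))
```

A hop reported here would correspond to a connection that appears in the trace waterfall but has no exit span for the service map to pick up.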

As to why service.environment is missing for the product and payment services: we fetch data for all (related) services and show them as orphans in the service map if they don't show up in the traces we've sampled. We don't return anything for service.environment there, so that's expected.

I think this investigation should indeed focus on opentelemetry/cpp (and differences with opentelemetry/python and our own agents). It's likely that opentelemetry/cpp doesn't create the exit spans we need. I'm not intimately familiar with how OTel spans are translated to exit spans on APM Server to be honest. In any case, I don't think this is a Kibana issue. Should we perhaps move it to the APM Server repo (where exit spans are created for OTel)? @dannycroft

dannycroft commented 2 years ago

@dgieselaar Yeah, this doesn't sound like a Kibana issue.

@simitt do you want to move this over to the APM Server repo for further investigation?

cc// @felixbarny

simitt commented 2 years ago

I think this investigation should indeed focus on opentelemetry/cpp (and differences with opentelemetry/python and our own agents). It's likely that opentelemetry/cpp doesn't create the exit spans we need.

@davemoore- would you be able to provide two sample events that are sent to the APM Server, one from opentelemetry/cpp and one from opentelemetry/python? We can then look at the differences and try to identify whether we need to make adaptations in the APM Server code or whether something is indeed missing in opentelemetry/cpp.
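Once two such events are available as JSON, comparing which fields each one carries is a quick first pass. The sketch below flattens nested objects into dotted paths and reports the fields present in only one of the two; the helper names are made up for illustration, and any events fed to it would be the reporter's own exports.

```python
def flatten(doc: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_fields(first: dict, second: dict) -> dict:
    """Report which dotted field paths appear in only one of the two events."""
    first_keys = flatten(first).keys()
    second_keys = flatten(second).keys()
    return {
        "only_in_first": sorted(first_keys - second_keys),
        "only_in_second": sorted(second_keys - first_keys),
    }
```

Calling diff_fields(cpp_event, python_event) on the two samples would list, for example, any span fields that only the opentelemetry/python event carries.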

mholttech commented 7 months ago

I'm experiencing similar behavior when running the OTEL Demo.

image