elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[APM] Service map does not work after upgrade from 8.14.3 to 8.15.1 #192380

Closed bvader closed 1 month ago

bvader commented 1 month ago

Kibana version: Upgrade from 8.14.3 to 8.15.1

Elasticsearch version: Upgrade from 8.14.3 to 8.15.1

Server OS version: Elastic Cloud GCP

Browser version: Latest Chrome

Browser OS version: Latest Mac

Original install method (e.g. download page, yum, from source, etc.): Elastic Cloud

Temporary workaround: see the workaround comment further down in this issue.

Describe the bug: The cluster receives telemetry from the OTEL Boutique Demo. All APM features were working fine in 8.14.3. After upgrading via the Cloud Console to 8.15.1, the Service Map fails with the error below. Other APM features such as Services, Traces, Transactions, Correlated Logs, and Dependencies seem to be working. A cursory look at the Elasticsearch and Kibana logs shows no errors.

Error while fetching resource
Error
search_phase_execution_exception Caused by: script_exception: runtime error (500)
URL
https://k8s-demo-stephenb.kb.us-west1.gcp.cloud.es.io:9243/internal/apm/service-map?start=2024-09-08T02%3A00%3A00.000Z&end=2024-09-09T02%3A34%3A35.535Z&environment=ENVIRONMENT_ALL&serviceGroup=&kuery=

Image

Other APM features such as Services, Traces, Transactions, Correlated Logs, and Dependencies seem to be working. Image

A couple of the individual focused service maps work. Image

Errors in browser console (if relevant):

Request URL: https://k8s-demo-stephenb.kb.us-west1.gcp.cloud.es.io:9243/internal/apm/service-map?start=2024-09-09T02%3A26%3A39.773Z&end=2024-09-09T02%3A41%3A39.773Z&environment=ENVIRONMENT_ALL&serviceGroup=&kuery=
Request Method: GET
Status Code: 500 Internal Server Error
Remote Address: 34.83.110.184:9243
Referrer Policy: strict-origin-when-cross-origin

Response Headers
cache-control: private, no-cache, no-store, must-revalidate
content-length: 185
content-security-policy: script-src 'report-sample' 'self'; worker-src 'report-sample' 'self' blob:; style-src 'report-sample' 'self' 'unsafe-inline'; report-to violations-endpoint
content-security-policy-report-only: form-action 'report-sample' 'self'; report-to violations-endpoint
content-type: application/json; charset=utf-8
cross-origin-opener-policy: same-origin
date: Mon, 09 Sep 2024 02:47:10 GMT
kbn-license-sig: xxx
kbn-name: instance-0000000012
permissions-policy: camera=(), display-capture=(), fullscreen=(self), geolocation=(), microphone=(), web-share=()
referrer-policy: strict-origin-when-cross-origin
reporting-endpoints: violations-endpoint="https://dep.kb.us-west1.gcp.cloud.es.io:9243/internal/security/analytics/_record_violations"
set-cookie: sid=xxx; Secure; HttpOnly; Path=/

Provide logs and/or server output (if relevant):

Any additional context:

crespocarlos commented 1 month ago

Running the Service Map query in Dev Tools returns this error:

{
      "type": "script_exception",
      "reason": "runtime error",
      "script_stack": [
        """if (parent['span.destination.service.resource'] != null
                    && !parent['span.destination.service.resource'].equals("")
                    && (!parent['service.name'].equals(event['service.name'])
                      || !parent['service.environment'].equals(event['service.environment'])
                    )
                  ) {
                    def """,
        "                                                                                                                                                                                                                                                                                         ^---- HERE"
      ],
      "script": " ...",
      "lang": "painless",
      "position": {
        "offset": 2936,
        "start": 2655,
        "end": 3029
      },
      "caused_by": {
        "type": "null_pointer_exception",
        "reason": "cannot access method/field [equals] from a null def reference"
      }
}

The problem happens because service.environment is null, which breaks the scripted_metric aggregation: the script calls equals on a null value.

This was confirmed by counting the unique service.environment values in the traces used by the query:

{
  "took": 20,
  "timed_out": false,
  "_shards": {
    "total": 10,
    "successful": 10,
    "skipped": 9,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 545,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "spanDestinationServiceResources": {
      "value": 16
    },
    "serviceEnvironments": {
      "value": 0
    },
    "serviceNames": {
      "value": 16
    }
  }
}
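The check above can be reproduced with one cardinality aggregation per field. Below is a minimal sketch in Python that builds such a request body. The aggregation names are taken from the response above; the helper name `build_cardinality_check`, the `size: 0` setting, and the pairing of each aggregation name with a field are assumptions, since the original request is not shown in this thread.

```python
import json

# Aggregation names come from the response above. Which field each one
# counts is an assumption, because the original request body is not shown.
FIELDS_BY_AGG = {
    "spanDestinationServiceResources": "span.destination.service.resource",
    "serviceEnvironments": "service.environment",
    "serviceNames": "service.name",
}


def build_cardinality_check(fields_by_agg=FIELDS_BY_AGG):
    """Build a search body that counts distinct values per field (hypothetical helper)."""
    return {
        "size": 0,  # only the aggregations are needed, not the hits
        "aggs": {
            name: {"cardinality": {"field": field}}
            for name, field in fields_by_agg.items()
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_cardinality_check(), indent=2))
```

A serviceEnvironments value of 0 alongside a non-zero hit count indicates that none of the matched documents carry service.environment.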

What we need to do is check that parent["service.environment"] != null before calling equals in the script.
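To illustrate the guard, here is a small Python model of the condition from the script_stack above. It is only a sketch, not the actual fix: `crosses_service_boundary` is a hypothetical name, plain dicts stand in for the Painless maps, and treating a missing parent environment as "no environment difference" is an assumption about the intended behavior.

```python
def crosses_service_boundary(parent, event):
    """Model of the script condition with a null guard on service.environment.

    `parent` and `event` are plain dicts standing in for the Painless maps;
    a missing key plays the role of a null field.
    """
    resource = parent.get("span.destination.service.resource")
    if resource is None or resource == "":
        return False

    # A different service name is enough to count as a boundary crossing.
    if parent.get("service.name") != event.get("service.name"):
        return True

    parent_env = parent.get("service.environment")
    event_env = event.get("service.environment")
    # Guard: only compare environments when the parent actually has one,
    # mirroring a `parent['service.environment'] != null` check.
    return parent_env is not None and parent_env != event_env
```

With this guard, documents that lack service.environment no longer raise an error; they are compared on service.name alone.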

[!NOTE] We must replicate the change in the serverless scripted metrics allow list config

elasticmachine commented 1 month ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

crespocarlos commented 1 month ago

cc @roshan-elastic @smith

bvader commented 1 month ago

Workaround:

You can add these two pipelines, which set service.environment: unknown on every document that is missing the field; you can of course use any value you prefer. This should fix the service map issue until a fix is released.

PUT _ingest/pipeline/traces-apm@custom
{
  "processors": [
    {
      "set": {
        "if": "ctx.service?.environment == null",
        "field": "service.environment",
        "value": "unknown"
      }
    }
  ]
}

PUT _ingest/pipeline/logs-apm.app@custom
{
  "processors": [
    {
      "set": {
        "if": "ctx.service?.environment == null",
        "field": "service.environment",
        "value": "unknown"
      }
    }
  ]
}
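For readers who want to see what the conditional set processor does to a document, here is a rough Python emulation. It is a sketch under stated assumptions: `apply_default_environment` is a hypothetical helper, documents are modeled as nested dicts, and it only mirrors the `ctx.service?.environment == null` condition, not the full ingest pipeline behavior.

```python
import copy


def apply_default_environment(doc, default="unknown"):
    """Emulate the conditional `set` processor on a nested-dict document."""
    result = copy.deepcopy(doc)  # leave the caller's document untouched
    service = result.get("service")
    if service is None:
        # `ctx.service?.environment` is null when `service` itself is absent.
        service = result["service"] = {}
    if service.get("environment") is None:
        service["environment"] = default
    return result


if __name__ == "__main__":
    print(apply_default_environment({"service": {"name": "adservice"}}))
```

Documents that already carry an environment pass through unchanged, so the pipelines only affect events that would otherwise have a null service.environment.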

dgieselaar commented 1 month ago

@crespocarlos is this a mapping issue? I have never seen this break this way, and we do not require service.environment to be defined in the document (in fact this is why we have the ENVIRONMENT_NOT_DEFINED constant). Is service.environment missing from the mappings? If so this is likely a bug in APM Server.

crespocarlos commented 1 month ago

From what I could see, it was present in the mapping:

{
  ".ds-metrics-apm.app.adservice-default-2024.05.29-000030": {
    "mappings": {
      "service.environment": {
        "full_name": "service.environment",
        "mapping": {
          "environment": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    }
  },
  "partial-.ds-traces-apm-default-2024.08.08-000161": {
    "mappings": {
      "service.environment": {
        "full_name": "service.environment",
        "mapping": {
          "environment": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    }
  }
}

APM in general works fine with empty service.environment, but the service map query doesn't have any safeguards against empty service.environment.

crespocarlos commented 1 month ago

BTW, the error most likely started happening after the upgrade because the comparison now uses the equals method. We could revisit that change as part of this fix.

dgieselaar commented 1 month ago

@crespocarlos ahh I missed the fact that we changed the scripted metric agg like that. the mappings you list are from a metrics index and a frozen traces index however.

dgieselaar commented 1 month ago

it's also present in the index that actually has data for this query so you are right:

{
  ".ds-traces-apm-default-2024.09.08-000202": {
    "mappings": {
      "service.environment": {
        "full_name": "service.environment",
        "mapping": {
          "environment": {
            "type": "keyword",
            "ignore_above": 1024
          }
        }
      }
    }
  }
}