elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html
Other
1.22k stars 519 forks source link

Confusing log events for invalid requests #14232

Open up2neck opened 1 day ago

up2neck commented 1 day ago

APM Server log events and correlating trace (with self-instrumentation enabled) contains 2 different HTTP errors:

Whether log event contains 503 error with "request timed out" error:

Log entry ```json { "message": "request timed out", "url.original": "/intake/v2/events", "http.request.body.bytes": 0, "log.level": "error", "transaction.id": "d91bb495a6ab25bb", "http.request.method": "POST", "ecs.version": "1.6.0", "http.response.status_code": 503, "error.message": "request timed out", "log.logger": "request", "@timestamp": "2024-10-02T07:22:04.439Z", "service.name": "apm-server", "user_agent.original": "apm-agent-java/1.45.0 (autocode-app 2.0.0)", "log.origin": { "function": "github.com/elastic/apm-server/internal/beater/api.apmMiddleware.LogMiddleware.func1.1", "file.name": "middleware/log_middleware.go", "file.line": 59 }, "event.duration": 3499790781, "http.request.id": "d91bb495a6ab25bb", "source.address": "000.000.000.000", "trace.id": "d91bb495a6ab25bbdede8859c7cf8c8a" } ```

APM interface shown actual error was:

Image

Trace body ```json { "_index": ".ds-traces-apm-dummy_sub2-2024.09.20-000009", "_id": "urkbTJIBO4jYgZRmiL6k", "_version": 1, "_score": 0, "_source": { "agent": { "name": "go", "version": "2.6.0" }, "process": { "args": [ "apm-server", "run", "-e", "-c", "config/config-secret/apm-server.yml" ], "pid": 1, "title": "apm-server" }, "source": { "nat": { "ip": "00.000.00.000" }, "ip": "000.000.000.000" }, "processor": { "event": "transaction" }, "url": { "path": "/intake/v2/events", "scheme": "https", "domain": "mydomain", "full": "https://mydomain/intake/v2/events" }, "labels": { "project": "dummy-sub2" }, "cloud": { "availability_zone": "europe-west3-a", "instance": { "name": "gke-dummy-sub-elastic-europe-w-generic-c4e5a328-8ipl", "id": "6772592666386103052" }, "provider": "gcp", "project": { "id": "or2-ms-dummy-sub-elastic-t1iylu" }, "region": "europe-west3" }, "edp": { "domain": "mydomain" }, "observer": { "hostname": "apm-server-v2-apm-server-6f6db466bd-22ccb", "type": "apm-server", "version": "8.14.3" }, "trace": { "id": "d91bb495a6ab25bbdede8859c7cf8c8a" }, "@timestamp": "2024-10-02T07:22:00.940Z", "data_stream": { "namespace": "dummy_sub2", "type": "traces", "dataset": "apm" }, "service": { "node": { "name": "apm-server-v2-apm-server-6f6db466bd-22ccb" }, "environment": "sandbox", "name": "apm-server", "runtime": { "name": "gc", "version": "go1.22.7" }, "language": { "name": "go", "version": "go1.22.7" }, "version": "8.14.3" }, "host": { "hostname": "apm-server-v2-apm-server-6f6db466bd-22ccb", "os": { "platform": "linux" }, "name": "apm-server-v2-apm-server-6f6db466bd-22ccb", "architecture": "amd64" }, "client": { "geo": { "region_iso_code": "BE-BRU", "continent_name": "Europe", "city_name": "Brussels", "country_iso_code": "BE", "country_name": "Belgium", "location": { "lon": 4.347, "lat": 50.8534 }, "region_name": "Brussels Capital" }, "ip": "000.000.000.000" }, "http": { "request": { "headers": { "X-Cloud-Trace-Context": [ "61ce7649b7546eb12e48bd1b05dc5acf/7453092485412124311" ], "Cache-Control": [ "no-cache" ], "Accept": [ "*/*" ], "Content-Encoding": [ "deflate" ], "X-Forwarded-Proto": [ "https" ], "User-Agent": [ "apm-agent-java/1.45.0 (autocode-app 2.0.0)" ], "X-Forwarded-For": [ "000.000.000.000,00.00.00.000" ], "Pragma": [ "no-cache" ], "Via": [ "1.1 google" ], "Content-Type": [ "application/x-ndjson" ] }, "method": "POST", "cookies_list": [], "cookies": {} }, "response": { "headers": { "X-Content-Type-Options": [ "nosniff" ], "Connection": [ "Close" ], "Content-Type": [ "application/json" ] }, "status_code": 400 }, "version": "1.1" }, "event": { "agent_id_status": "missing", "ingested": "2024-10-02T07:22:05Z", "success_count": 1, "outcome": "success" }, "transaction": { "result": "HTTP 4xx", "duration": { "us": 3499844 }, "representative_count": 1, "name": "POST /intake/v2/events", "id": "d91bb495a6ab25bb", "span_count": { "dropped": 0, "started": 0 }, "type": "request", "sampled": true }, "user_agent": { "original": "apm-agent-java/1.45.0 (autocode-app 2.0.0)", "name": "Other", "device": { "name": "Other" } }, "span": { "id": "d91bb495a6ab25bb" }, "timestamp": { "us": 1727853720940194 } }, "fields": { "transaction.name.text": [ "POST /intake/v2/events" ], "http.request.headers.Accept": [ "*/*" ], "http.response.headers.Connection": [ "Close" ], "transaction.representative_count": [ 1 ], "user_agent.original.text": [ "apm-agent-java/1.45.0 (autocode-app 2.0.0)" ], "client.geo.country_iso_code": [ "BE" ], "host.hostname": [ "apm-server-v2-apm-server-6f6db466bd-22ccb" ], "process.pid": [ 1 ], "service.language.name": [ "go" ], "transaction.result": [ "HTTP 4xx" ], "cloud.availability_zone": [ "europe-west3-a" ], "process.title.text": [ "apm-server" ], "transaction.id": [ "d91bb495a6ab25bb" ], "http.request.headers.Via": [ "1.1 google" ], "http.request.method": [ "POST" ], "processor.event": [ "transaction" ], "labels.project": [ "dummy-sub2" ], "source.ip": [ "000.000.000.000" ], "agent.name": [ "go" ], "host.name": [ "apm-server-v2-apm-server-6f6db466bd-22ccb" ], "client.geo.country_name": [ "Belgium" ], "event.agent_id_status": [ "missing" ], "http.response.status_code": [ 400 ], "http.version": [ "1.1" ], "event.outcome": [ "success" ], "user_agent.original": [ "apm-agent-java/1.45.0 (autocode-app 2.0.0)" ], "cloud.region": [ "europe-west3" ], "transaction.duration.us": [ 3499844 ], "service.runtime.version": [ "go1.22.7" ], "span.id": [ "d91bb495a6ab25bb" ], "edp.domain": [ "mydomain" ], "client.ip": [ "000.000.000.000" ], "user_agent.name": [ "Other" ], "data_stream.type": [ "traces" ], "host.architecture": [ "amd64" ], "http.request.headers.X-Forwarded-For": [ "000.000.000.000,00.00.00.000" ], "cloud.provider": [ "gcp" ], "timestamp.us": [ 1727853720940194 ], "url.path": [ "/intake/v2/events" ], "observer.type": [ "apm-server" ], "observer.version": [ "8.14.3" ], "agent.version": [ "2.6.0" ], "transaction.name": [ "POST /intake/v2/events" ], "process.title": [ "apm-server" ], "client.geo.continent_name": [ "Europe" ], "http.request.headers.Cache-Control": [ "no-cache" ], "http.request.headers.Content-Encoding": [ "deflate" ], "service.node.name": [ "apm-server-v2-apm-server-6f6db466bd-22ccb" ], "http.request.headers.X-Forwarded-Proto": [ "https" ], "url.scheme": [ "https" ], "source.nat.ip": [ "00.000.00.000" ], "transaction.sampled": [ true ], "cloud.instance.id": [ "6772592666386103052" ], "trace.id": [ "d91bb495a6ab25bbdede8859c7cf8c8a" ], "client.geo.region_name": [ "Brussels Capital" ], "client.geo.location": [ { "coordinates": [ 4.347, 50.8534 ], "type": "Point" } ], "event.success_count": [ 1 ], "transaction.span_count.dropped": [ 0 ], "http.request.headers.Content-Type": [ "application/x-ndjson" ], "url.full": [ "https://mydomain/intake/v2/events" ], "http.request.headers.User-Agent": [ "apm-agent-java/1.45.0 (autocode-app 2.0.0)" ], "service.environment": [ "sandbox" ], "client.geo.city_name": [ "Brussels" ], "client.geo.region_iso_code": [ "BE-BRU" ], "service.name": [ "apm-server" ], "data_stream.namespace": [ "dummy_sub2" ], "service.runtime.name": [ "gc" ], "process.args": [ "apm-server", "run", "-e", "-c", "config/config-secret/apm-server.yml" ], "observer.hostname": [ "apm-server-v2-apm-server-6f6db466bd-22ccb" ], "http.request.headers.X-Cloud-Trace-Context": [ "61ce7649b7546eb12e48bd1b05dc5acf/7453092485412124311" ], "url.full.text": [ "https://mydomain/intake/v2/events" ], "transaction.type": [ "request" ], "transaction.span_count.started": [ 0 ], "event.ingested": [ "2024-10-02T07:22:05.000Z" ], "http.request.headers.Pragma": [ "no-cache" ], "@timestamp": [ "2024-10-02T07:22:00.940Z" ], "service.version": [ "8.14.3" ], "host.os.platform": [ "linux" ], "data_stream.dataset": [ "apm" ], "http.response.headers.Content-Type": [ "application/json" ], "http.response.headers.X-Content-Type-Options": [ "nosniff" ], "service.language.version": [ "go1.22.7" ], "url.domain": [ "mydomain" ], "user_agent.device.name": [ "Other" ], "cloud.instance.name": [ "gke-dummy-sub-elastic-europe-w-generic-c4e5a328-8ipl" ], "cloud.project.id": [ "or2-ms-dummy-sub-elastic-t1iylu" ] } } ```
inge4pres commented 1 day ago

hello @up2neck can you please give a bit more informations on the environment? I see from the trace you're running APM Server v 8.14.3, is Elasticsearch deployed at the same version? The trace body and log body you posted are from documents stored in Elasticsearch?

up2neck commented 1 day ago

@inge4pres

Log entry was taken from APM Server logs (stdout, collected by GCP Loggins if it matters) Trace was taken from Elasticsearch itself.

Our team, has 2 environments (both facing the issue): GCP Load balancer -> APM Server (GKE, ECK) -> Elasticsearch cluster (GKE, ECK) GCP Load balancer -> APM Server (GKE, ECK) -> Elasticsearch cluster (Elastic Cloud)

My thoughts are some middleware, probably, rewrites the exact error, which leads to incorrect log entry.

inge4pres commented 22 hours ago

Thanks for the additional info 🙏🏼

The reason I was asking is that you may have transforms or ingest pipelines in Elasticsearch that remaps fields, in this case the http.response.status_code field. Do you know if there is special handling of this field?

My thoughts are some middleware, probably, rewrites the exact error, which leads to incorrect log entry.

I am not aware of anything inside apm-server binary modifying the value of this field, hence my confusion as to why it is different in logs than what's stored in ES as trace. I'll verify this is bug and report back.

up2neck commented 6 hours ago

@inge4pres

There are few additional steps on the top of default "APM" integration-provided ingest pipelines, we have, but all of them don't interact with http.response.status_code field, keeps it original.