elastic / apm

Elastic Application Performance Monitoring - resources and general issue tracking for Elastic APM.

https://www.elastic.co/apm

Apache License 2.0

384 stars 114 forks source link

Spec 'error.exception.type' for an Elasticsearch client error to include responseBody.error.type #653

Closed trentm closed 2 years ago

trentm commented 2 years ago

See https://github.com/elastic/apm-agent-nodejs/issues/2770 for earlier discussion. Currently when an Elasticsearch client returns an API error -- i.e. an error from Elasticsearch itself -- the captured APM error object (at least for the Node.js agent) has a generic error.exception.type == "ResponseError" and a error.exception.stacktrace that is identical for different types of Elasticsearch API errors. The result is that all ES errors (e.g. resource_not_found_exception, illegal_argument_exception) are all grouped together in the APM app (see the APM server logic for setting error.grouping_key).

The solution discussed above and implemented in the Node.js APM agent is to set error.exception.type to $exceptionType (${esResponseBody.error.type}) for ES API errors.

checklist

May the instrumentation collect sensitive information, such as secrets or PII (ex. in headers)?
- [ ] Yes
- [ ] Add a section to the spec how agents should apply sanitization (such as sanitize_field_names)
- [x] No
- [x] Why? Just adding an ES API code.
- [ ] n/a
[x] Create PR as draft (I messed up the process and created it as not-draft initially)
[x] Approval by at least one other agent
[x] Mark as Ready for Review (automatically requests reviews from all agents and PM via CODEOWNERS)
- Remove PM from reviewers if impact on product is negligible
- Remove agents from reviewers if the change is not relevant for them
[x] Approved by at least 2 agents + PM (if relevant)
[x] Merge after 7 days passed without objections
[x] Create implementation issues through the meta issue template: https://github.com/elastic/apm/issues/671

apmmachine commented 2 years ago

:green_heart: Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS

Expand to view the summary

#### Build stats * Start Time: 2022-08-12T05:24:00.809+0000 * Duration: 3 min 0 sec

trentm commented 2 years ago

I failed on this and created it as not draft, initially. Sorry for the noise.

eyalkoren commented 2 years ago

I investigated a bit how we can get that on the Java agent, looks like it won't be straightforward unfortunately. Within the new Java API client, the response error type is readily available through co.elastic.clients.elasticsearch._types.ElasticsearchException.response().error().type(), however, our entire instrumentation relies on the low level client (thus supports the Java REST client as well), where we have access the response body, but we would have to parse it into JSON.

Here are some options on getting the error type within the low level client:

The raw response body is available through org.elasticsearch.client.ResponseException.getResponse().getEntity().getContent() as ByteArrayInputStream, which we will then have to parse into String (e.g. something like org.apache.http.util.EntityUtils#toString()) and then make into JSON
org.elasticsearch.client.ResponseException.getMessage() contains the raw body as a string, but not only, so it looks something like: method [DELETE], host [http://localhost:51016], URI [/my-second-index?master_timeout=30s&ignore_unavailable=false&expand_wildcards=open%2Cclosed&allow_no_indices=true&ignore_throttled=false&timeout=30s], status line [HTTP/1.1 404 Not Found] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [my-second-index]","resource.type":"index_or_alias","resource.id":"my-second-index","index_uuid":"_na_","index":"my-second-index"}],"type":"index_not_found_exception","reason":"no such index [my-second-index]","resource.type":"index_or_alias","resource.id":"my-second-index","index_uuid":"_na_","index":"my-second-index"},"status":404}, from which we can parse the value through a pattern, such as: ..."type":"<VALUE>"...

Alternatives:

Instrument the new Java API client only for this purpose (seems like an overkill)
Rely on something else available in the low level client, for example: org.elasticsearch.client.ResponseException.getResponse().getStatusLine().getReasonPhrase(), for which the value corresponding the above error message would give: Not Found

felixbarny commented 2 years ago

I'd say parsing the JSON in the agent would be a no-go. The only two options I see are instrumenting the high-level client or not implementing this spec in the Java agent.

eyalkoren commented 2 years ago

I'd say parsing the JSON in the agent would be a no-go

I agree. However, extracting from the message String what comes after "type":" is reasonable I think both in terms of performance and accuracy.

The only two options I see are instrumenting the high-level client...

I don't agree 🙂 . I don't think that's an option we should consider. It means a plugin only for this purpose.

... or not implementing this spec in the Java agent

Definitely not a big deal. However, I do think that if you change the spec to say we append anything that is error-type-specific (rather than saying it needs to be error.type coming from response body) AND consistent to the error object type, then using ResponseException.getResponse().getStatusLine().getReasonPhrase() is a valid option.

trentm commented 2 years ago

However, I do think that if you change the spec to say we append anything that is error-type-specific (rather than saying it needs to be error.type coming from response body) AND consistent to the error object type, then using ResponseException.getResponse().getStatusLine().getReasonPhrase() is a valid option.

That'll be the set of generic HTTP status code reason phrases, IIUC. That would be less useful to the user than the error.type, which IIUC, maps mostly to the set of ElasticsearchException subclasses in Elasticsearch. This very unscientific grep shows a lot of grouping of various error.type values into fewer HTTP response status codes:

[11:03:50 trentm@pink:~/el/elasticsearch (git:master)]
% rg 'return RestStatus\.' | cut -d':' -f2 | awk '{$1=$1;print}' | sort | uniq -c | sort -n
   1 return RestStatus.MULTI_STATUS;
   1 return RestStatus.fromCode(esse.getResponse().getStatusLine().getStatusCode());
   1 return RestStatus.status(
   1 return RestStatus.status(shardStats.successfulShards, shardStats.totalShards, shardStats.shardFailures);
   1 return RestStatus.status(successfulShards, totalShards, shardFailures);
   1 return RestStatus.valueOf(in.readString());
   3 return RestStatus.CONFLICT;
   3 return RestStatus.TOO_MANY_REQUESTS;
   4 return RestStatus.ACCEPTED;
   4 return RestStatus.OK.getStatus() == response.getStatusLine().getStatusCode();
   8 return RestStatus.INTERNAL_SERVER_ERROR;
   8 return RestStatus.SERVICE_UNAVAILABLE;
  10 return RestStatus.NOT_FOUND;
  22 return RestStatus.OK;
  31 return RestStatus.BAD_REQUEST;

I don't have a sense of the Java agent perf impact, but I'd tend to prefer @eyalkoren's extracting from the message String what comes after "type":" idea.

eyalkoren commented 2 years ago

That would be less useful to the user than the error.type

Isn't this more for the purpose of increasing error type cardinality, rather than making it more user-readable?

trentm commented 2 years ago

Isn't this more for the purpose of increasing error type cardinality, rather than making it more user-readable?

Certainly separating error groups based on HTTP status code would increase the cardinality some. However, Elasticsearch necessarily lumps a number of different error types together. For example, all of these roughly 20-30 Elasticsearch exceptions are lumped together into RestStatus.BAD_REQUEST, so would be in the same error group:

[~/el/elasticsearch (git:master)]% rg 'return RestStatus\.BAD_REQUEST'  | cut -d: -f1 | xargs -n1 basename | cut -d. -f1
ParsingException
FoldingException
PlanningException
VerificationException
ParsingException
MappingException
ParsingException
PlanningException
VerificationException
FeatureNotEnabledException
ResourceAlreadyExistsException
InvalidIndexTemplateException
IndexClosedException
InvalidIndexNameException
InvalidTypeNameException
InvalidAliasNameException
ExceptionsHelper
ExceptionsHelper
ParsingException
SearchParseException
QueryShardException
DocumentSourceMissingException
StrictDynamicMappingException
MapperParsingException
RoutingMissingException
ScriptException
SnapshotInProgressException
TcpTransport
InvalidSnapshotNameException
ElasticsearchParseException
TaskCancelledException

The original issue description was "Different Elasticsearch errors end up with identical error.grouping_key" and showed two error examples that both had a 404 status code, FWIW. Not that the Java implementation needs to deal with that particular case, however.

eyalkoren commented 2 years ago

Makes sense, thanks for explaining. Either way, I'd be more comfortable with doing a lightweight dumb search for "type":", supporting most but not all use cases, if we use MAY instead of SHOULD. WDYT?

BTW, I had a discussion with the clients team and it seems they are interested in embedding custom APM API usage, at least for the Java client, which should provide the most accurate and always up-to-date instrumentation. I proposed OTel of course, but we will probably need to extend our OTel attribute mappings in order to support something like that, for example through io.opentelemetry.api.trace.Span#recordException(Throwable, Attributes).

trentm commented 2 years ago

if we use MAY instead of SHOULD. WDYT?

I am definitely fine with that.

BTW, I had a discussion with the clients team and it seems they are interested in embedding custom APM API usage, at least for the Java client...

Oh, interesting. The JS client has "observability" or "diagnostic" events that are meant to assist with instrumentation: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/observability.html Unfortunately, the Node.js APM agent stopped using those a while back because it wasn't possible to get the async-context handling correct with those. We are back to monkey-patching the Elasticsearch client.

My understanding is that the Node community will likely not directly add depedencies on the OTel API for embedding custom instrumentation -- mainly because the @opentelemetry/api package is too large. Instead, Node.js core has a newish, experimental, lightweight thing called diagnostic_channels that modules can use. The idea is that APM agents can do instrumentation using those channels provided by modules.

IIUC, OpenTelemetry is currently extending its Context API to support being able to handle async-context properly using diagnostic_channels: https://github.com/open-telemetry/opentelemetry-js-contrib/issues/1021#issuecomment-1126891080