Closed trentm closed 2 years ago
I failed on this and created it as not draft, initially. Sorry for the noise.
I investigated a bit how we can get that on the Java agent, looks like it won't be straightforward unfortunately.
Within the new Java API client, the response error type is readily available through co.elastic.clients.elasticsearch._types.ElasticsearchException.response().error().type()
, however, our entire instrumentation relies on the low level client (thus supports the Java REST client as well), where we have access the response body, but we would have to parse it into JSON.
Here are some options on getting the error type within the low level client:
org.elasticsearch.client.ResponseException.getResponse().getEntity().getContent()
as ByteArrayInputStream
, which we will then have to parse into String (e.g. something like org.apache.http.util.EntityUtils#toString()
) and then make into JSONorg.elasticsearch.client.ResponseException.getMessage()
contains the raw body as a string, but not only, so it looks something like: method [DELETE], host [http://localhost:51016], URI [/my-second-index?master_timeout=30s&ignore_unavailable=false&expand_wildcards=open%2Cclosed&allow_no_indices=true&ignore_throttled=false&timeout=30s], status line [HTTP/1.1 404 Not Found] {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [my-second-index]","resource.type":"index_or_alias","resource.id":"my-second-index","index_uuid":"_na_","index":"my-second-index"}],"type":"index_not_found_exception","reason":"no such index [my-second-index]","resource.type":"index_or_alias","resource.id":"my-second-index","index_uuid":"_na_","index":"my-second-index"},"status":404}
, from which we can parse the value through a pattern, such as: ..."type":"<VALUE>"...
Alternatives:
org.elasticsearch.client.ResponseException.getResponse().getStatusLine().getReasonPhrase()
, for which the value corresponding the above error message would give: Not Found
I'd say parsing the JSON in the agent would be a no-go. The only two options I see are instrumenting the high-level client or not implementing this spec in the Java agent.
I'd say parsing the JSON in the agent would be a no-go
I agree. However, extracting from the message String what comes after "type":"
is reasonable I think both in terms of performance and accuracy.
The only two options I see are instrumenting the high-level client...
I don't agree 🙂 . I don't think that's an option we should consider. It means a plugin only for this purpose.
... or not implementing this spec in the Java agent
Definitely not a big deal. However, I do think that if you change the spec to say we append anything that is error-type-specific (rather than saying it needs to be error.type
coming from response body) AND consistent to the error object type, then using ResponseException.getResponse().getStatusLine().getReasonPhrase()
is a valid option.
However, I do think that if you change the spec to say we append anything that is error-type-specific (rather than saying it needs to be
error.type
coming from response body) AND consistent to the error object type, then usingResponseException.getResponse().getStatusLine().getReasonPhrase()
is a valid option.
That'll be the set of generic HTTP status code reason phrases, IIUC. That would be less useful to the user than the error.type
, which IIUC, maps mostly to the set of ElasticsearchException subclasses in Elasticsearch. This very unscientific grep shows a lot of grouping of various error.type
values into fewer HTTP response status codes:
[11:03:50 trentm@pink:~/el/elasticsearch (git:master)]
% rg 'return RestStatus\.' | cut -d':' -f2 | awk '{$1=$1;print}' | sort | uniq -c | sort -n
1 return RestStatus.MULTI_STATUS;
1 return RestStatus.fromCode(esse.getResponse().getStatusLine().getStatusCode());
1 return RestStatus.status(
1 return RestStatus.status(shardStats.successfulShards, shardStats.totalShards, shardStats.shardFailures);
1 return RestStatus.status(successfulShards, totalShards, shardFailures);
1 return RestStatus.valueOf(in.readString());
3 return RestStatus.CONFLICT;
3 return RestStatus.TOO_MANY_REQUESTS;
4 return RestStatus.ACCEPTED;
4 return RestStatus.OK.getStatus() == response.getStatusLine().getStatusCode();
8 return RestStatus.INTERNAL_SERVER_ERROR;
8 return RestStatus.SERVICE_UNAVAILABLE;
10 return RestStatus.NOT_FOUND;
22 return RestStatus.OK;
31 return RestStatus.BAD_REQUEST;
I don't have a sense of the Java agent perf impact, but I'd tend to prefer @eyalkoren's extracting from the message String what comes after "type":"
idea.
That would be less useful to the user than the error.type
Isn't this more for the purpose of increasing error type cardinality, rather than making it more user-readable?
Isn't this more for the purpose of increasing error type cardinality, rather than making it more user-readable?
Certainly separating error groups based on HTTP status code would increase the cardinality some. However, Elasticsearch necessarily lumps a number of different error types together. For example, all of these roughly 20-30 Elasticsearch exceptions are lumped together into RestStatus.BAD_REQUEST
, so would be in the same error group:
[~/el/elasticsearch (git:master)]% rg 'return RestStatus\.BAD_REQUEST' | cut -d: -f1 | xargs -n1 basename | cut -d. -f1
ParsingException
FoldingException
PlanningException
VerificationException
ParsingException
MappingException
ParsingException
PlanningException
VerificationException
FeatureNotEnabledException
ResourceAlreadyExistsException
InvalidIndexTemplateException
IndexClosedException
InvalidIndexNameException
InvalidTypeNameException
InvalidAliasNameException
ExceptionsHelper
ExceptionsHelper
ParsingException
SearchParseException
QueryShardException
DocumentSourceMissingException
StrictDynamicMappingException
MapperParsingException
RoutingMissingException
ScriptException
SnapshotInProgressException
TcpTransport
InvalidSnapshotNameException
ElasticsearchParseException
TaskCancelledException
The original issue description was "Different Elasticsearch errors end up with identical error.grouping_key" and showed two error examples that both had a 404
status code, FWIW. Not that the Java implementation needs to deal with that particular case, however.
Makes sense, thanks for explaining.
Either way, I'd be more comfortable with doing a lightweight dumb search for "type":"
, supporting most but not all use cases, if we use MAY
instead of SHOULD
. WDYT?
BTW, I had a discussion with the clients team and it seems they are interested in embedding custom APM API usage, at least for the Java client, which should provide the most accurate and always up-to-date instrumentation. I proposed OTel of course, but we will probably need to extend our OTel attribute mappings in order to support something like that, for example through io.opentelemetry.api.trace.Span#recordException(Throwable, Attributes)
.
if we use MAY instead of SHOULD. WDYT?
I am definitely fine with that.
BTW, I had a discussion with the clients team and it seems they are interested in embedding custom APM API usage, at least for the Java client...
Oh, interesting. The JS client has "observability" or "diagnostic" events that are meant to assist with instrumentation: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/observability.html Unfortunately, the Node.js APM agent stopped using those a while back because it wasn't possible to get the async-context handling correct with those. We are back to monkey-patching the Elasticsearch client.
My understanding is that the Node community will likely not directly add depedencies on the OTel API for embedding custom instrumentation -- mainly because the @opentelemetry/api
package is too large. Instead, Node.js core has a newish, experimental, lightweight thing called diagnostic_channels
that modules can use. The idea is that APM agents can do instrumentation using those channels provided by modules.
IIUC, OpenTelemetry is currently extending its Context API to support being able to handle async-context properly using diagnostic_channels
: https://github.com/open-telemetry/opentelemetry-js-contrib/issues/1021#issuecomment-1126891080
See https://github.com/elastic/apm-agent-nodejs/issues/2770 for earlier discussion. Currently when an Elasticsearch client returns an API error -- i.e. an error from Elasticsearch itself -- the captured APM error object (at least for the Node.js agent) has a generic
error.exception.type == "ResponseError"
and aerror.exception.stacktrace
that is identical for different types of Elasticsearch API errors. The result is that all ES errors (e.g.resource_not_found_exception
,illegal_argument_exception
) are all grouped together in the APM app (see the APM server logic for settingerror.grouping_key
).The solution discussed above and implemented in the Node.js APM agent is to set
error.exception.type
to$exceptionType (${esResponseBody.error.type})
for ES API errors.checklist
sanitize_field_names
)CODEOWNERS
)