Open ofek opened 6 years ago
I'd like to broaden this issue a bit since a collision has happened before here. In addition to correcting the collision above, we should add ASSERT
s to ensure they don't happen anywhere else. Also, because the tag system is configurable and we want to maintain its flexibility, maybe it should be possible to disable these ASSERT
s as well, but I'm less sure if that makes sense. Thoughts?
cc @mattklein123 @jmarantz @htuch
@ofek I'm confused how there is a collision here. I don't think there is. Can you expand?
@mrice32 sure we can do collision detection somehow, though TBH I'm not sure exactly how since the stats system is built such that multiple code sites can access the same stat. Can you open a fresh issue for discussing that?
@mattklein123 The tag-extracted metric names are the same, so there is no way to discern the meaning unless meaning is influenced by the number of tags, which is not right.
@mattklein123 See #2687.
@ofek yup OK sorry didn't realize this was a tag extraction collision. I will defer to @mrice32 and @taiki45 on how to resolve.
@jmarantz I think you were asking whether we should change the names of those stats or modify the extraction process to fix this collision instance. tl;dr Without changing the names of the stats themselves, I'm not sure how to fix this instance.
I think it will be difficult to use the same sort of regex fix that we used in #2141 here because that took advantage of Envoy-defined structure in the tag value. The tags here are user defined, so we can't assume any structure for them. The current system doesn't allow tag extraction to add anything to the name - only remove, which limits our mitigation options here.
So I still don't really understand the bounds of the solution space.
What if the tag extraction process didn't merely remove the tag values, but replaced them with the values with the names of the extracted tags? Of course that would be a stats_impl change that would require some effort, but I don't know who is dependent on the current exact syntax of the tag-extracted names.
@jmarantz That's an interesting idea. I have no problem with something like that, although IMO, it makes the names a little less intuitive to read if they're being directly ingested by other stats systems. I'll pull in a few folks who have worked on integrating with other stats systems to hear their thoughts: @srvaroa @taiki45 @JonathanO @lita. Feel free to pull in others who may have opinions here.
@jmarantz @mrice32 Can you please give an example of what that might look like? Also fyi, @DataDog will (using the /stats
endpoint) soon depend on tag-extracted names being unique metric names.
@ofek An example of @jmarantz's proposed solution is that http.my_custom_prefix.fault.delays_injected
resolves to http.envoy.http_conn_manager_prefix.fault.delays_injected
as its tag extracted name. This means that collisions will essentially never occur.
Also, both solutions will affect user visible stats outputs, so I'm curious what Envoy's change policy is with respect to stats? Should we go through a deprecation cycle where we have the old name and new name coexist for a version?
@mrice32 @jmarantz Using the /stats
endpoint, how could one derive my_custom_prefix
from envoy.http_conn_manager_prefix
? If you can't, then the output would be more or less useless to consumers of the stats API.
@ofek http.envoy.http_conn_manager_prefix.fault.delays_injected
would be the new tag extracted name (this example would resolve to http.fault.delays_injected
today). The vector of tags and tag names would still accompany the tag extracted name as they do today, so envoy.http_conn_manager_prefix
would be listed as a tag in that vector with a tag_value of my_custom_prefix
.
Ah, now I understand. Yes that envoy
prefix indication of the tag would be great.
It would be really, really useful if the json output had a format like:
{
"name": "http.fault.delays_injected",
"value": 0,
"tags": {
"stat_prefix": "my_custom_prefix"
}
}
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Are collisions now impossible?
Yeah, I don't think this has changed. Reopened and tagged "help wanted". @jmarantz is there any suggested resolution on your side?
Description:
http.<stat_prefix>.fault.delays_injected
andhttp.<stat_prefix>.fault.<downstream-cluster>.delays_injected
both resolve tohttp.fault.delays_injected
http.<stat_prefix>.fault.aborts_injected
andhttp.<stat_prefix>.fault.<downstream-cluster>.aborts_injected
both resolve tohttp.fault.aborts_injected
https://www.envoyproxy.io/docs/envoy/latest/configuration/http_filters/fault_filter#statistics
edit:
I found one more https://www.envoyproxy.io/docs/envoy/latest/configuration/cluster_manager/cluster_stats.html#alternate-tree-dynamic-http-statistics
Those resolve to the same as https://www.envoyproxy.io/docs/envoy/latest/configuration/cluster_manager/cluster_stats.html#dynamic-http-statistics