jparsana opened 6 years ago
In Brave we have a "localRoot" feature (unmerged, on account of not enough popularity) that would allow consistent squashing of intermediate spans. We also have a completion hook that allows things like instrumentation that verbosely "logs" into a span, and lets you delete all of that junk. Storage is almost always a function of size and indexed fields. Another thing to think about is unindexed fields: for example, Amazon specifically distinguishes indexed fields from non-indexed ones, which can help in systems like that. E.g. you wouldn't want to index JSON dumped into your span.
The majority of complaints that have been relayed to me are about metrics. There are a few things that I think we can do:
@rakyll @ahmetb any other ideas?
@odeke-em can you help with implementing this ?
cc @shahprit / @bogdandrutu
Thanks for the tag @jparsana!
Reduce the number of metrics captured by default. Today we capture about twelve and we should reduce this to just latency and errors.
@mtwo I think sizes are very important too, but we can collapse "sent_bytes" and "received_bytes" into one metric and then use a tag, say "type", which will be either "received" or "sent".
We can fold "sent_messages_per_method", "received_messages_per_method", "sent_messages_per_rpc", and "received_messages_per_rpc" into "messages", with two additional tags: "type", which can be either "received" or "sent", and "method", since "per_method" seems like a subset of "per_rpc". Likewise, "completed_rpcs" and "started_rpcs" can be folded into "rpcs" with a tag "state" that will be either "started" or "completed".
This would reduce the number of metrics by ~64%, i.e. 4 metrics instead of 11, and we would enumerate this for both gRPC and HTTP.
For an illustration, please see below this for gRPC server views:
| Before |
|---|
| grpc.io/server/received_messages_per_rpc |
| grpc.io/server/sent_messages_per_rpc |
| grpc.io/server/sent_messages_per_method |
| grpc.io/server/received_messages_per_method |
| grpc.io/server/sent_bytes_per_method |
| grpc.io/server/received_bytes_per_method |
| grpc.io/server/received_bytes_per_rpc |
| grpc.io/server/sent_bytes_per_rpc |
| grpc.io/server/server_latency |
| grpc.io/server/completed_rpcs |
| grpc.io/server/started_rpcs |
| After | Additional/New Tags |
|---|---|
| grpc.io/server/bytes or grpc.io/server/data_transferred | "type": either of ["received", "sent"], plus "grpc_method" |
| grpc.io/server/messages or grpc.io/server/messages_transferred | "type": either of ["received", "sent"], plus "grpc_method", which can be anything * |
| grpc.io/server/server_latency | |
| grpc.io/server/rpcs | "state": either of ["started", "completed"], plus "grpc_method", which can be anything * |
Of course, recording RPC counts with different tags at different states ("started" then "completed") might be non-ideal (i.e. recording 2x on the same metric), so perhaps we should reconsider this suggestion and keep "completed_rpcs" and "started_rpcs" separate?
Please let me know what you think.
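To make the folding concrete, here is a small sketch (my own illustration, not part of the proposal itself) mapping the 11 current gRPC server view names from the "Before" table to the 4 proposed metrics plus distinguishing tags. The metric and tag names come from the tables above; the mapping structure is hypothetical.

```python
# Hypothetical mapping: old view name -> (folded metric, distinguishing tags).
# Names are taken from the Before/After tables; the dict itself is illustrative.
FOLDED = {
    "grpc.io/server/received_messages_per_rpc":    ("grpc.io/server/messages", {"type": "received"}),
    "grpc.io/server/sent_messages_per_rpc":        ("grpc.io/server/messages", {"type": "sent"}),
    "grpc.io/server/sent_messages_per_method":     ("grpc.io/server/messages", {"type": "sent"}),
    "grpc.io/server/received_messages_per_method": ("grpc.io/server/messages", {"type": "received"}),
    "grpc.io/server/sent_bytes_per_method":        ("grpc.io/server/bytes",    {"type": "sent"}),
    "grpc.io/server/received_bytes_per_method":    ("grpc.io/server/bytes",    {"type": "received"}),
    "grpc.io/server/received_bytes_per_rpc":       ("grpc.io/server/bytes",    {"type": "received"}),
    "grpc.io/server/sent_bytes_per_rpc":           ("grpc.io/server/bytes",    {"type": "sent"}),
    "grpc.io/server/server_latency":               ("grpc.io/server/server_latency", {}),
    "grpc.io/server/completed_rpcs":               ("grpc.io/server/rpcs",     {"state": "completed"}),
    "grpc.io/server/started_rpcs":                 ("grpc.io/server/rpcs",     {"state": "started"}),
}

# Distinct metric names after folding: 11 views collapse to 4 metrics (~64% fewer).
after = {metric for metric, _tags in FOLDED.values()}
print(len(FOLDED), "->", len(after))  # prints: 11 -> 4
```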
NB: combining metrics but partitioning them by tags does not reduce the number of unique time series produced, which is often what customers of APM tools are paying for.
Thanks for checking my assumptions @yurishkuro! Indeed, it doesn't reduce the number of unique time series produced (the volume), which, as you rightly say, is the basis for charges. I did a calculation and found that whether there are 10 or 20 metrics, the cost would be the same as long as the volume is the same. I had been thinking of the number of metrics as an extra cost.
Reduce the number of metrics captured by default. Today we capture about twelve and we should reduce this to just latency and errors. The remaining ones (response size, queue length, etc.) can then be optionally enabled.
This is the whole point of the "stats" package. Users who are concerned about the cost of the exported data should configure their own views instead of installing the default recommended views. We could probably offer a "minimum" recommended set of views. The fact that we record all these measures (latency, bytes in/out, etc.) does not affect the cost if nobody installs a view.
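The measure/view split can be sketched with a toy model (this is not the real OpenCensus API, just an illustration of the argument): recording a measurement is effectively free unless a view has been registered for that measure, so keeping the measures while trimming the default views avoids the export cost.

```python
# Toy model of the stats measure/view split: measurements are dropped
# unless a view is installed for the measure, so recording is cost-free
# by itself. Names below mirror the gRPC views discussed above.
class ViewRegistry:
    def __init__(self):
        self.views = {}  # measure name -> recorded values

    def register_view(self, measure_name):
        self.views.setdefault(measure_name, [])

    def record(self, measure_name, value):
        series = self.views.get(measure_name)
        if series is None:
            return  # no view installed: the measurement is discarded
        series.append(value)

reg = ViewRegistry()
reg.register_view("grpc.io/server/server_latency")    # "minimum" view set
reg.record("grpc.io/server/server_latency", 12.5)     # kept and exported
reg.record("grpc.io/server/sent_bytes_per_rpc", 512)  # dropped: no view, no cost
```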
Increase the default metrics aggregation window to one minute
I think this is a good idea and we should make this change. The only problem I have seen in the past is that for example code we want users to see a quick result, which probably requires a smaller exporting interval. We should probably add warnings to all our example code whenever we change a setting just so users can see a quick result.
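A minimal sketch of that policy (the constant and function names here are hypothetical, not an existing API): a one-minute default aggregation window, with an explicitly warned-about short interval reserved for example code.

```python
# Hypothetical defaults illustrating the proposal: one-minute reporting
# in production, with a short interval only for demos/examples.
DEFAULT_REPORTING_INTERVAL_S = 60.0  # proposed new default window

def reporting_interval(demo_mode=False):
    if demo_mode:
        # WARNING (as the example code itself should say): a short interval
        # is only so readers see results quickly; it inflates export volume
        # and should not be copied into production configuration.
        return 1.0
    return DEFAULT_REPORTING_INTERVAL_S
```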
@mtwo Please edit and add specifics of the details from your encounter on this.