census-instrumentation / opencensus-specs

Apache License 2.0
188 stars 50 forks source link

Brainstorm ways to reduce cost on various backends for http and grpc #211

Open jparsana opened 6 years ago

jparsana commented 6 years ago

@mtwo Please edit and add specifics of the details from your encounter on this.

codefromthecrypt commented 6 years ago

in brave we have an unmerged on account of not enough popularity "localRoot" thing which would allow consistent squashing of intermediate spans. We also have a completion hook which allows things like instrumentation that verbosely "log" into a span.. allows you to delete all of that junk. storage is almost always a function of size and indexed fields. Another thing to think about is unindexed fields. for example, amazon specifically calls out indexed fields vs not indexed ones. This can help in systems like that. Ex you wouldn't want to index json pooped into your span.

mtwo commented 5 years ago

The majority of complaints that have been relayed to me are about metrics. There are a few things that I think we can do:

@rakyll @ahmetb any other ideas?

jparsana commented 5 years ago

@odeke-em can you help with implementing this ?

cc @shahprit / @bogdandrutu

odeke-em commented 5 years ago

Thanks for the tag @jparsana!

Reduce the number of metrics captured by default. Today we capture about twelve and we should reduce this to just latency and errors.

@mtwo I think sizes are very important too but we can collapse "sent_bytes" and "received_bytes" into one and then use a tag say *"_type" which will be either of "received" or "sent"

We can fold "sent_messages_per_method" and "received_messages_per_method" and "sent_messages_per_rpc" and "received_messages_per_rpc" to "messages" and have two addition tags "type": which can be either of "received" or "sent" and then "method" as "per_method" seems like a subset of "per_rpc". And same thing for "completed_rpcs" and "started_rpcs" those can be folded into "rpcs" and we can use a tag "state" which will either be "started" or "completed"

This would reduce them number of metrics by ~64% i.e. 4 metrics instead of 11 and we shall be enumerating this for gRPC and HTTP.

For an illustration, please see below this for gRPC server views:

Server views

Before

grpc.io/server/received_messages_per_rpc grpc.io/server/sent_messages_per_rpc grpc.io/server/sent_messages_per_method grpc.io/server/received_messages_per_method grpc.io/server/sent_bytes_per_method grpc.io/server/received_bytes_per_method grpc.io/server/received_bytes_per_rpc grpc.io/server/sent_bytes_per_rpc grpc.io/server/server_latency grpc.io/server/completed_rpcs grpc.io/server/started_rpcs

After Additonal/New Tags
grpc.io/server/bytes or grpc.io/server/data_transferred type: either of ["received", "sent"] and "grpc_method"
grpc.io/server/messages or grpc.io/server/messages_transferred "type": either of ["received", "sent"] and "grpc_method" which can be anything *
grpc.io/server/server_latency
grpc.io/server/rpcs "state": either of ["started", "completed"] and "grpc_method" which can be anything *

Of course recording rpc counts with different tags at different states "started" then "completed" might be non-ideal (i.e. record 2x times on the same metric) so perhaps we can reconsider reverting this suggestion to keep "completed_rpcs" and "started_rpcs" separate?

Please let me know what you think.

yurishkuro commented 5 years ago

This would reduce them number of metrics by ~64% i.e. 4 metrics instead of 11 and we shall be enumerating this for gRPC and HTTP.

NB: combining metrics but partitioning them by tags does not reduce the number of unique time series produced, which is often what customers of APM tools are paying for.

odeke-em commented 5 years ago

NB: combining metrics but partitioning them by tags does not reduce the number of unique time series produced, which is often what customers of APM tools are paying for.

Thanks for checking up on my assumptions @yurishkuro! In deed, it doesn't reduce the number of unique time series produces(volume) which as you are right is the basis for charges. I did a calculation and found whether 10 or 20 metrics the cost would be the same as long as the volume is the same. I was thinking of the number of metrics being an extra cost.

bogdandrutu commented 5 years ago

Reduce the number of metrics captured by default. Today we capture about twelve and we should reduce this to just latency and errors. The remaining ones (response size, queue length, etc.) can then be optionally enabled.

This is the whole point of "stats" package. User that are concerned about the cost of the exported data should configure their own views instead of installing the default recommended "views". We can probably have a "minimum" recommended views. The fact that we record all these measures (latency, bytes in/out, etc.) does not affect the cost if nobody installs a view.

Increase the default metrics aggregation window to one minute

I think this is a good idea. We should make this change, the only problem that I saw in the past is that for example code we want users to see a quick result which probably requires having a smaller exporting interval for examples. Probably we should have warnings in all our example code when we do change something for users just to see a quick result.