authzed / spicedb

Open Source, Google Zanzibar-inspired database for scalably storing and querying fine-grained authorization data
https://authzed.com/docs
Apache License 2.0

Proposal: SpiceDB telemetry #225

Closed jakedt closed 2 years ago

jakedt commented 3 years ago

Telemetry Proposal

As a small team developing very high performance software, we're constantly prioritizing between improving features, stability, performance, and user experience. While we obviously have metrics from our hosted SpiceDB instances on Authzed.com, our last product taught us that open-source and enterprise users often use the software in surprisingly different ways. In order to develop a tight feedback loop with our users, we would like to add some opt-out telemetry to SpiceDB.

As big fans of open source, and as heavy users ourselves, we understand that users can be sensitive to data collection and exfiltration efforts by the software they run. That's why it is our goal to be as open and transparent about this process as possible.

Philosophical Goals

Proposed Data Collection

Each of the following metrics includes the justification and the specific way in which we will use the data to measure and improve the software.

Running SpiceDB instances per installation

Knowing average cluster size will help us to direct resources to service discovery, clustering, and remote re-dispatch.

Distributed cache hit ratio

The Zanzibar model lives and dies by how effectively it utilizes the distributed cache. While we have our own tests and metrics, one or more user clusters underperforming would be an indicator that something is awry with our consistent hashing or data access assumptions.
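As a sketch of the kind of signal this metric would give (the function and values below are illustrative, not SpiceDB's actual telemetry fields): a hit ratio is derived from two monotonically increasing counters, and an underperforming cluster shows up as a low ratio.

```python
# Hypothetical sketch: deriving a cache hit ratio from hit/miss counters.
# Names and thresholds are illustrative, not SpiceDB's actual wire format.

def hit_ratio(hits: int, misses: int) -> float:
    """Return the cache hit ratio, guarding against division by zero."""
    total = hits + misses
    return hits / total if total else 0.0

# A healthy Zanzibar-style cluster resolves most subproblems from cache.
healthy = hit_ratio(hits=9_200, misses=800)    # 0.92
suspect = hit_ratio(hits=1_100, misses=8_900)  # 0.11 -> something is awry

assert healthy > 0.9 and suspect < 0.2
```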

Number of object definitions

In the Zanzibar paper, Google gave metrics about average schema size. A histogram would have been better! Schema complexity directly correlates to resolution complexity, and knowing whether open-source users are using more or less complex schemas than anticipated would help us direct resources toward nested query complexity.
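The histogram point above can be illustrated with a simple bucketing sketch (the bucket boundaries and sample data are invented for illustration):

```python
from collections import Counter

# Hypothetical bucket upper bounds for object-definition counts per schema;
# anything larger falls into the "+Inf" bucket, as in a Prometheus histogram.
BUCKETS = [1, 5, 10, 25, 50, 100]

def bucket_for(n: int) -> str:
    for bound in BUCKETS:
        if n <= bound:
            return f"<= {bound}"
    return "+Inf"

# Simulated fleet of installations reporting their definition counts:
# a histogram preserves the shape of this distribution; an average would not.
reported = [2, 3, 7, 7, 12, 30, 240]
histogram = Counter(bucket_for(n) for n in reported)
```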

Number of relationships

As with schema complexity, the amount of data also controls the re-dispatch fan-out and resolution complexity. If schemas rely heavily on the arrow (`->`) operator on very large datasets, this would lead us to invest in improvements in resolution order and heuristics.

Number of redispatches/subproblems per operation

A metric that unifies data and schema, this is a direct, hardware-independent measurement of resolution complexity, and would direct investments similarly to schema and data complexity.
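Because the measurement counts subproblems rather than wall-clock time, it can be summarized the same way across any hardware. A minimal sketch (sample data and percentile helper invented for illustration):

```python
# Hypothetical sketch: summarizing dispatch fan-out per operation with no
# timing involved, so the result is hardware independent.

def percentile(samples: list[int], p: float) -> int:
    """Nearest-rank percentile over a list of counts (simplified)."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(p * len(s)))
    return s[idx]

# Simulated subproblem counts for a series of Check operations;
# a long tail here would point at resolution-order heuristics.
subproblems_per_check = [1, 1, 2, 3, 3, 4, 8, 21]
p50 = percentile(subproblems_per_check, 0.50)
p95 = percentile(subproblems_per_check, 0.95)
```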

Number of calls (but not latencies) to specific APIs

The Zanzibar paper gives the call frequencies for certain operations, but they do not tell the complete story: at Google, Read is used more than Check, and Zanzibar does not support Lookup at all. In order to make sure we're investing in improvements to each method appropriately, it is important to understand the call-frequency usage patterns.
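A sketch of the tally this implies (the recorded method names follow the public SpiceDB API surface, but the counting mechanism and sample traffic are invented for illustration; note that no latencies are captured):

```python
from collections import Counter

# Hypothetical per-API call tally: counts only, never durations.
calls = Counter()

def record(method: str) -> None:
    calls[method] += 1

# Simulated traffic mix for one installation.
for m in ["CheckPermission"] * 50 + ["ReadRelationships"] * 5 + ["LookupResources"] * 3:
    record(m)

# Relative call frequency is what would guide per-method investment.
total = sum(calls.values())
frequencies = {m: n / total for m, n in calls.items()}
```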

Considered and Rejected

It is often as important to know what was considered and rejected as it is to know what was included in the final proposal.

Rejected: Collecting API latency metrics

This is extremely infrastructure dependent, and no useful information could be gleaned from it in aggregate. Hardware independent complexity measures are preferred as a result.

Rejected: User driven redaction of specific metrics

While this sounds interesting at the outset, having an incomplete picture of the metrics from each SpiceDB installation could be statistically misleading. For example, knowing the cache hit ratio but not the schema complexity would make it hard to know if there is a data issue or schema issue.

Rejected: Opt-in metrics

While this is obviously very user-friendly, we're all aware of the problems of response bias in statistics. We may end up with an entirely different class of user choosing to report metrics than the average. This may skew efforts in the wrong direction. For example, if only enterprises opt-in to the data collection, we may completely overlook problems with the software that arise during the small-scale development phase.

Open Questions

jzelinskie commented 3 years ago

Another open question is how the collected data will be shared with the community. Maintainers only? A public dashboard?

bwplotka commented 3 years ago

Why does it have to be push-centric? 🤗

jakedt commented 3 years ago

> Why does it have to be push-centric? 🤗

@bwplotka Honestly, because of NAT traversal. It's fairly easy for us to create an endpoint that most™ machines would be able to talk to. There is a Prometheus NAT-traversal proxy: https://github.com/prometheus-community/PushProx

We should add that to the list of things to consider.
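For context on why push works through NAT where pull does not: the node only ever makes an outbound request, so no inbound hole needs punching. A minimal sketch of what a pushed report might look like (every field name and the endpoint are hypothetical; this is not the format shipped in #515):

```python
import json

# Hypothetical telemetry payload a SpiceDB node might push periodically.
# All field names here are illustrative, not an actual wire format.
payload = {
    "installation_id": "a-random-uuid",  # anonymous, per-installation
    "cluster_size": 3,
    "cache_hit_ratio": 0.92,
    "object_definitions": 14,
    "relationships": 120_000,
    "api_calls": {"CheckPermission": 50, "ReadRelationships": 5},
}

body = json.dumps(payload).encode()
# In a real push, this body would be POSTed over outbound HTTPS to a
# collection endpoint (the NAT-friendly direction); network call omitted here.
```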

jamtur01 commented 2 years ago

+1 to push - much easier to manage from a security (and approval) perspective over punching holes inbound in the perimeter.

jakedt commented 2 years ago

Fixed by #515