Closed: jakedt closed this issue 2 years ago
Another open question is how the collected data will be shared with the community. Maintainers only? A public dashboard?
Why does it have to be push-centric? 🤗
@bwplotka Honestly, because of NAT traversal. It's fairly easy for us to create an endpoint that most™ machines would be able to talk to. There is a Prometheus NAT traversal thing: https://github.com/prometheus-community/PushProx
We should add that to the list of things to consider.
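For illustration, a push-style report could be as simple as an outbound HTTPS request from each instance. Here is a minimal sketch using the Prometheus Pushgateway client as a stand-in; the endpoint URL, job name, and metric are placeholders, not a committed design:

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	// Placeholder metric: how many SpiceDB instances this installation is running.
	running := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "spicedb_running_instances",
		Help: "Number of SpiceDB instances in this installation (illustrative).",
	})
	running.Set(3)

	// Outbound push only: no inbound firewall holes or NAT traversal required.
	// The endpoint URL and job/grouping values below are placeholders.
	if err := push.New("https://telemetry.example.com", "spicedb_telemetry").
		Collector(running).
		Grouping("cluster_id", "anonymous-1234").
		Push(); err != nil {
		log.Printf("telemetry push failed (non-fatal): %v", err)
	}
}
```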
+1 to push - much easier to manage from a security (and approval) perspective over punching holes inbound in the perimeter.
Fixed by #515
Telemetry Proposal
As a small team developing very high performance software, we're constantly prioritizing between improving features, stability, performance, and user experience. While we obviously have metrics from our hosted SpiceDB instances on Authzed.com, our last product taught us that open-source and enterprise users often use the software in surprisingly different ways. In order to develop a tight feedback loop with our users, we would like to add some opt-out telemetry information to SpiceDB. As big fans of open source, and as heavy users ourselves, we understand that users can be sensitive to data collection and exfiltration efforts by the software they run. That's why it is our goal to be as open and transparent about this process as possible.
Philosophical Goals
- A TELEMETRY.md in the root of the repository that includes the final form of this proposal and easy instructions for disabling telemetry
- A message at the INFO log level every time an instance of SpiceDB starts with telemetry enabled
- A message at the INFO log level every time telemetry data is sent

Proposed Data Collection
Each of the following metrics includes the justification and the specific way in which we will use the data to measure and improve the software.
Running SpiceDB instances per installation
Knowing average cluster size will help us to direct resources to service discovery, clustering, and remote re-dispatch.
Distributed cache hit ratio
The Zanzibar model lives and dies by how effectively it utilizes the distributed cache. While we have our own tests and metrics, one or more user clusters underperforming would be an indicator that something is awry with our consistent hashing or data access assumptions.
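As a rough sketch of what reporting this could look like, the ratio can be derived from simple hit/miss counters; the type and names below are illustrative, not SpiceDB's actual cache internals:

```go
package telemetry

import "sync/atomic"

// cacheStats is an illustrative hit/miss tracker for a dispatch cache.
type cacheStats struct {
	hits   atomic.Uint64
	misses atomic.Uint64
}

func (s *cacheStats) RecordHit()  { s.hits.Add(1) }
func (s *cacheStats) RecordMiss() { s.misses.Add(1) }

// HitRatio returns hits / (hits + misses), or 0 before any lookups.
func (s *cacheStats) HitRatio() float64 {
	h := float64(s.hits.Load())
	m := float64(s.misses.Load())
	if h+m == 0 {
		return 0
	}
	return h / (h + m)
}
```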
Number of object definitions
In the Zanzibar paper, Google gave metrics about average schema size. A histogram would have been better! Schema complexity directly correlates to resolution complexity, and knowing that open-source users are using more or less complex schemas than anticipated would help us direct resources toward nested query complexity.
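To make the histogram point concrete, a sketch of reporting object-definition counts as a distribution rather than a single average could look like the following; the metric name and bucket boundaries are assumptions for illustration:

```go
package telemetry

import "github.com/prometheus/client_golang/prometheus"

// Illustrative metric: distribution of object definitions per schema.
var schemaObjectDefinitions = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "spicedb_schema_object_definitions",
	Help:    "Number of object definitions in the loaded schema.",
	Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1, 2, 4, ..., 2048
})

// observeSchemaSize would be called whenever a schema is written or loaded.
func observeSchemaSize(numDefinitions int) {
	schemaObjectDefinitions.Observe(float64(numDefinitions))
}
```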
Number of relationships
Similarly to schema complexity, the amount of data also controls the re-dispatch fan-out and resolution complexity. If schemas rely heavily on the arrow (->) operator on very large datasets, this would lead us to invest in improvements in resolution order and heuristics.
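For context, a minimal schema fragment using the arrow operator might look like the following (shown here as a Go string constant; the definitions and names are purely illustrative):

```go
package telemetry

// exampleArrowSchema is a made-up schema fragment demonstrating the arrow
// operator, which walks from a relationship to a permission on the related
// object, dispatching a subproblem per related object.
const exampleArrowSchema = `
definition user {}

definition folder {
	relation viewer: user
	permission view = viewer
}

definition document {
	relation parent: folder

	// The arrow walks from each parent folder to its "view" permission.
	permission view = parent->view
}
`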
Number of redispatches/subproblems per operation
A metric that unifies data and schema, this is a direct, hardware-independent measurement of resolution complexity, and would direct investments similarly to schema and data complexity.
Number of calls (but not latencies) to specific APIs
The Zanzibar paper gives the call frequencies for certain operations, but does not tell the complete story. In the Zanzibar paper, Read is used more than Check, and Zanzibar does not support Lookup at all. In order to make sure we're investing in improvements to each method appropriately, it is important to understand the call-frequency usage patterns.
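A sketch of the counting side is below; the metric and label names are assumptions, not SpiceDB's actual metrics:

```go
package telemetry

import "github.com/prometheus/client_golang/prometheus"

// Illustrative counter of API calls per method, with no latencies recorded.
var apiCalls = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "spicedb_api_calls_total",
	Help: "Total API calls, labeled by method.",
}, []string{"method"})

// recordCall would be invoked from a gRPC interceptor or similar hook,
// e.g. recordCall("CheckPermission") or recordCall("LookupResources").
func recordCall(method string) {
	apiCalls.WithLabelValues(method).Inc()
}
```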
Considered and Rejected
It is often as important to know what was considered and rejected as it is to know what was included in the final proposal.
Rejected: Collecting API latency metrics
This is extremely infrastructure-dependent, and no useful information could be gleaned from it in aggregate. Hardware-independent complexity measures are preferred as a result.
Rejected: User-driven redaction of specific metrics
While this sounds interesting at the outset, having an incomplete picture of the metrics from each SpiceDB installation could be statistically misleading. For example, knowing the cache hit ratio but not the schema complexity would make it hard to know if there is a data issue or schema issue.
Rejected: Opt-in metrics
While this is obviously very user-friendly, we're all aware of the problems of response bias in statistics. We may end up with an entirely different class of user choosing to report metrics than the average. This may skew efforts in the wrong direction. For example, if only enterprises opt in to the data collection, we may completely overlook problems with the software that arise during the small-scale development phase.
Open Questions