@adriansmares @johanstokking @KrishnaIyer @htdvisser
I'm thinking that this would involve either something like OpenTelemetry or the existing metrics pkg, which generates Prometheus data. I wanted to get everyone's opinion on what data should be handled.
I imagine the main metrics would be something related to:
I'm also assuming that the idea is to have the data accessible in a manner that makes it possible to distinguish data by route and method: having the data of a route described, but with the possibility of specifying/requesting data for a specific method in that route, like:
/route
  -> overall metrics
methodGet
  -> metrics regarding this method
methodPut
  -> metrics regarding this method
RegistryInteraction
  -> data regarding this section

As an observation, I'm not familiar with OpenTelemetry and therefore don't know for certain how feasible it is to have the data modeled in this manner. I think it should be somewhat attainable, since the docs describe the idea of the data collection being span based.
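For illustration, with Prometheus-style metrics this route/method breakdown is usually modeled with labels on a single metric rather than separate series; here is a minimal sketch using the standard Prometheus Go client (the metric and route names are hypothetical):

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requestsTotal is partitioned by route and method: summing over the method
// label gives the overall metrics for a route, while selecting a single
// method value gives the per-method view described above.
var requestsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total", // hypothetical metric name
		Help: "Requests handled, partitioned by route and method.",
	},
	[]string{"route", "method"},
)

func main() {
	prometheus.MustRegister(requestsTotal)
	http.HandleFunc("/route", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues("/route", r.Method).Inc()
		w.Write([]byte("ok"))
	})
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```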
Besides the data to be stored as telemetry, there is the topic of its usage. Should it be enabled by default? I personally think that it should be disabled in the default config but enabled in the community version; that way, people running TTN on their own server don't have to worry about this.
Current Situation
No telemetry exists
This is not true at all. There is already a lot of telemetry in The Things Stack. See https://www.thethingsindustries.com/docs/reference/telemetry/ for details.
From what I understand, what's asked here is to push some of that telemetry to some global service that aggregates this information across all deployments of The Things Stack (depending on opt-in / opt-out of course).
Before we start thinking about the implementation, I think the most important question is: what exactly needs to be shared, and why? Because in my opinion, most of the telemetry that The Things Stack currently collects (in deployments other than our own) is of no interest to us (and most of it is also none of our business).
Why do we care about request time and memory usage in deployments other than our own? We have nothing to do with the SLAs of those deployments, nor the operations and scaling of their servers. Errors and panics from other deployments can already be shared with us by configuring our Sentry DSN, but what errors do we expect to catch that we won't already catch in our own deployments?
So let's first come up with a short list of maybe 10 metrics, where we clearly explain what is measured and why.
I might have misunderstood the initial issue to be discussed and indeed the question that you pointed out makes more sense.
Okay I wasn't entirely clear when I asked @NicolasMrad to file an issue.
What I meant is telemetry about how The Things Stack is used as a self-managed deployment. That gives us, the maintainers, insight into how the product is used. We might want to make this opt-out on two levels: collecting telemetry (for us) and being part of public aggregated telemetry (for everyone).
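As a sketch of what those two opt-out levels could look like in configuration (the type, field names, and struct tags below are illustrative, not an existing TTS config section):

```go
package config

// TelemetryConfig sketches the two opt-out levels described above.
type TelemetryConfig struct {
	// Collect usage telemetry and upload it to the maintainers.
	Enable bool `name:"enable" description:"Collect and upload usage telemetry"`
	// Additionally include this deployment in the public aggregated telemetry.
	PublicAggregation bool `name:"public-aggregation" description:"Include this deployment in public aggregated telemetry"`
}
```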
Examples say more than a thousand words:
For some more inspiration, Syncthing has some really nice public aggregated telemetry: https://data.syncthing.net/
Some things that I would be interested in:
Right, these suggestions also clearly demonstrate the value that this sort of telemetry brings to the maintainers.
Just to be clear and for future reference: we are not going to collect any personal identifiable information, or any user data in general, and the purpose will never be to reach out with commercial offers. If we do the latter, it would be opt-in when "registering" the TTS deployment and signing up for promotions.
Implementation wise, I think we should first consider if we can use prometheus interfaces for this. This way, we would only need to implement a prometheus exporter to upload certain metrics to an API endpoint. The exporter would filter existing metrics by a list of telemetry metrics. This way, we leverage existing metrics and we can easily add new metrics without introducing a new typesystem for that.
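A minimal sketch of such a filtering exporter, assuming the standard Prometheus Go client and text exposition format (the allowlisted metric name and the endpoint are placeholders):

```go
package telemetry

import (
	"bytes"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/common/expfmt"
)

// allowlist contains the metric names considered telemetry; everything else
// in the default registry is left untouched.
var allowlist = map[string]bool{
	"ttn_lw_gs_gateways_connected": true, // hypothetical metric name
}

// Export gathers the default registry, keeps only allowlisted metric
// families, and uploads them in the text exposition format.
func Export(endpoint string) error {
	families, err := prometheus.DefaultGatherer.Gather()
	if err != nil {
		return err
	}
	var buf bytes.Buffer
	enc := expfmt.NewEncoder(&buf, expfmt.FmtText)
	for _, mf := range families {
		if allowlist[mf.GetName()] {
			if err := enc.Encode(mf); err != nil {
				return err
			}
		}
	}
	resp, err := http.Post(endpoint, string(expfmt.FmtText), &buf)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```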
@htdvisser @nicholaspcr what do you think?
Implementation wise, I think we should first consider if we can use prometheus interfaces for this. This way, we would only need to implement a prometheus exporter to upload certain metrics to an API endpoint. The exporter would filter existing metrics by a list of telemetry metrics. This way, we leverage existing metrics and we can easily add new metrics without introducing a new typesystem for that.
Implementation-wise, I think this is indeed what makes the most sense. I'll look a bit more into creating the Prometheus exporter this week and try to think of anything that could cause a problem, but nothing comes to mind at the moment.
I also quite like the idea of monitoring feature usage that @htdvisser gave. In regards to other metrics that might generate this sort of insight, I'm still researching how other OS projects do it; if I find other ideas, I'll write them here to get everyone's opinion.
I don't know if piggybacking on Prometheus is the best approach. Prometheus counters are raw data, and I think the type of telemetry we want to have should already be aggregated to some extent. I also don't think we want to end up in a situation where we can't change our operational metrics (from Prometheus to OpenTelemetry) without breaking the aggregated telemetry.
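To illustrate that difference: instead of shipping raw counters, the uploaded payload could be a small, already-aggregated report whose shape is independent of the metrics backend. A sketch, with all field names and values purely illustrative:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TelemetryReport sketches a pre-aggregated telemetry message, decoupled
// from whichever system (Prometheus, OpenTelemetry, ...) produced the
// underlying numbers.
type TelemetryReport struct {
	ClusterID            string `json:"cluster_id"`
	Version              string `json:"version"`
	GatewaysConnected    uint64 `json:"gateways_connected"`
	UplinkMessagesPerDay uint64 `json:"uplink_messages_per_day"` // aggregated over a window, not a raw counter
}

func main() {
	report := TelemetryReport{
		ClusterID:            "abc123",
		Version:              "3.22.0",
		GatewaysConnected:    4,
		UplinkMessagesPerDay: 12345,
	}
	b, _ := json.Marshal(report)
	fmt.Println(string(b))
}
```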
So let's indeed take a look at what's out there and what solutions other open source projects use.
@NicolasMrad can you set up a meeting to discuss this further?
Updating the issue with what was discussed in the meeting on 8/09/2022.
Regarding the data to be collected in the open source version, the objective is to be somewhat concise about what is being collected.
Explanation of terms used in the list above:
- registered means simply that the entity exists in the database.
- activated means that the DB has the field active as true.
- active means that the value of last_seen/last_updated is relatively recent.

The data described should be somewhat simple to fetch, meaning it should be on the IS or easy to fetch from already existing methods provided by the stack or the standard library.
In regards to the implementation of the information collector, it was suggested to try to make it more maintainable by using AWS lambda functions and other functionalities, instead of managing the container of the new application. More details regarding the collector will be added to the issue later after I read more on the subject.
The data described should be somewhat simple to fetch, meaning it should be on the IS or easy to fetch from already existing methods provided by the stack or the standard library.
I don't think we should limit telemetry collection to IS. I think we should have component registerers (like we have for services) that can produce arbitrary key/value telemetry in their component namespace. For example, a key metric is number of gateways connected which isn't observable by IS.
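A sketch of what such a component registerer could look like (the names below are illustrative, not an existing TTS interface):

```go
package telemetry

// Producer returns arbitrary key/value telemetry for one component.
type Producer func() (map[string]any, error)

var producers = map[string]Producer{}

// Register adds a telemetry producer under the given component namespace
// (e.g. "gs", "ns", "as", "is", "js").
func Register(namespace string, p Producer) {
	producers[namespace] = p
}

// Collect runs all registered producers and namespaces their results,
// e.g. {"gs": {"gateways_connected": 12}, "is": {"registered_devices": 345}}.
func Collect() (map[string]map[string]any, error) {
	out := make(map[string]map[string]any, len(producers))
	for ns, p := range producers {
		values, err := p()
		if err != nil {
			return nil, err
		}
		out[ns] = values
	}
	return out, nil
}
```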
In regards to the implementation of the information collector, it was suggested to try to make it more maintainable by using AWS lambda functions and other functionalities, instead of managing the container of the new application. More details regarding the collector will be added to the issue later after I read more on the subject.
This shouldn't be platform specific. I don't understand why it would be more maintainable if we have deployment specific runners if we already have task infrastructure in TTS. TTSOS should be able to produce telemetry and upload it to us.
TTS can run in multiple instances. Even though we don't document that for OS, it is certainly possible to have separate containers for TTS components. With TTSE this is more common. These would all produce their own telemetry, and we must be able to correlate this to one cluster. How do we do that?
Today we don't have a unique key to correlate instances to one cluster. For TTSE we could hash the license key. For TTSOS we may not bother with this too much (as we don't document it) and maybe correlate to origin IP.
I don't think we should limit telemetry collection to IS. I think we should have component registerers (like we have for services) that can produce arbitrary key/value telemetry in their component namespace. For example, a key metric is number of gateways connected which isn't observable by IS.
Agree. This is probably me not being able to properly convey what was discussed in the meeting. I remember that the idea of these topics is to be a base, a starting point of some sort, for the implementation of telemetry in the OS. These metrics should be somewhat easy to collect; that's why the initial focus is on the IS-related metrics.
In regards to the implementation of the information collector, it was suggested to try to make it more maintainable by using AWS lambda functions and other functionalities, instead of managing the container of the new application. More details regarding the collector will be added to the issue later after I read more on the subject.
This shouldn't be platform specific. I don't understand why it would be more maintainable if we have deployment specific runners if we already have task infrastructure in TTS. TTSOS should be able to produce telemetry and upload it to us.
The implementation in this case would be of the data receiver (poorly described as the collector in my previous comment). The metrics would be generated by TTS and sent to the receiver, which would be a lambda function (I still have to read up on this).
TTS can run in multiple instances. Even though we don't document that for OS, it is certainly possible to have separate containers for TTS components. With TTSE this is more common. These would all produce their own telemetry, and we must be able to correlate this to one cluster. How do we do that?
Today we don't have a unique key to correlate instances to one cluster. For TTSE we could hash the license key. For TTSOS we may not bother with this too much (as we don't document it) and maybe correlate to origin IP.
Making the uniqueID a hash of the config URL would indeed make the metrics non-attachable to a cluster in OS, since we don't have a unique value that is shared between each component and distinct from other deployments. However, I imagine that the number of people who run their own deployment of each TTS component separately is marginal, so I think it's appropriate to keep the hash-of-config-URL idea for OS and use the license key hash for TTSE.
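As a sketch, deriving the cluster identifier along those lines could look like this (the exact inputs are still to be decided; this is not a settled design):

```go
package telemetry

import (
	"crypto/sha256"
	"encoding/hex"
)

// ClusterID derives an opaque identifier to correlate instances of one
// cluster: the license key for TTSE, falling back to a value shared by the
// deployment's components (e.g. the config URL) for OS.
func ClusterID(licenseKey, configURL string) string {
	seed := licenseKey
	if seed == "" {
		seed = configURL
	}
	sum := sha256.Sum256([]byte(seed))
	return hex.EncodeToString(sum[:])
}
```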
Writing in here to update the status of the issue.
With the #6021 PR approved, the defined fields are collected in the Stack and the CLI. The missing steps for the full flow of telemetry collection are currently as follows:

- entity_count, which should have the date as a secondary index (… date as an index).

Closing this issue in favour of the one present in the product management repository.
The last update, made in March, does not reflect the current implementation, as the Daily Sweeper and Graph generator were discarded in favour of using TimestreamDB and Grafana. The issue linked in the management repository is more up to date and will therefore serve as a better umbrella issue.
Ref: https://github.com/TheThingsIndustries/product-management/issues/11
@nicholaspcr: Please remember to link the issue that an issue replaces for tracking.
I didn't reference the comment because there was a link pointing to it right above. Nevertheless, next time, I will reference the issue.
Summary
We need to implement well-documented telemetry collection on OS. The data to be collected is usage-related, not personal.
Current Situation
No telemetry exists
Why do we need this? Who uses it, and when?
To gain some insight into the usage of the stack by OS users.
Proposed Implementation
Components need to be able to provide their own telemetry separately (so GS, NS, AS, IS, JS separately), but it should be combined in one call to an API endpoint. Again, only usage data will be collected; no personal (PII) data will be stored or collected. The opt-out option should be well documented, and once implemented, this should be highlighted in the changelog. This needs to be implemented in a minor release.
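A sketch of combining the per-component telemetry into a single upload (the endpoint and payload shape are placeholders for whatever the final API defines, and perComponent is assumed to come from a collector like the one sketched earlier in this thread):

```go
package telemetry

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// Upload merges the per-component reports into one payload and sends it in
// a single call to the collection endpoint.
func Upload(endpoint string, perComponent map[string]map[string]any) error {
	body, err := json.Marshal(perComponent)
	if err != nil {
		return err
	}
	resp, err := http.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```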