Closed synctext closed 1 month ago
Interesting visualisations! Somewhat related to https://github.com/Tribler/tribler/issues/3508 (at least the TrustChain deployment monitoring).
e2e anonymous download is an excellent candidate for performance monitoring and should not take long to set up. I think @ichorid addressed this a while ago actually but it has not been actively monitored since then. In fact, making us (more) aware of failing tests/validation experiments is becoming a necessity as the number of different tests that run at fixed time intervals is growing.
I think we have to address this issue sooner rather than later. The problem is that if we do not do it, we will have a proliferation of different tools. Currently, we have the TrustChain explorer, Tribler user statistics, the error reporter and all tests/monitors on Jenkins. There might be some opportunity to merge some tools, which eases maintenance.
metadata status: keyword searches, channel gossip community
This might be a dangerous one to monitor and could be a violation of one's privacy expectations of Tribler.
Please look at FileCoin slipped roadmap. After Release 7.5 I'm considering that we work together on the first Jenkins dashboard for 2 weeks:
Can we decide on some software/library to use (or to make) to graph all of this data? All sorts of dashboard creation tools exist.
For example: https://dzone.com/articles/build-beautiful-console-dashboards-with-sampler
Most of this data can be extracted either from our existing Jenkins jobs using the API, or from our running TrustChain explorer backend, also with API requests. One of the questions we should also answer is whether we want a dedicated website for this. Jenkins unfortunately does not provide the tools for such real-time data, and integrating this dashboard in Jenkins would just be a new job with a succeed/fail status.
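As a rough illustration of the Jenkins side, the sketch below builds the JSON API URL for a job's latest build and reduces the response to a few dashboard fields. The host and the job name are taken from links in this thread; the sample payload is made up, and the selection of fields is an assumption about what a dashboard tile would need.

```python
import json
from urllib.parse import quote

JENKINS_URL = "https://jenkins-ci.tribler.org"  # host as it appears in links in this thread


def last_build_api_url(job_name: str, base_url: str = JENKINS_URL) -> str:
    """Return the JSON API URL for the most recent build of a job."""
    return f"{base_url}/job/{quote(job_name)}/lastBuild/api/json"


def summarize_build(payload: dict) -> dict:
    """Reduce a Jenkins build payload to the fields a dashboard tile needs."""
    return {
        "number": payload.get("number"),
        # "result" is null while a build is still running.
        "result": payload.get("result") or "RUNNING",
        # Jenkins reports the duration in milliseconds.
        "duration_s": payload.get("duration", 0) / 1000,
    }


# Made-up payload for illustration; a real one would be fetched from the URL above.
sample = json.loads('{"number": 42, "result": "SUCCESS", "duration": 93500}')
print(last_build_api_url("Test_BootstrapServers"))
print(summarize_build(sample))
```

Keeping the parsing separate from the HTTP request means the same function could be reused whether the data is polled by a cron job or by a dashboard backend.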
arrange hardware monitors with obscene awesomeness, due to size
We should secure a prominent spot at the coffee machine ☕️
I propose starting with something "easy": exposing GitHub events through tribler.org (tribler.org or domain).tribler.org/githubevents?), rendering all GitHub events (possibly with websockets for live updates). The idea is that we can reuse the resulting backend for another (bigger and better) dashboard and we'll have something to look at in the meantime.
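A minimal sketch of the rendering step, assuming the backend has already fetched a list of events in the shape returned by the GitHub REST events API (objects with `type`, `actor.login`, `repo.name` and `created_at`). The function names and the sample events are hypothetical.

```python
def render_event(event: dict) -> str:
    """Format one GitHub event as a single line for a feed."""
    actor = event.get("actor", {}).get("login", "unknown")
    repo = event.get("repo", {}).get("name", "unknown")
    kind = event.get("type", "UnknownEvent")
    return f'{event.get("created_at", "?")} {actor} {kind} {repo}'


def render_feed(events: list, limit: int = 10) -> list:
    """Return the newest events first, at most `limit` lines."""
    # ISO 8601 timestamps in the same format sort correctly as plain strings.
    newest_first = sorted(events, key=lambda e: e.get("created_at", ""), reverse=True)
    return [render_event(e) for e in newest_first[:limit]]


# Made-up events for illustration.
events = [
    {"type": "PushEvent", "actor": {"login": "alice"},
     "repo": {"name": "Tribler/tribler"}, "created_at": "2020-01-01T10:00:00Z"},
    {"type": "IssuesEvent", "actor": {"login": "bob"},
     "repo": {"name": "Tribler/tribler"}, "created_at": "2020-01-02T09:30:00Z"},
]
for line in render_feed(events):
    print(line)
```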
One way to get more insight into our user count is by analysing the crawled TrustChain data. The plot below is generated from our current dataset, with over 80,000 users and 123 million records. The (major) releases of Tribler are annotated. Note how our 7.5.0 release resulted in an increase in new user count.
Parsing this 97GB database, however, is computationally intensive, so it could be done once a day, for example. A dashboard could then include the resulting static image.
In 2006-2009 we had initial deployment monitoring. It is included in Zeilemaker's master thesis.
Based on data we already have
Yesterday I did a little research on this topic, and now I want to suggest a way to show anonymized performance statistics. It may be the following set of technologies:
The most popular tool for gathering and processing metrics is Prometheus. It has a big community and is widely used for gathering server metrics. Prometheus is often compared with InfluxDB (see the comparison in the official Prometheus docs). While Prometheus is more popular, in my opinion, InfluxDB is better suited to our needs for the following reasons:
1) Prometheus pulls metrics from the known number of server instances. In our case, we cannot pull statistics from client machines and want to push instead. While it is possible to use Prometheus with additional tools like Prometheus Aggregation Gateway, it in some way goes against Prometheus philosophy. On the other side, InfluxDB expects that the data are pushed, which is better suited to our needs.
2) Prometheus local storage is not intended for durable long-term retention (by default it keeps data for 15 days). InfluxDB data are persistent and can be used to compare changes in the gathered statistics over long time intervals.
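To make the push model concrete, here is a small sketch that formats one data point in InfluxDB line protocol, the text format that InfluxDB accepts on its write endpoint. The measurement, tag and field names are hypothetical, and only the basic escaping rules are covered.

```python
def escape_key(value: str) -> str:
    """Escape commas, equals signs and spaces in tag keys, tag values and field keys."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def escape_measurement(value: str) -> str:
    """Escape commas and spaces in a measurement name."""
    for ch in (",", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def format_field(value) -> str:
    """Render a field value: booleans, integers (with `i` suffix), floats, quoted strings."""
    if isinstance(value, bool):  # check bool first, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line: measurement,tag=value field=value timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    tag_part = "".join(f",{escape_key(k)}={escape_key(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{escape_key(k)}={format_field(v)}" for k, v in fields.items())
    return f"{escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical metric names, for illustration only.
line = to_line("client_stats", {"version": "7.5.0", "os": "linux"},
               {"peers": 12, "uptime_h": 1.5}, 1600000000000000000)
print(line)
```

In this model the client (or an intermediate collection server) decides when to send such lines, which is the property the comparison above relies on.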
Grafana is a very popular open-source tool for graph visualization, which can be used with Prometheus, InfluxDB, and multiple other data sources. It allows constructing powerful dashboards with different types of graphs and charts.
If we decide to use this set of tools, I think I can take on this task. I see the following sub-tasks to be implemented:
1) client-side code for preparing anonymized statistics
2) client-side code to send the gathered statistics to our dedicated server
3) a custom server API to collect anonymized statistics
4) server code which implements the API mentioned above, aggregates the collected data and puts it into an InfluxDB instance
5) deploy a dedicated server with the statistics gathering API, and deploy an InfluxDB instance (probably on a different machine)
6) deploy a Grafana instance
7) make a Grafana dashboard
Later we can use Grafana to display all graphs, not only user statistics but also server builds, etc.
What do you think?
Pitfall: everything we want with our self-organising research project is easier to do in a central server... Primarily use our crawlers as early warning infrastructure! (IPv8 is designed for network health monitoring) Then we need to emphasise crawler intelligence and stats aggregation.
Are we not re-creating this from scratch? https://jenkins-ci.tribler.org/job/Test_BootstrapServers/lastSuccessfulBuild/artifact/walk_rtts.png
First, anonymity is our existential feature. How to do this? (True anonymity might be impossible, OFF switch by default.) We could optionally show the user, inside the debug panel, the exact history and record which will be shared in private with our debug servers? Can we protect against Internet address leakage? Many steps in the future, I guess, to reuse our Tor-like stuff while debugging our Tor-like stuff :-)
This needs to be opt-in for production releases and can hopefully be opt-out for nightly builds and Beta versions. What about Release Candidates?
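The policy above could be captured in a small lookup. This is only a sketch: the channel names are hypothetical, and the default for Release Candidates is an assumption (treated conservatively as opt-in), since the thread leaves that question open.

```python
from enum import Enum
from typing import Optional


class Channel(Enum):
    PRODUCTION = "production"
    RELEASE_CANDIDATE = "rc"
    BETA = "beta"
    NIGHTLY = "nightly"


# Whether statistics reporting is on when the user has made no explicit choice.
DEFAULT_ENABLED = {
    Channel.PRODUCTION: False,         # opt-in
    Channel.RELEASE_CANDIDATE: False,  # assumption: undecided in the thread, kept opt-in
    Channel.BETA: True,                # opt-out
    Channel.NIGHTLY: True,             # opt-out
}


def reporting_enabled(channel: Channel, user_choice: Optional[bool] = None) -> bool:
    """An explicit user choice always wins; otherwise use the channel default."""
    if user_choice is not None:
        return user_choice
    return DEFAULT_ENABLED[channel]


print(reporting_enabled(Channel.PRODUCTION))      # False
print(reporting_enabled(Channel.NIGHTLY))         # True
print(reporting_enabled(Channel.NIGHTLY, False))  # False
```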
InfluxDB: 34,082 commits, 19.5k stars on GitHub. This is a general time-series database solution; do we still need to write custom code for deployment monitoring?
This seems quite complex tooling. Afraid of overengineering for the user community we have currently. However, deployment monitoring is something we really need to do more and get right.
InfluxDB and Grafana are indeed good choices.
1. Custom client-side code to prepare anonymized statistics
2. Dedicated server with custom API as an entry point
3. InfluxDB for storing anonymized data
4. Grafana for displaying beautiful graphs
I have done some work on 1 and 2. I'm extending https://release.tribler.org/docs to receive anonymized data from the client. That can be the entry point to further processing using InfluxDB and visualizing on Grafana.
We probably can use InfluxDB Jenkins plugin to put deployment statistics into the InfluxDB: https://wiki.jenkins.io/display/JENKINS//InfluxDB+Plugin
Change of plans:-) By 25 September aim to have plots in Jenkins. The PopularityCommunity is crawled and health statistics are refreshed every few minutes or half an hour. After this test project we determine what we need and roadmap. Could be a fix of the PopularityCommunity code plus algorithm as next step, deploy, monitor, etc.
Our current methodology:
Tribler is a bottomless pit of problems. (stolen quote) Our work methodology should become relentlessly data-driven: there is direct evidence we need better crawling, no evidence of client monitoring beyond debug screen and crash reporting (might change; agile)
exclusively rely on unit tests
I think a key metric is the stability of our unit tests. Currently, unstable unit tests (both on devel and our release branches) are delaying the development process. Converting the test suite to pytest, which should make the debugging process of errors in the tests easier, is much more work than I anticipated.
My suggestion would be to continuously run all unit tests on a dedicated machine and include in the upcoming dashboard how stable they are (e.g., % of runs failing during the last day).
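The suggested stability metric could be computed along these lines. The sketch assumes each completed run of the full suite is recorded with a finish time and a pass/fail flag; `SuiteRun` and `failure_rate` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, NamedTuple, Optional


class SuiteRun(NamedTuple):
    finished_at: datetime
    passed: bool


def failure_rate(runs: Iterable[SuiteRun], now: datetime,
                 window: timedelta = timedelta(days=1)) -> Optional[float]:
    """Percentage of runs that failed within `window` before `now`; None if no runs."""
    recent = [r for r in runs if now - window <= r.finished_at <= now]
    if not recent:
        return None
    failed = sum(1 for r in recent if not r.passed)
    return 100.0 * failed / len(recent)


now = datetime(2020, 1, 2, 12, 0, tzinfo=timezone.utc)
runs = [
    SuiteRun(now - timedelta(hours=1), True),
    SuiteRun(now - timedelta(hours=5), False),
    SuiteRun(now - timedelta(hours=9), True),
    SuiteRun(now - timedelta(hours=20), True),
    SuiteRun(now - timedelta(days=3), False),  # outside the window, ignored
]
print(failure_rate(runs, now))  # 25.0
```

Returning `None` when there are no runs in the window keeps "no data" distinct from "0% failing", which matters if the dedicated machine itself stops running.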
Related work: https://stats.goerli.net/
Impressive progress! Our .yml and servers are getting in much better shape. We can even see in real time the upgrade speed. Learned something new: they upgrade quite fast. Previous years we never had this.
Yeah! More pretty graphs, exit node peak: 121 GiB per second
cool
Example: https://data.syncthing.net/ File sync with central servers discovery and no spam measures. Great deployment monitoring!
@kozlovsky Could you please duplicate this specific https://data.syncthing.net/ graphs and wrap up the Grafana work?
This is quite a useful and simple graph to have.
Users Joining and Leaving per Day
===
This is the total number of unique users joining and leaving per day. A user is counted as "joined" on the first day their unique ID is seen, and as "left" on the last day the unique ID was seen before an absence of two weeks or longer. "Bounced" refers to users who joined and left on the same day.
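The definition above translates fairly directly into code. The sketch below assumes the input is a list of `(user_id, day)` sightings and that the final sighting of a user is compared against the current date; counting bounced users in the joined and left totals as well is also an assumption. The function name is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

ABSENCE = timedelta(days=14)  # "two weeks or longer"


def churn_per_day(sightings, today: date):
    """Return (joined, left, bounced) counters keyed by day."""
    days_by_user = defaultdict(set)
    for user_id, day in sightings:
        days_by_user[user_id].add(day)

    joined, left, bounced = Counter(), Counter(), Counter()
    for days in days_by_user.values():
        ordered = sorted(days)
        first = ordered[0]
        joined[first] += 1
        # Pair each sighting with the next one; the last sighting is compared to `today`.
        for current, following in zip(ordered, ordered[1:] + [today]):
            if following - current >= ABSENCE:
                left[current] += 1
                if current == first:
                    bounced[current] += 1  # joined and left on the same day
    return joined, left, bounced


sightings = [
    ("a", date(2021, 1, 1)), ("a", date(2021, 1, 2)), ("a", date(2021, 1, 25)),
    ("b", date(2021, 1, 1)),
    ("c", date(2021, 1, 20)),
]
joined, left, bounced = churn_per_day(sightings, today=date(2021, 2, 1))
print(dict(joined), dict(left), dict(bounced))
```

In the example, user `a` leaves on 2 January (a 23 day gap follows), user `b` bounces on 1 January, and user `c` is still considered present because fewer than 14 days have passed since the last sighting.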
To better organise ourselves we need more critical information in 1 place.
Mature network alerts and deployment monitoring. The mission is to put everything in one place. The big danger is to partially put everything together, but actually create the n+1 place called Grafana where data is fragmented. Full user experience pipeline:
Tribler-7.10.0.dmg (60.20 MiB) - downloaded 3,241 times. Last updated on 2021-07-14
A single page with graphs for the health of each step in our user journey would help to identify faults. We learned a lot from our recent "unknown user drop" incident. For example: it took the team 5 days to figure out we had a suspicious memory dip at 06:00 AM daily.
When we have hired more developers we can re-visit this issue. We need to focus on putting everything inside application-tester and existing code. Example of IPFS people on DHT health.
The IPFS people have a nice uptime monitoring script (DHT-only level):
Epic 2015 ticket with monitoring with Niels statistics. User community insight using an improved crawler
We take screenshots, takes a few clicks to find (application tester on Jenkins) Plus smooth Github actions: https://github.com/Tribler/tribler/actions/runs/3330189428/jobs/5508351618
Complex monitoring. Numerous statistics systems, all connected together, and almost all down now :crying_cat_face:
The network was not functioning optimally these days. The Tor-like network was running out of capacity. The root cause of the failure was a memory leak which went unnoticed. Grafana did not alert. No Slack alarm post. Testers did not alert. InfluxDB is not recording anymore. The Prometheus-Grafana data feed is down. Our dream of a single dashboard with health should have caught this. Another system brought live in a few hours:
This duplicates Jenkins monitoring: https://jenkins-ci.tribler.org/job/Test_BootstrapServers/lastSuccessfulBuild/artifact/summary.png We lack a single vision and minimal maintenance platform for alerts. ToDo after big release.
This is yet another indication that choosing Grafana+Prometheus may not have been the best decision for our "new" dashboard. We already have ample sources of information, so adding another unique source doesn't seem optimal. What we really need is a singular place to integrate all existing information.
From my perspective, here's what we should do (with a rough time estimation):
Our information sources:
(did I miss something?)
I think that of all the services we use for dashboards and monitoring (Prometheus, InfluxDB, Grafana), Prometheus is the most reliable (and can display monitoring graphs without Grafana), while the most problematic was InfluxDB; most dashboard outages were caused by it.
It may be worth spending time to set up Prometheus alerts, as it should cover most of the current problems.
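Several of the outages reported in this thread were data feeds that silently stopped. As a tool-agnostic sketch (this is not a Prometheus alert rule), a staleness check over the timestamp of the newest data point per source could look as follows. The source names and the 30 minute threshold are assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_sources(last_seen: Dict[str, Optional[datetime]], now: datetime,
                  max_age: timedelta = timedelta(minutes=30)) -> List[str]:
    """Return the sorted names of sources with no data point newer than `max_age`."""
    return sorted(
        name for name, ts in last_seen.items()
        if ts is None or now - ts > max_age
    )


now = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "prometheus": now - timedelta(minutes=2),
    "influxdb": now - timedelta(days=4),  # stopped recording
    "application_tester": None,           # never reported
}
print(stale_sources(last_seen, now))  # ['application_tester', 'influxdb']
```

A check like this only tells us that data stopped arriving, not why; it would complement, not replace, alerts on the metric values themselves.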
For persistent time series data, the most convenient data storage may be TimescaleDB, which can replace InfluxDB and fix most problems.
But trying something simpler like Graphite is also possible.
Since Grafana is currently used for deployment monitoring and as far as I understand there is no immediate priority to work on an alternative, I'm unassigning myself from this ticket.
Indeed, we have a solution in place. This issue is - at the very least for now - resolved. If we have specific alternatives that we want to explore in the future, another issue can be opened.
To better organise ourselves we need more critical information in 1 place.
The coming time we aim to close #1 finally. Our progress towards this goal and how stable we are should be captured in a Tribler-at-a-Glance dashboard. Example from Jenkins:
https://medium.com/kj187/jenkins-job-dashing-widget-cc72feeed654
https://www.level-up.one/6-of-my-favorite-jenkins-plugins/
https://www.datadoghq.com/blog/monitor-jenkins-datadog/
Tribler critical information candidates: