Investigate driving grafana directly from postgres instead of influxdb

spiffxp commented 6 years ago

Followup from the devstats session I attended at KubeCon EU. This is driven in part by me experimenting with Postgres->Grafana directly on my system at a small scale with the full 'gha' database for kubernetes/kubernetes. Curious whether it'd be possible to do this at full scale.

We're starting to get interested in "second order" stats, stats that we can derive from the existing stats. Trends in one dimension can be interesting, but correlating them with trends in other dimensions may be more actionable. The current InfluxDB datasource is very limiting for doing anything beyond selecting a timeseries. ie: we can't compare timeseries, we can't add/subtract/etc.

The InfluxDB instance seems to be there just to hold computed time series metrics, but I'm not sure whether we're getting a lot of value out of it. I'm wondering if it'd be easier to use Postgres to host the computed time series metrics. The grafana postgres datasource could be used to allow combining time series metrics. There may even be some straightforward non-computed metrics that are possible to serve directly from PostgreSQL.

If not PostgreSQL, Prometheus or Graphite would also be possibilities for timeseries and math. However I'm not sure whether they would be useful for tabular or textual data.

cc @lukaszgryglicki

lukaszgryglicki commented 6 years ago

@dankohn already approved this. I'll check in this order:

Postgres as a time series, possibly even on the same database as current GHA database. Just "series" and "tags" table will need to be added, or even just a single table containing all-time series. I would rather avoid creating separate tables for each series name (something, say, well similar to influx series). Series names change often, I would make them column something like series_name. Potential problems with tags and multi-value series with N-values aka/columns and selecting them via regexp from Grafana.
If and only if Postgres approach fails (which I believe won't happen) - I'll investigate Prometheus or Graphite.

Will work on this next week, hopefully, have answers until next week Saturday.

I really hope that Postgres approach would work, so we will drop one item from our graph.

BTW: we are already using grafana postgres datasource for various per-project definitions. See gha_vars.

lukaszgryglicki commented 6 years ago

After reading this I'm almost sure that we can go with Postgres.

lukaszgryglicki commented 6 years ago

@spiffxp you should observe this branch. I've worked on this about 8 hours today - all seem to be very promising for PSQL as a TS DB.

lukaszgryglicki commented 6 years ago

One day POC seems to work, but it uses Giant LOCK approach just to check if we can proceed. If it proves useful and is capable of everything DevStats wants, I'll take a lot more time to get rid of giant lock.

lukaszgryglicki commented 6 years ago

Now I have all TS data in PSQL< but seems like Grafana is not yet stable at displaying data from Postgres, see this: https://k8s.cncftest.io/d/1/activity-repository-groups?orgId=1

Data on postgres is exactly the same as it was on influx: screen shot 2018-05-06 at 09 07 11

But Grafana displays one big mess.

lukaszgryglicki commented 6 years ago

All data on this chart comes from Postgres now. So annotations, template variables, drop-downs, chart. This proves that we can use Postgres as a data source. Now I'm generating data for ALL charts, then I will have to update all of them to use different data source and different sysntax for everything everywhere. Postgres TS database is the same as GHA database - just a single one that contains all data. You can aceess it from ssh bastion.

dankohn commented 6 years ago

Please report on performance differences when you're ready. How long to reprocess all data in each approach? How long to do the hourly updates? How long to run the queries to generate some typical pages?

-- Dan Kohn dan@linuxfoundation.org Executive Director, Cloud Native Computing Foundation https://www.cncf.io +1-415-233-1000 https://www.dankohn.com

On Sun, May 6, 2018 at 3:38 AM, Łukasz Gryglicki notifications@github.com wrote:

All data on this chart https://k8s.cncftest.io/d/1/activity-repository-groups?orgId=1 comes from Postgres now. So annotations, template variables, drop-downs, chart. This proves that we can use Postgres as a data source. Now I'm generating data for ALL charts, then I will have to update all of them to use different data source and different sysntax for everything everywhere. Postgres TS database is the same as GHA database - just a single one that contains all data. You can aceess it from ssh bastion.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cncf/devstats/issues/106#issuecomment-386860508, or mute the thread https://github.com/notifications/unsubscribe-auth/AC8MBhFvdpcUTUQ8dE99il2saJIDOKa-ks5tvqhtgaJpZM4TvfEJ .

lukaszgryglicki commented 6 years ago

Play with this https://k8s.cncftest.io/d/1/activity-repository-groups?orgId=1 I don't see ANY difference in Grafana performance. You can directly compare to InfluxDB one from production: https://k8s.devstats.cncf.io/d/1/activity-repository-groups?orgId=1 Generating data for this chart is about 6 minutes (for all time data). Very similar to InfluxDB one. But I don't have overall stats yet. I'll of course report once I have them.

Amount of changes for postgres as a TS DB (PgaaTSDB) is huge: https://github.com/cncf/devstats/compare/psql_as_tsdb

lukaszgryglicki commented 6 years ago

So I've generated all data during the night. It created 50000+ tables in PSQL.

2 problems:

There were problems with too long identifier names (Postgres has 63 chars limit that cannot be changed without recompiling). I need to update fiunctions that generate table names to handle this.
I think I should merge all tables for different periods (like hourly, daily etc) into single tables containing period column. That will dramatically reduce number of tables needed.

I still think that postgres will work ok as TS DB, just need to make another iteration. Actually today is Monday, all work done before was just after the conference or during the weekend.

lukaszgryglicki commented 6 years ago

Solved two problems described above (by adding special/optional merge mode. In merge mode multiple series can be saved under one table, with column "series" containing their supposed series name). Also problems with indexes and Giant lock are solved too. Now will try to generate full data for just amy 2018 to see if there are no errors and then will just continue...

lukaszgryglicki commented 6 years ago

Finally after lons struggle (see this branch commit history) managed to get rid of Giant LOCK, made histograms works too and balanced series between columns and tables. The final numbe rof series-like table sis < 1000, it was 50000+ for influx. Data seems balanced for me and all metrics worked for just May 2018 (after fixing maybe 5-6 class of errors). Now I'll attempt to generate data for all time.

Then I'll be replacing dashboards queries to use only PSQL. I'm expecting a lot of problems there, but no serious problems that cannot be fixed. Just need a time. I'll be replacing queries for different "classes" of dashboards to identify problems ASAP and possibly regenerate data.

Once I'm sure it will work and I'll do that....

Test coverage
Docs
All shell script
CI
CD
Deploy scripts
probably many more...

But YES we can use Postgress, it seems to be fast, I';ve found a way to use multivalued psql series in Grafana v5.1. So I can just say - we can do it but I need a ton of time to finish it.

lukaszgryglicki commented 6 years ago

@dankohn @spiffxp full regenerate of TS data using PSQL takes 1h45mins for K8s. (It also worked with POC giant lock approach, but it was about 20-25 hours more). For InfluxDB it was taking more than 2 hours but less than 3 AFAIK. So generating postgres data is about 1.2-1.6 times faster. I'll check default cron sync now - that happens every hour.

I see some more tweaks will be needed in series names, so I'll do another round tomorrow, but I think I'm now very very close to the final PSQLaaTSDB. And then all remaining work will start.

lukaszgryglicki commented 6 years ago

Hourly sync (which will run from cron once I merge this) took 2m52s, which is about the same as when using Influx. I remember both higher and lower Influx values, that just depends on data and time of day (some metrics are calculated only at specific hours etc).

Due to redesigning series naming (splitting between table and column names), TSDB data is now 575 tables (i stead of 50000+ series in Influx). Backing it up and restoring is crazy fast using just pg_dump tool.

No more problems with backing up Influx, renaming, no more special Influx tools. Just a single Postgres DB that contains both GHA and TS data, and its dump is just 1650 Mb.

I will proceed tomorrow with actual full replacement Influx with PSQL as a TSDB. I'm now sure PSQL will work as TSDB just fine or better than Influx.

lukaszgryglicki commented 6 years ago

I've generated data for all CNCF project using PSQL. No more issues found during the process. It took few hours, I was porting dashboards in the meantime. I think we will have cron job with PSQL running today, and this experimental branch merged to master (and deployed on the test server), then:

Dashboards on all projects
Test coverage
Docs
All shell scripts
CI/CD, automatic deployments
Probably many more...

2018-05-09 08:28:57 root@cncftest:~/dev/go/src/devstats# GHA2DB_TMOFFSET="-4" GHA2DB_PROJECTS_OVERRIDE="+cncf" GHA2DB_LOCAL=1 PG_PASS={{redacted}} ./devstats
2018-05-09 08:29:10 /devstats: Updating git repos for all projects
2018-05-09 08:29:19 /devstats: Updated git repos, took: 8.267141017s
2018-05-09 08:29:19 /devstats: Syncing #1 kubernetes
2018-05-09 08:31:31 /devstats: Synced kubernetes, took: 2m12.763972231s
2018-05-09 08:31:31 /devstats: Syncing #2 prometheus
2018-05-09 08:31:47 /devstats: Synced prometheus, took: 15.701201545s
2018-05-09 08:31:47 /devstats: Syncing #3 opentracing
2018-05-09 08:31:58 /devstats: Synced opentracing, took: 10.971503034s
2018-05-09 08:31:58 /devstats: Syncing #4 fluentd
2018-05-09 08:32:12 /devstats: Synced fluentd, took: 13.548645482s
2018-05-09 08:32:12 /devstats: Syncing #5 linkerd
2018-05-09 08:32:22 /devstats: Synced linkerd, took: 10.252184776s
2018-05-09 08:32:22 /devstats: Syncing #6 grpc
2018-05-09 08:32:39 /devstats: Synced grpc, took: 17.349142192s
2018-05-09 08:32:39 /devstats: Syncing #7 coredns
2018-05-09 08:32:51 /devstats: Synced coredns, took: 11.509969584s
2018-05-09 08:32:51 /devstats: Syncing #8 containerd
2018-05-09 08:33:01 /devstats: Synced containerd, took: 10.57986459s
2018-05-09 08:33:01 /devstats: Syncing #9 rkt
2018-05-09 08:33:12 /devstats: Synced rkt, took: 10.890258949s
2018-05-09 08:33:12 /devstats: Syncing #10 cni
2018-05-09 08:33:22 /devstats: Synced cni, took: 9.393776557s
2018-05-09 08:33:22 /devstats: Syncing #11 envoy
2018-05-09 08:33:40 /devstats: Synced envoy, took: 18.213643439s
2018-05-09 08:33:40 /devstats: Syncing #12 jaeger
2018-05-09 08:33:51 /devstats: Synced jaeger, took: 10.973505173s
2018-05-09 08:33:51 /devstats: Syncing #13 notary
2018-05-09 08:34:04 /devstats: Synced notary, took: 13.102780083s
2018-05-09 08:34:04 /devstats: Syncing #14 tuf
2018-05-09 08:34:15 /devstats: Synced tuf, took: 11.308650114s
2018-05-09 08:34:15 /devstats: Syncing #15 rook
2018-05-09 08:34:25 /devstats: Synced rook, took: 9.632520543s
2018-05-09 08:34:25 /devstats: Syncing #16 vitess
2018-05-09 08:34:36 /devstats: Synced vitess, took: 11.127305723s
2018-05-09 08:34:36 /devstats: Syncing #17 nats
2018-05-09 08:34:48 /devstats: Synced nats, took: 11.956633875s
2018-05-09 08:34:48 /devstats: Syncing #18 opa
2018-05-09 08:34:57 /devstats: Synced opa, took: 9.378392006s
2018-05-09 08:34:57 /devstats: Syncing #19 spiffe
2018-05-09 08:35:09 /devstats: Synced spiffe, took: 11.666955889s
2018-05-09 08:35:09 /devstats: Syncing #20 spire
2018-05-09 08:35:20 /devstats: Synced spire, took: 11.501200649s
2018-05-09 08:35:20 /devstats: Syncing #100 opencontainers
2018-05-09 08:35:31 /devstats: Synced opencontainers, took: 10.688608393s
2018-05-09 08:35:31 /devstats: Syncing #254 all
2018-05-09 08:37:22 /devstats: Synced all, took: 1m50.774056178s
2018-05-09 08:37:22 /devstats: Syncing #255 cncf
2018-05-09 08:37:28 /devstats: Synced cncf, took: 6.038215684s
2018-05-09 08:37:28 /devstats: Synced all projects in: 8m17.661861129s

Now all this work is merged into master branch.

lukaszgryglicki commented 6 years ago

Test server now runs postgres-only cron. Now I'm updating dashboards (k8s) one by one to change their data source. So now *.cncftest.io mostly doesn't work. Few first dashboards from k8s.cncftest.io already use PSQL DS. Also bastion ssh host allows to examine psql data.

lukaszgryglicki commented 6 years ago

Companies table now ported (first histogram). Also companies stats by repo groups ported. I've decided to drop "companies velocity" - it had just 6 charts displaying the same data as "companies stats by repo groups" when different metrcis are selected. It was slower (because needed to draw 6 charts) and no more needed when verion with metric drop-down was added later. Basically I'm on the "i" letter going from up to down porting dashboards (when I finish developer activity counts table).

lukaszgryglicki commented 6 years ago

After having cron running with psql, results seem to be better than Influx overall:

2018-05-09 15:07:24 root@cncftest:~/dev/go/src/devstats# cat /tmp/gha2db_sync.log | grep 'Synced all proj'
2018-05-09 09:17:17 /devstats: Synced all projects in: 9m16.02351686s
2018-05-09 10:18:09 /devstats: Synced all projects in: 10m7.731429405s
2018-05-09 11:16:56 /devstats: Synced all projects in: 8m55.100043848s
2018-05-09 12:16:49 /devstats: Synced all projects in: 8m48.533675408s
2018-05-09 13:17:13 /devstats: Synced all projects in: 9m11.302038729s
2018-05-09 14:17:17 /devstats: Synced all projects in: 9m15.960236899s
2018-05-09 15:17:35 /devstats: Synced all projects in: 9m33.93054373s
2018-05-09 16:19:07 /devstats: Synced all projects in: 11m5.956899479s
2018-05-09 17:17:49 /devstats: Synced all projects in: 9m47.121225579s
2018-05-09 18:17:35 /devstats: Synced all projects in: 9m33.959628598s

Will see tomorrow after full 24 hours pass. Some data is only calculated at specific hours. Some special values like tags or annotatnions at midnight. DB backups at 3am etc etc. But I do see performance boost over IDB, really.

spiffxp commented 6 years ago

@lukaszgryglicki you are a beast! thanks for keeping this up to date, I will play around and see if anything looks wildly out of line

lukaszgryglicki commented 6 years ago

Definitely postgres behaves a lot better under heavy load. Most time consuming sync happens at midnight and few next hours. Influx was taking anywhere from 40m to 1h20m here, and postgres is way better:

2018-05-09 09:17:17 /devstats: Synced all projects in: 9m16.02351686s
2018-05-09 10:18:09 /devstats: Synced all projects in: 10m7.731429405s
2018-05-09 11:16:56 /devstats: Synced all projects in: 8m55.100043848s
2018-05-09 12:16:49 /devstats: Synced all projects in: 8m48.533675408s
2018-05-09 13:17:13 /devstats: Synced all projects in: 9m11.302038729s
2018-05-09 14:17:17 /devstats: Synced all projects in: 9m15.960236899s
2018-05-09 15:17:35 /devstats: Synced all projects in: 9m33.93054373s
2018-05-09 16:19:07 /devstats: Synced all projects in: 11m5.956899479s
2018-05-09 17:17:49 /devstats: Synced all projects in: 9m47.121225579s
2018-05-09 18:17:35 /devstats: Synced all projects in: 9m33.959628598s
2018-05-09 19:17:23 /devstats: Synced all projects in: 9m22.450361222s
2018-05-09 20:17:09 /devstats: Synced all projects in: 9m8.589393948s
2018-05-09 21:17:17 /devstats: Synced all projects in: 9m15.337409773s
2018-05-09 22:18:34 /devstats: Synced all projects in: 10m33.149076448s
2018-05-09 23:17:07 /devstats: Synced all projects in: 9m5.757967824s
2018-05-10 00:21:20 /devstats: Synced all projects in: 13m18.925154922s
2018-05-10 01:16:25 /devstats: Synced all projects in: 8m24.856155902s
2018-05-10 02:15:54 /devstats: Synced all projects in: 7m53.492984835s
2018-05-10 03:35:54 /devstats: Synced all projects in: 27m52.290835521s
2018-05-10 04:17:24 /devstats: Synced all projects in: 9m23.66391798s

Continuying dashboards from "i" letter...

Also file sizes. Those *.dump file scontain both GHA and TS data for all CNCF projects:

-rw-r--r-- 1 postgres postgres 2091823688 May 10 04:52 /var/www/html/allprj.dump
-rw-r--r-- 1 postgres postgres   13199571 May 10 04:52 /var/www/html/cncf.dump
-rw-r--r-- 1 postgres postgres    7061974 May 10 04:46 /var/www/html/cni.dump
-rw-r--r-- 1 postgres postgres   16804840 May 10 04:46 /var/www/html/containerd.dump
-rw-r--r-- 1 postgres postgres 2187596024 Apr  4 20:42 /var/www/html/contrib.dump
-rw-r--r-- 1 postgres postgres   10573239 May 10 04:46 /var/www/html/coredns.dump
-rw-r--r-- 1 postgres postgres   23170301 Mar 16 04:44 /var/www/html/devstats.dump
-rw-r--r-- 1 postgres postgres   33421144 May 10 04:46 /var/www/html/envoy.dump
-rw-r--r-- 1 postgres postgres   32763017 May 10 04:45 /var/www/html/fluentd.dump
-rw-r--r-- 1 postgres postgres 1732708961 May 10 04:44 /var/www/html/gha.dump
-rw-r--r-- 1 postgres postgres  135838791 May 10 04:46 /var/www/html/grpc.dump
-rw-r--r-- 1 postgres postgres   12399055 May 10 04:46 /var/www/html/jaeger.dump
-rw-r--r-- 1 postgres postgres   15329611 May 10 04:45 /var/www/html/linkerd.dump
-rw-r--r-- 1 postgres postgres   19765705 May 10 04:47 /var/www/html/nats.dump
-rw-r--r-- 1 postgres postgres   10179856 May 10 04:46 /var/www/html/notary.dump
-rw-r--r-- 1 postgres postgres    4881449 May 10 04:47 /var/www/html/opa.dump
-rw-r--r-- 1 postgres postgres   31298439 May 10 04:47 /var/www/html/opencontainers.dump
-rw-r--r-- 1 postgres postgres   15513262 May 10 04:45 /var/www/html/opentracing.dump
-rw-r--r-- 1 postgres postgres   73338088 May 10 04:45 /var/www/html/prometheus.dump
-rw-r--r-- 1 postgres postgres   25105370 May 10 04:46 /var/www/html/rkt.dump
-rw-r--r-- 1 postgres postgres   11703021 May 10 04:47 /var/www/html/rook.dump
-rw-r--r-- 1 postgres postgres    2933345 May 10 04:47 /var/www/html/spiffe.dump
-rw-r--r-- 1 postgres postgres    2817887 May 10 04:47 /var/www/html/spire.dump
-rw-r--r-- 1 postgres postgres    6680512 May 10 04:46 /var/www/html/tuf.dump
-rw-r--r-- 1 postgres postgres   20427525 May 10 04:47 /var/www/html/vitess.dump

And those are from production, they only contain GHA data, TS data is in 20 Influx instances:

-rw-r--r-- 1 postgres postgres 2067016079 May 10 04:50 /var/www/html/allprj.dump
-rw-r--r-- 1 postgres postgres    5357313 May 10 04:45 /var/www/html/cni.dump
-rw-r--r-- 1 postgres postgres   14285741 May 10 04:44 /var/www/html/containerd.dump
-rw-r--r-- 1 postgres postgres    8478477 May 10 04:44 /var/www/html/coredns.dump
-rw-r--r-- 1 postgres postgres   22432290 Mar 16 04:45 /var/www/html/devstats.dump
-rw-r--r-- 1 postgres postgres   30941818 May 10 04:45 /var/www/html/envoy.dump
-rw-r--r-- 1 postgres postgres   24037492 May 10 04:44 /var/www/html/fluentd.dump
-rw-r--r-- 1 postgres postgres 1682823352 May 10 04:44 /var/www/html/gha.dump
-rw-r--r-- 1 postgres postgres  129120641 May 10 04:44 /var/www/html/grpc.dump
-rw-r--r-- 1 postgres postgres    9740269 May 10 04:45 /var/www/html/jaeger.dump
-rw-r--r-- 1 postgres postgres   13264408 May 10 04:44 /var/www/html/linkerd.dump
-rw-r--r-- 1 postgres postgres   13009246 May 10 04:45 /var/www/html/nats.dump
-rw-r--r-- 1 postgres postgres    8770884 May 10 04:45 /var/www/html/notary.dump
-rw-r--r-- 1 postgres postgres    3481785 May 10 04:45 /var/www/html/opa.dump
-rw-r--r-- 1 postgres postgres   27573420 May 10 04:45 /var/www/html/opencontainers.dump
-rw-r--r-- 1 postgres postgres   11587769 May 10 04:44 /var/www/html/opentracing.dump
-rw-r--r-- 1 postgres postgres   60707562 May 10 04:44 /var/www/html/prometheus.dump
-rw-r--r-- 1 postgres postgres   22906178 May 10 04:45 /var/www/html/rkt.dump
-rw-r--r-- 1 postgres postgres   10389535 May 10 04:45 /var/www/html/rook.dump
-rw-r--r-- 1 postgres postgres    2036895 May 10 04:45 /var/www/html/spiffe.dump
-rw-r--r-- 1 postgres postgres    2423657 May 10 04:45 /var/www/html/spire.dump
-rw-r--r-- 1 postgres postgres    4764125 May 10 04:45 /var/www/html/tuf.dump
-rw-r--r-- 1 postgres postgres   17888959 May 10 04:45 /var/www/html/vitess.dump

As you can see difference for k8s is: 1732708961 - 1682823352 Which is just about 50 Mb in compressed dump size.

Additionally all data (GHA and TS) is on the same database, so we can do tricks like using tags values in metric queries, which allows to calculate result exactly for the values from the drop-down or values list displayed on Grafana.

I don't see a single disadvantage of replacing Influx with Postgres. And I'm amazed that Postgres that was not designed to be used as TS DB is faster than dedicated one.

lukaszgryglicki commented 6 years ago

When data for psql TSDB was generated, I forgot about exclude repos, so I now have some data that shouldn't be there. I need to regenerate all data, but first I'll port all dashboards in case I find some other errors - to save time. Issues age gives slightly other results for last week(s) now due to this.

lukaszgryglicki commented 6 years ago

Actually now all dashboards from top to ‘PR workload per SIG table’ (including it) are already ported to Postgres. I needed to add a tool that makes sure that some series (this is not a good name for postgres table anyway) have all required columns. In a typical TS DB we don't need to create tables and columns to define series. Biut with postgres we have to. So for example for User reviewers chart some user(s) have no reviews. In that case no column will be created for that user and then we will see SQL error when attempting to select it.

Solution is simple:

We need a tool that knows all possible column names (columns come from drop-downs, which come from tags variables (select tag_column_name form tags_table)), then we need a list of tables that should always have those columns (regexp). Finally we check if all columns are present on specific tables adding missing columns on the way.

Tool is here its config is here, it is called there (it runs at the same conditions as tags recalculation, but after data is generated) to make sure that no query fail on the Grafana UI.

lukaszgryglicki commented 6 years ago

Now also doing:

Check -race condition, improved log creation by using sync.Once and sync.RWMutex - use Lock() when contex changes, but in other cases just use RLock(), tested full sync with debug info ON, -race on to see if no single race condition is detected.

This is needed again when switched DB engine, devstats is crazy multithreaded and servers are 48 cores, so I'm a bit crazy about double-checing for any race conditions. Binaries compiled with -race are a lot slower, but this sometimes saves days of debugging.

jberkus commented 6 years ago

I'll plan on doing a performance audit of the postgresql DB sometime this summer. In the meantime, if there's specific graphs/operations that are being slow, I can probably fix them, so file an issue and assign them to me. Postgres perf is what I used to do for a living.

lukaszgryglicki commented 6 years ago

Thanks, you can review this file as an expert (search for TSPoint, all TS emulation happens in WriteTSPoints function). Currently, I have no bottlenecks, and psql performs better than Influx. But I'll, of course, ask when I stuck on anything.

lukaszgryglicki commented 6 years ago

Implemented all dashboards up to PRs labels repo groups (including). See https://k8s.cncftest.io/d/47/prs-labels-repository-groups?orgId=1 This one was hardest one so far. It selects label-repo combination (single) for singlestats and one top chart. But then it needs to select multiple columns in X, Y directions, like:

X --> label=const, all repo groups stacked, repo
Y --> repo group=const, all labels stacked. Also percent charts without "All XXX combined". Postgres + Grafana allows that (joins, column values as series name so value as metadata)!
You need anyting for time column
Then you need anything for value column
Finally you need series name (for Grafana UI), You can join the tags tables to get actual name. So this dashboard is the first dashboard that uses join to get its data. I don't see any speed degradataion. Table to join to has just 12 rows (labels) or 18 rows (repo groups). And we join by index columns. Please take a look, I needed to change data generation algorithm, add labels tags, restructure displaying data on this dashboard (percent charts are even better than prod IMHO, because they exclude All XXX combined which always took 50% of percent stacked chart):
https://github.com/cncf/devstats/commit/9c0f989f456113b0a0731852184f5899068c7cc6
https://k8s.cncftest.io/d/47/prs-labels-repository-groups?orgId=1

This is a query that pulls data for one of charts:

select
  $__time(l.time),
  l.value as value,
  lt.pr_labels_tags_name as metric
from
  sprs_labels l,
  tpr_labels_tags lt
where
  l.series = 'prlbl[[repogroup]]' || lt.pr_labels_tags_value
  and lt.pr_labels_tags_name != 'All labels combined'
order by l.time;

screen shot 2018-05-11 at 11 46 37

cc @dankohn @jberkus @spiffxp

lukaszgryglicki commented 6 years ago

Just finished "SIG mentions by categories" - it was tricky because I've used join with tags to get name for "metric" column. The problem was that for different time values we have different metric names, but not always all possible metrics. Even Grafana's "null as zero" won't help here because we don't even have nulls for some metrics values at some time points. I did the trick: First create cartesian product of all TS datetimes x all metric names (metric names come from dynamic query). That way we have a big N x M matrix of TS points needed. Then I just do select * from this cartesian with left join on actual data we have with coalesce(data we have, 0). That way we have matrix N x M of either data values we have or 0's. That make Grafana happy and we can display stacked values correctly, see:

https://k8s.cncftest.io/d/40/sig-mentions-categories?orgId=1

with series as (
select distinct bd.time,
tc.sig_mentions_texts_cat_name as metric_name,
tc.sig_mentions_texts_cat_value as metric_value
from
ssigm_txt_bd bd,
tsig_mentions_texts_cats tc
where
$__timeFilter(bd.time)
and bd.period = '[[period]]'
)
select
s.time as time,
s.metric_name as metric,
coalesce(bd.value, 0) as value
from
series s
left join
ssigm_txt_bd bd
on
bd.series = 'sigm_txt_bd[[sig]]' || s.metric_value
and bd.time = s.time
and bd.period = '[[period]]'
order by
s.time

And

https://k8s.cncftest.io/d/47/prs-labels-repository-groups?orgId=1

with series as (
select distinct l.time as time,
r.all_combined_repo_group_name as metric_name,
r.all_combined_repo_group_value as metric_value
from
sprs_labels l,
tall_combined_repo_groups r
where
$__timeFilter(l.time)
and r.all_combined_repo_group_name != 'All repos combined'
)
select
s.time as time,
s.metric_name as metric,
coalesce(l.value, 0) as value
from
series s
left join
sprs_labels l
on
l.series = 'prlbl' || s.metric_value || '[[label]]'                     
and l.time = s.time
order by
time

and

with series as (
select distinct l.time as time,
r.all_combined_repo_group_name as metric_name,
r.all_combined_repo_group_value as metric_value
from
sprs_labels l,
tall_combined_repo_groups r
where
$__timeFilter(l.time)
and r.all_combined_repo_group_name != 'All repos combined'
)
select
s.time as time,
s.metric_name as metric,
coalesce(l.value, 0) as value
from
series s
left join
sprs_labels l
on
l.series = 'prlbl' || s.metric_value || '[[label]]'                     
and l.time = s.time
order by
time

I need to think about multiple multi-selects tomorrow. I this case we have a list of comma separated values selected by the user as a string. I cannot use this to select columns in PSQL query (by using column name or even LIKE '...%...'). I can probably parse comma separated string to return multiple rows with a single values. Once I figure out how to do it, I will be able to parse any number of multi-selects with dynamic column names just like the queries listed above. May be a bit tricky, I'll see tomorrow.

lukaszgryglicki commented 6 years ago

BTW: postgres does a great job.

No single failure in 4 or 5 days live on test.
Sync times about 50% time of those using Influx.

This is a detailed log of all CNCF Projects sync times since I connected postgres to cron:

(standard input):49:2018-05-09 09:17:17 /devstats: Synced all projects in: 9m16.02351686s
(standard input):98:2018-05-09 10:18:09 /devstats: Synced all projects in: 10m7.731429405s
(standard input):147:2018-05-09 11:16:56 /devstats: Synced all projects in: 8m55.100043848s
(standard input):196:2018-05-09 12:16:49 /devstats: Synced all projects in: 8m48.533675408s
(standard input):245:2018-05-09 13:17:13 /devstats: Synced all projects in: 9m11.302038729s
(standard input):294:2018-05-09 14:17:17 /devstats: Synced all projects in: 9m15.960236899s
(standard input):343:2018-05-09 15:17:35 /devstats: Synced all projects in: 9m33.93054373s
(standard input):392:2018-05-09 16:19:07 /devstats: Synced all projects in: 11m5.956899479s
(standard input):441:2018-05-09 17:17:49 /devstats: Synced all projects in: 9m47.121225579s
(standard input):490:2018-05-09 18:17:35 /devstats: Synced all projects in: 9m33.959628598s
(standard input):539:2018-05-09 19:17:23 /devstats: Synced all projects in: 9m22.450361222s
(standard input):588:2018-05-09 20:17:09 /devstats: Synced all projects in: 9m8.589393948s
(standard input):637:2018-05-09 21:17:17 /devstats: Synced all projects in: 9m15.337409773s
(standard input):686:2018-05-09 22:18:34 /devstats: Synced all projects in: 10m33.149076448s
(standard input):735:2018-05-09 23:17:07 /devstats: Synced all projects in: 9m5.757967824s
(standard input):784:2018-05-10 00:21:20 /devstats: Synced all projects in: 13m18.925154922s
(standard input):833:2018-05-10 01:16:25 /devstats: Synced all projects in: 8m24.856155902s
(standard input):882:2018-05-10 02:15:54 /devstats: Synced all projects in: 7m53.492984835s
(standard input):931:2018-05-10 03:35:54 /devstats: Synced all projects in: 27m52.290835521s
(standard input):980:2018-05-10 04:17:24 /devstats: Synced all projects in: 9m23.66391798s
(standard input):1029:2018-05-10 05:15:53 /devstats: Synced all projects in: 7m52.497051848s
(standard input):1078:2018-05-10 06:18:50 /devstats: Synced all projects in: 10m49.752916355s
(standard input):1127:2018-05-10 07:21:09 /devstats: Synced all projects in: 13m8.349389289s
(standard input):1176:2018-05-10 08:16:21 /devstats: Synced all projects in: 8m20.247234301s
(standard input):1225:2018-05-10 09:16:56 /devstats: Synced all projects in: 8m55.906777962s
(standard input):1274:2018-05-10 10:18:16 /devstats: Synced all projects in: 10m15.176040212s
(standard input):1323:2018-05-10 11:16:57 /devstats: Synced all projects in: 8m55.990261444s
(standard input):1372:2018-05-10 12:16:51 /devstats: Synced all projects in: 8m49.697073157s
(standard input):1421:2018-05-10 13:16:52 /devstats: Synced all projects in: 8m50.852524561s
(standard input):1471:2018-05-10 16:19:12 /devstats: Synced all projects in: 11m11.176455365s
(standard input):1520:2018-05-10 17:17:49 /devstats: Synced all projects in: 9m47.833332429s
(standard input):1569:2018-05-10 18:17:44 /devstats: Synced all projects in: 9m43.005344793s
(standard input):1618:2018-05-10 19:17:22 /devstats: Synced all projects in: 9m20.67762445s
(standard input):1667:2018-05-10 20:17:22 /devstats: Synced all projects in: 9m20.293590885s
(standard input):1716:2018-05-10 21:17:45 /devstats: Synced all projects in: 9m43.989355519s
(standard input):1765:2018-05-10 22:18:44 /devstats: Synced all projects in: 10m42.457674795s
(standard input):1814:2018-05-10 23:16:53 /devstats: Synced all projects in: 8m52.399322218s
(standard input):1863:2018-05-11 00:21:28 /devstats: Synced all projects in: 13m26.711882602s
(standard input):1912:2018-05-11 01:16:12 /devstats: Synced all projects in: 8m10.711520003s
(standard input):1961:2018-05-11 02:15:45 /devstats: Synced all projects in: 7m43.781349853s
(standard input):2010:2018-05-11 03:36:24 /devstats: Synced all projects in: 28m23.019616262s
(standard input):2059:2018-05-11 04:17:26 /devstats: Synced all projects in: 9m25.153023661s
(standard input):2108:2018-05-11 05:15:50 /devstats: Synced all projects in: 7m49.712609446s
(standard input):2157:2018-05-11 06:18:45 /devstats: Synced all projects in: 10m44.201502467s
(standard input):2206:2018-05-11 07:20:55 /devstats: Synced all projects in: 12m53.895361989s
(standard input):2255:2018-05-11 08:16:28 /devstats: Synced all projects in: 8m27.195954846s
(standard input):2304:2018-05-11 09:16:52 /devstats: Synced all projects in: 8m51.629033028s
(standard input):2353:2018-05-11 10:18:16 /devstats: Synced all projects in: 10m15.647633052s
(standard input):2402:2018-05-11 11:16:33 /devstats: Synced all projects in: 8m31.793465709s
(standard input):2451:2018-05-11 12:16:32 /devstats: Synced all projects in: 8m30.034420393s
(standard input):2500:2018-05-11 13:17:10 /devstats: Synced all projects in: 9m8.728948757s
(standard input):2549:2018-05-11 14:17:14 /devstats: Synced all projects in: 9m12.778100873s
(standard input):2598:2018-05-11 15:17:23 /devstats: Synced all projects in: 9m22.343415285s
(standard input):2647:2018-05-11 16:19:15 /devstats: Synced all projects in: 11m14.086812481s
(standard input):2696:2018-05-11 17:17:46 /devstats: Synced all projects in: 9m45.743144303s
(standard input):2745:2018-05-11 18:17:21 /devstats: Synced all projects in: 9m20.358787792s
(standard input):2794:2018-05-11 19:17:17 /devstats: Synced all projects in: 9m15.76357377s
(standard input):2843:2018-05-11 20:17:13 /devstats: Synced all projects in: 9m11.662892093s
(standard input):2892:2018-05-11 21:17:26 /devstats: Synced all projects in: 9m24.72820368s
(standard input):2941:2018-05-11 22:18:42 /devstats: Synced all projects in: 10m41.076609165s
(standard input):2990:2018-05-11 23:16:36 /devstats: Synced all projects in: 8m35.074264692s
(standard input):3039:2018-05-12 00:21:08 /devstats: Synced all projects in: 13m6.967214292s
(standard input):3088:2018-05-12 01:15:47 /devstats: Synced all projects in: 7m46.557434303s
(standard input):3137:2018-05-12 02:15:18 /devstats: Synced all projects in: 7m17.23193088s
(standard input):3186:2018-05-12 03:36:02 /devstats: Synced all projects in: 28m1.348940428s
(standard input):3235:2018-05-12 04:16:52 /devstats: Synced all projects in: 8m51.679619943s
(standard input):3284:2018-05-12 05:15:19 /devstats: Synced all projects in: 7m18.034705503s
(standard input):3333:2018-05-12 06:18:10 /devstats: Synced all projects in: 10m9.292369875s
(standard input):3382:2018-05-12 07:19:38 /devstats: Synced all projects in: 11m36.700740431s
(standard input):3431:2018-05-12 08:15:25 /devstats: Synced all projects in: 7m24.093814886s
(standard input):3480:2018-05-12 09:15:42 /devstats: Synced all projects in: 7m40.263972851s
(standard input):3529:2018-05-12 10:17:10 /devstats: Synced all projects in: 9m9.114212754s
(standard input):3578:2018-05-12 11:15:20 /devstats: Synced all projects in: 7m18.445290232s
(standard input):3627:2018-05-12 12:15:26 /devstats: Synced all projects in: 7m25.258221219s
(standard input):3676:2018-05-12 13:15:52 /devstats: Synced all projects in: 7m50.090605834s
(standard input):3725:2018-05-12 14:15:43 /devstats: Synced all projects in: 7m42.735060808s
(standard input):3774:2018-05-12 15:15:49 /devstats: Synced all projects in: 7m48.063506569s
(standard input):3823:2018-05-12 16:17:34 /devstats: Synced all projects in: 9m33.243364995s
(standard input):3872:2018-05-12 17:16:03 /devstats: Synced all projects in: 8m1.972618639s
(standard input):3921:2018-05-12 18:15:42 /devstats: Synced all projects in: 7m40.906109953s
(standard input):3970:2018-05-12 19:15:50 /devstats: Synced all projects in: 7m49.427854534s
(standard input):4019:2018-05-12 20:15:44 /devstats: Synced all projects in: 7m42.653975923s
(standard input):4068:2018-05-12 21:15:50 /devstats: Synced all projects in: 7m48.528172099s
(standard input):4117:2018-05-12 22:17:24 /devstats: Synced all projects in: 9m22.353820156s
(standard input):4166:2018-05-12 23:15:26 /devstats: Synced all projects in: 7m25.340764631s
(standard input):4215:2018-05-13 00:19:55 /devstats: Synced all projects in: 11m53.71099773s
(standard input):4264:2018-05-13 01:15:02 /devstats: Synced all projects in: 7m0.607304628s
(standard input):4313:2018-05-13 02:14:56 /devstats: Synced all projects in: 6m55.11967206s
(standard input):4362:2018-05-13 03:35:53 /devstats: Synced all projects in: 27m51.68737235s
(standard input):4411:2018-05-13 04:16:37 /devstats: Synced all projects in: 8m35.993087824s
(standard input):4460:2018-05-13 05:15:04 /devstats: Synced all projects in: 7m3.432966498s
(standard input):4509:2018-05-13 06:18:04 /devstats: Synced all projects in: 10m3.586240382s
(standard input):4558:2018-05-13 07:19:02 /devstats: Synced all projects in: 11m0.888559012s
(standard input):4607:2018-05-13 08:15:03 /devstats: Synced all projects in: 7m1.397977688s
(standard input):4656:2018-05-13 09:15:34 /devstats: Synced all projects in: 7m32.539722799s
(standard input):4705:2018-05-13 10:17:04 /devstats: Synced all projects in: 9m3.179977044s
(standard input):4754:2018-05-13 11:15:20 /devstats: Synced all projects in: 7m19.099508279s
(standard input):4803:2018-05-13 12:15:25 /devstats: Synced all projects in: 7m23.56602347s
(standard input):4852:2018-05-13 13:15:41 /devstats: Synced all projects in: 7m39.382575339s
(standard input):4901:2018-05-13 14:15:38 /devstats: Synced all projects in: 7m37.388943205s

lukaszgryglicki commented 6 years ago

Multi value drop down values parse in where condition solved and works (SIG mentions categories dashboard):

with series as (
  select distinct bd.time,
    tc.sig_mentions_texts_cat_name as metric_name,
    tc.sig_mentions_texts_cat_value as metric_value
  from
    ssigm_txt_bd bd,
    tsig_mentions_texts_cats tc
  join (
    select trim(both '"' from unnest(string_to_array('${sigcats:csv}', ','))) as cat
  ) cats
  on
    cats.cat = tc.sig_mentions_texts_cat_name
  where
    $__timeFilter(bd.time)
    and bd.period = '[[period]]'
)
select
  s.time as time,
  s.metric_name as metric,
  coalesce(bd.value, 0) as value
from
  series s
left join
  ssigm_txt_bd bd
on
  bd.series = 'sigm_txt_bd[[sig]]' || s.metric_value
  and bd.time = s.time
  and bd.period = '[[period]]'
order by
  s.time

See unpacking string to rows in this part: select trim(both '"' from unnest(string_to_array('${sigcats:csv}', ','))) as cat.

lukaszgryglicki commented 6 years ago

All k8s dashboards ported. Now I'll port other project's dashboards (all of them are very similar, but there are plenty of them, I already have some mass update tools - I'll use them).

lukaszgryglicki commented 6 years ago

Home & Activity repo groups dashbords ported for all CNCF projects.
InfluxDB data source removed on all CNCF projects.
Some shared template variables made to use postgres on all dashboards in all CNCF projects.

lukaszgryglicki commented 6 years ago

devstats now use postgres as a time series DB on both test & prod servers.
OS was updated on both servers to Ubuntu 18.04 LTS
Postgres was updated on both to 10.3.

Closing.

cncf / devstats.archive

Investigate driving grafana directly from postgres instead of influxdb #106