Design proposal for the recording and presentation of deployment usage stats

adulbrich commented 3 years ago

Related to #2528

imsdu commented 3 years ago

On the collection of data to later compute the stats

An alternative to using InfluxDB would be to use Elasticsearch as a time-series database:

We already make extensive use of ES in the Nexus platform
Grafana has an Elasticsearch connector
InfluxDB is only highly available in the enterprise version
InfluxDB is another component to deploy, monitor and master properly in the platform (InfluxQL/Flux, integration tests, ...)
Elasticsearch may be slower than InfluxDB for ingesting/querying, but we don't really care about that as we can accept some delay in the stats (and there is the Cassandra eventual consistency anyway)

An article that compares InfluxDB to another time-series database: https://blog.timescale.com/blog/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877/ It is written by a competitor, but they make valid points about InfluxDB.

Besides the deployment usage stats, this data could be used to power and replace the current implementation of ProjectCounts and StorageStatistics by making some calls to ES to compute the stats.

Visualization in Grafana

Copy/paste from https://grafana.github.io/grafonnet-lib/:

A dashboard in Grafana is represented by a JSON object. 

While this choice makes sense from a technical point of view, people who want to keep those dashboards under version control end up putting large, independent JSON files under source control.

When doing so, it is hard to maintain the same links, templates, or even annotation between graphs. It usually requires a lot of custom tooling to change and keep those Json files aligned. 

There are alternatives, like grafanalib, that make things easier. However, as Grafonnet is using Jsonnet, a superset of JSON, it gives you out of the box a very easy way to use any feature of Grafana that would not be covered by Grafonnet already.

I have never used it, but I do know that maintaining these large JSON objects in git is a real pain.

umbreak commented 3 years ago

On the Elasticsearch side, are you suggesting something like this? I'm not sure we need the ILM bit though...

Seems doable and probably easier than having InfluxDB, but it would require some work

imsdu commented 3 years ago

Without ILM, and even without most of the things on the page you linked, as we would not use Kibana either.

The main difference with the views is that we would index the events and not the resources.

We could use the rollover API to write to a new index when the current one reaches a certain size, but it is something we could skip too: https://www.elastic.co/guide/en/elasticsearch/reference/master/indices-rollover-index.html

In what way do you think it would require more work than InfluxDB?

The idea would be:

to rewrite/complete ProjectCounts and StorageStatistics so as not to compute the stats but to push relevant information for each event to ES
and to write aggregation queries to get the stats we need.
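
As an illustration of the second point, here is a minimal sketch of what such an aggregation query could look like when restricted to a single project. The field names `project` and `bytes` and the helper `fileSizeQuery` are assumptions made for the example, not the actual index mapping; the request body only relies on the standard Elasticsearch `term` query and `sum` aggregation.

```scala
object StatsQueries {

  // Minimal JSON string escaping, enough for the project reference used below.
  private def escape(s: String): String =
    s.replace("\\", "\\\\").replace("\"", "\\\"")

  /** Search body summing the file sizes recorded for one project.
    * `project` and `bytes` are hypothetical field names of the metrics index.
    */
  def fileSizeQuery(project: String): String =
    s"""{
       |  "size": 0,
       |  "query": { "term": { "project": "${escape(project)}" } },
       |  "aggs": { "total_file_size": { "sum": { "field": "bytes" } } }
       |}""".stripMargin
}
```

Setting `"size": 0` asks ES to return only the aggregation result and no hits, which keeps the response small.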

umbreak commented 3 years ago

I agree that the in-memory cache implementations (ProjectsCounts and StoragesStatistics), computed using fs2 streams, can be replaced with ES, and at the same time we can draw stats from it. If we store some basic information for each event (instant, project, deprecation status, event type, resource @type) in an index per project, we could answer the following questions:

number of active projects
project creation
project deprecation status
project deletion (I'm not sure what the expectation is here, since if we delete the index and rerun the process, the numbers will be different)
resource count per project
file count per project
total file size per project
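
To make the idea concrete, the following sketch computes per-project counts from such minimal event records. It works on an in-memory collection rather than on ES, and `IndexedEvent`, `ProjectCount` and `countsPerProject` are hypothetical names introduced for the example; the grouping step plays the role a terms aggregation on the project field would play in ES.

```scala
import java.time.Instant

object EventStats {

  // Hypothetical flattened event document; field names are illustrative only.
  final case class IndexedEvent(
      instant: Instant,
      project: String,
      eventType: String,
      resourceId: String
  )

  final case class ProjectCount(events: Long, resources: Long, lastEvent: Instant)

  /** Groups events by project and derives the counts for each group. */
  def countsPerProject(events: Seq[IndexedEvent]): Map[String, ProjectCount] =
    events.groupBy(_.project).map { case (project, es) =>
      project -> ProjectCount(
        events = es.size.toLong,
        // Several events can target the same resource, so count distinct ids.
        resources = es.map(_.resourceId).distinct.size.toLong,
        // Groups produced by groupBy are never empty, so reduce is safe here.
        lastEvent = es.map(_.instant).reduce((a, b) => if (a.isAfter(b)) a else b)
      )
    }
}
```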

However we would still not be able to answer the following question:

file size distribution per project (this is the size of the distribution field on the source payload). In order to achieve that we would have to do a bit of graph navigation / expanded JSON-LD cursor navigation.

There are a few disadvantages though:

Having to do an aggregate query to get the ProjectCounts and StoragesStatistics can be much more expensive / complex than a query to a cache. I'm not sure though what exact latency we would be expecting.

imsdu commented 3 years ago

For the project deletion and the file size distribution, the question remains the same whether we use ES or InfluxDB, but yes, these are the toughest ones.

For the latency, there will not be complex aggregations (nothing nested, for example) and we can request only the ones we are interested in. And if we hit only one shard and expect the ES cache to do its job, it should remain low.

imsdu commented 3 years ago

On the Delta side, the implementation could look like this:

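import java.time.Instant
import io.circe.JsonObject
// Subject, ProjectRef, Label, Iri and Metric are the existing Delta types

sealed trait Action

object Action {
  // Each case object must extend Action to be usable as an Action value
  case object Create    extends Action
  case object Update    extends Action
  case object Deprecate extends Action
  case object Tagged    extends Action
  case object Deleted   extends Action
}

final case class EventMetric(
    instant: Instant,
    subject: Subject,
    action: Action,
    project: ProjectRef,
    organization: Label,
    id: Iri,
    types: Set[Iri],
    additionalFields: JsonObject
) extends Metric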

Here additionalFields would hold information specific to a type of resource, such as the size of a file or the size of a distribution.

A new method in EventExchange would return the metric for an event:

def toMetric(event: Event): UIO[Option[EventMetric]]

The UIO is needed at least for files: we have to fetch the file to get the storage id, as it is not present in every kind of event.

A stream would run on the project events, get the metric from the event and push to a single index that would store the metrics.

This index would then be queried to provide project and storage statistics for Delta and for the dashboards.
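
The flow described above can be sketched as follows. This is a simplified, synchronous illustration only: `Event`, `EventMetric` and `MetricsIndex` are stand-ins defined for the example and not the Delta types, `toMetric` returns a plain `Option` instead of a `UIO[Option[EventMetric]]`, the choice of which event kinds produce a metric is invented, and the index is an in-memory buffer rather than Elasticsearch.

```scala
import java.time.Instant
import scala.collection.mutable.ListBuffer

object MetricsPipeline {

  // Simplified stand-ins for the Delta types; not the real definitions.
  final case class Event(instant: Instant, project: String, id: String, kind: String)
  final case class EventMetric(instant: Instant, project: String, id: String, action: String)

  // In-memory stand-in for the single Elasticsearch metrics index.
  final class MetricsIndex {
    private val docs = ListBuffer.empty[EventMetric]
    def push(metric: EventMetric): Unit = { docs += metric; () }
    def all: List[EventMetric] = docs.toList
  }

  // Synchronous stand-in for `toMetric`; events without a metric yield None.
  // The set of kinds matched here is an assumption made for the example.
  def toMetric(event: Event): Option[EventMetric] =
    event.kind match {
      case "Created" | "Updated" | "Deprecated" =>
        Some(EventMetric(event.instant, event.project, event.id, event.kind))
      case _ => None
    }

  /** Runs over the events in order and pushes each available metric to the index. */
  def run(events: Iterator[Event], index: MetricsIndex): Unit =
    events.flatMap(e => toMetric(e).iterator).foreach(index.push)
}
```

In the real implementation the effectful `toMetric` would be evaluated inside the stream, and writes to ES would presumably be batched rather than pushed one by one.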

I tested on my laptop with around 10M events for ~9000 projects; the dashboards were quite responsive (around 2s for the most expensive one, which was the sum of file size per project).

On the dashboard side:

It is easier to create dashboards with the UI in Kibana (autocompletion for the field values helps a lot) than with Grafana (with or without Grafonnet).

With Grafonnet, it was even more difficult to get to a result, so I think we can forget it. It is not ideal to have giant unreadable JSON blobs in git, but it is even less ideal to need a week to create a dashboard, even for somebody who knows the Jsonnet language, when with Kibana you can do it in half a day (or a day for someone who does not know Kibana).

Grafana

Pros:

Same UI for Prometheus/Elasticsearch data
More options for customizing look and feel with conditional formatting, changing the background, ...
Allows importing/exporting dashboards
Allows forms to narrow the dashboard stats (to a single project, for example)
Ability in Grafana 8.x to use library (=reusable) panels (not available in the version currently deployed in prod)
Ability to tag dashboards and panels to find them again

Cons:

Need to know the Lucene query syntax
No auto-completion
When a query fails, feedback is poor, so the error can be hard to fix

Kibana

Pros:

Auto-completion is present almost everywhere, so it helps a lot to build the dashboards
Ability to use Lucene query syntax / Query DSL / Kibana Query Language
Feedback on errors is better
A lot easier to get a grip on (everybody should be able to contribute quite easily)
Allows importing/exporting dashboards
Allows forms to narrow the dashboard stats (to a single project, for example), multi-select is possible
Ability to use library (=reusable) panels
Can be used to create dashboards on views indices too (to get stats and/or debug)
Editor with auto-completion to create queries and history of the last queries (nice to debug the content of an index)

Cons:

Not as many options to customize the look and feel of the dashboards
Another app to deploy
Only for ES stuff

To get an idea of how Grafana looks, look at the instance in production or see the screenshots here: https://grafana.com/grafana/

A dashboard I created in Kibana (the data I generated is too uniform to produce interesting charts, but it gives an idea): Screenshot 2021-08-23 at 17-04-19 Test - Elastic

A demo screenshot of a dataset included with Kibana: Screenshot 2021-08-23 at 17-05-27 eCommerce Revenue Dashboard - Elastic

umbreak commented 3 years ago

Action shouldn't be exposed like that, since every plugin can potentially have any "actions" (commands and events are not tied to Create/Update/Tag/Deprecate). Files, for example, have events related to attributes, which have nothing to do with create/update/...

umbreak commented 3 years ago

Powering ProjectsCounts with ES would mean that we will have to remove the ProjectsCounts index of that project when a delete of the project is issued.

I'm just pointing it out here as something to take into account.

umbreak commented 3 years ago

Another thing to be considered:

For ViewStatistics we usually assumed that the projectsStatistics (retrieved from ProjectsCounts) were always ahead of the actual view stream counts (especially for composite views). That might not be the case anymore, since the project events have to be indexed and made available in ES.

imsdu commented 3 years ago

This assumption was kind of difficult to make anyway, no? The streams are independent and don't work on the same events, everything is eventually consistent, ...

umbreak commented 3 years ago

Well, the assumption was that reading one stream and just adding counts to it would be faster than reading another stream + doing JSON-LD conversions + indexing things into ES.

It was not strictly guaranteed that one would finish before the other, but in practice it does.

imsdu commented 3 years ago

I agree with you on this point.

But the streams are mostly idling, and which one wins also depends on when each stream made its last poll.

BlueBrain / nexus

Design proposal for the recording and presentation of deployment usage stats #2683