MatrixAI / Polykey

Polykey Core Library
https://polykey.com
GNU General Public License v3.0

Setup `audit` domain for tracking user/action events and metrics #628

Closed: amydevs closed this issue 10 months ago

amydevs commented 10 months ago

Specification

The audit domain of Polykey will be responsible for auditing the user behaviour of Polykey nodes.

It should expose JS-RPC server-streaming handlers that yield audit events, and can also provide summary metrics of those audit events.

The subdomains of audit should be based on the domains of Polykey itself, and the audit domain should also contain Observable properties (see #444) derived from the events dispatched by each subdomain.

The JS-RPC API should be available via JS-WS, so that it is accessible from services like the mainnet/testnet status page (see #599).

Furthermore, this JS-RPC API should replay all accumulated state for each metric upon the initial opening of the server-streaming call, so that if any connected service were to restart, it would still receive all of the existing metric data, much like the rxjs shareReplay operator.
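
For illustration, the replay semantics being referenced are roughly those of shareReplay; the snippet below is only an analogy with hypothetical names (metricUpdates is not a real Polykey stream), not the actual Polykey API:

```ts
import { Subject, shareReplay } from 'rxjs';

// Hypothetical metric stream; `metricUpdates` is illustrative only.
const metricUpdates = new Subject<number>();
const sharedMetrics = metricUpdates.pipe(
  // Buffer every value emitted so far, so late subscribers see the full history.
  shareReplay({ bufferSize: Infinity, refCount: false }),
);

sharedMetrics.subscribe(); // connect once so values start being buffered
metricUpdates.next(1);
metricUpdates.next(2);

// A service that (re)connects later still receives 1 and 2 before any new values.
sharedMetrics.subscribe((v) => console.log('replayed or live value:', v));
```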


Audit Events

Event Flow

Domain class instances will be injected as dependencies into the Audit domain. This means that other domains can expose any data they want recorded via events, without carrying any semantics about the Audit domain themselves. The Audit domain can listen to these events and record them in the database.
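
A minimal sketch of that wiring, using hypothetical names (Audit, AuditStore, EventNodeConnection) rather than the real Polykey identifiers:

```ts
// Sketch only: the emitting domain dispatches plain events; the Audit domain
// listens and persists them, adding no semantics to the emitter.
interface AuditStore {
  put(keyPath: Array<string>, value: unknown): Promise<void>;
}

class Audit {
  constructor(
    protected store: AuditStore,
    protected nodeConnectionManager: EventTarget, // injected domain instance
  ) {}

  public start(): void {
    this.nodeConnectionManager.addEventListener('EventNodeConnection', (evt) => {
      const eventId = Date.now().toString(); // stand-in for an IdSortable id
      void this.store.put(['audit', 'events', eventId], { type: evt.type });
      void this.store.put(['audit', 'topic', 'node.connection', eventId], null);
    });
  }
}
```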

Database Schema

Using js-db, there will be several levels for the Audit domain:

eventIds will be made using IdSortable, so that they are completely monotonic. Furthermore, events can be accessed by iterating over a topic level (audit/topic/{topicId}), which yields multiple eventIds. These reference the events stored under audit/events/{eventId}. By doing this, events are able to be a part of multiple topics as well.

Topics can be nested, meaning that querying the topic path ['node', 'connection'] will return all audit events from its children (['node', 'connection', 'reverse'] and ['node', 'connection', 'forward']).
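
A toy illustration of those prefix semantics (an in-memory stand-in for the js-db levels, not the real key encoding):

```ts
// Event bodies live under audit/events/{eventId}; topic levels only hold
// references, so one event can appear under several topics.
const events = new Map<string, object>([
  ['evt1', { detail: 'forward connection' }],
  ['evt2', { detail: 'reverse connection' }],
]);
const topicIndex: Array<[topicPath: Array<string>, eventId: string]> = [
  [['node', 'connection', 'forward'], 'evt1'],
  [['node', 'connection', 'reverse'], 'evt2'],
];

// Querying ['node', 'connection'] matches both children by prefix comparison.
function* getByTopicPrefix(prefix: Array<string>) {
  for (const [topicPath, eventId] of topicIndex) {
    if (prefix.every((part, i) => topicPath[i] === part)) {
      const event = events.get(eventId);
      if (event != null) yield event;
    }
  }
}

console.log([...getByTopicPrefix(['node', 'connection'])]); // yields both events
```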

API

The basic API will use an AsyncGenerator that yields events from a specified topic:

async function* getAuditEvents(topicPath: Array<string>, options: { seek?: EventId, seekEnd?: EventId, order?: 'asc' | 'desc', limit?: number }, tran?: DBTransaction): AsyncGenerator<AuditEvent>

The options offer pagination: the user can limit the number of audit events that the generator will yield, then call the generator again with seek set to the EventId of the last element that was yielded.
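
A hedged usage sketch of that pagination loop; the declared audit surface and the AuditEvent/EventId shapes are assumptions for the example, not the final types:

```ts
type EventId = string;
type AuditEvent = { id: EventId; [key: string]: unknown };
declare const audit: {
  getAuditEvents(
    topicPath: Array<string>,
    options: { seek?: EventId; order?: 'asc' | 'desc'; limit?: number },
  ): AsyncGenerator<AuditEvent>;
};

// Page through audit events, resuming each page from the last EventId seen.
async function consumeAllEvents(handle: (e: AuditEvent) => void): Promise<void> {
  let seek: EventId | undefined;
  while (true) {
    let yielded = 0;
    for await (const event of audit.getAuditEvents(['node', 'connection'], {
      seek,
      order: 'asc',
      limit: 100,
    })) {
      // `seek` is inclusive, so skip the boundary event on resumed pages.
      if (seek != null && event.id === seek) continue;
      handle(event);
      seek = event.id;
      yielded++;
    }
    if (yielded === 0) break; // nothing new beyond the last page
  }
}
```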

The second API method yields events live as they are being dispatched:

async function* getAuditEventsLongRunning(topicPath: Array<string>, options: { seek?: EventId, seekEnd?: EventId, limit?: number }): AsyncGenerator<AuditEvent>

The reason the parameters differ is that the iteration of new events beyond what is currently stored in the DB cannot be in any order other than chronologically ascending. Furthermore, as this method requires multiple DB transaction snapshots, there is no point in the caller passing in a transaction to perform on. Note that generator.return() or generator.throw() must be called on the returned AsyncGenerator when neither seekEnd nor limit is specified, as the call will otherwise run indefinitely until either throw or return is called, seekEnd or limit is reached, or audit.stop({ force: true }) is called.
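
A sketch of consuming the long-running variant and explicitly closing it, per the note above (the declared audit surface is an assumption):

```ts
declare const audit: {
  getAuditEventsLongRunning(
    topicPath: Array<string>,
    options: { seek?: string; seekEnd?: string; limit?: number },
  ): AsyncGenerator<unknown>;
};

// Tail audit events live; with no seekEnd/limit this never completes on its
// own, so the generator must be closed explicitly when the consumer is done.
async function tailForOneMinute(): Promise<void> {
  const gen = audit.getAuditEventsLongRunning(['node', 'connection'], {});
  const timer = setTimeout(() => void gen.return(undefined), 60_000);
  try {
    for await (const event of gen) {
      console.log('live audit event', event);
    }
  } finally {
    clearTimeout(timer);
  }
}
```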

Metrics

Metrics will need to be specced out further. Currently, metrics are indexed by a MetricPath, similar to AuditEvents. However, they are not stored in the DB, but rather derived from data within the DB.

API

The basic API returns a metric based on a topicPath and accepts seek and seekEnd to specify a timeframe for the metric results. Metrics will have to be implemented on a case-by-case basis.

async function getAuditMetric(topicPath: Array<string>, options: { seek?: EventId, seekEnd?: EventId }): Promise<AuditMetric>
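
As a case-by-case example, a simple count metric could be derived by folding over the stored events; this sketch assumes a count-shaped result and the getAuditEvents signature above:

```ts
type EventId = string;
declare function getAuditEvents(
  topicPath: Array<string>,
  options: { seek?: EventId; seekEnd?: EventId; order?: 'asc' | 'desc' },
): AsyncGenerator<unknown>;

// Hypothetical metric: count of connection events within the requested window.
async function getConnectionCountMetric(
  options: { seek?: EventId; seekEnd?: EventId } = {},
): Promise<{ count: number }> {
  let count = 0;
  for await (const _event of getAuditEvents(['node', 'connection'], {
    ...options,
    order: 'asc',
  })) {
    count++;
  }
  return { count };
}
```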

Possibly Relevant Metrics

Some Specific Metrics Include:

Additional context

Tasks

  1. Expose metrics from Polykey domains as events
  2. Use Observables to convert EventTargets to streams
  3. Convert Observable streams to WebStreams/AsyncIterables for usage with JS-RPC (see the sketch below).
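
A rough sketch of tasks 2 and 3: converting an EventTarget into an Observable, then into an AsyncIterable that a JS-RPC server-streaming handler can yield from. This is one possible conversion, not necessarily how Polykey wires it:

```ts
import { fromEvent, Observable } from 'rxjs';

// Task 2: an EventTarget becomes an Observable stream of its events.
function observeAuditEvents(target: EventTarget, type: string): Observable<Event> {
  return fromEvent<Event>(target, type);
}

// Task 3: an Observable becomes an AsyncIterable via a simple unbounded buffer,
// which a JS-RPC server-streaming handler can then yield from.
async function* toAsyncIterable<T>(obs: Observable<T>): AsyncGenerator<T> {
  const queue: Array<T> = [];
  let notify: (() => void) | null = null;
  let done = false;
  let failure: unknown;
  const sub = obs.subscribe({
    next: (v) => { queue.push(v); notify?.(); },
    error: (e) => { failure = e; done = true; notify?.(); },
    complete: () => { done = true; notify?.(); },
  });
  try {
    while (true) {
      if (queue.length > 0) { yield queue.shift()!; continue; }
      if (done) { if (failure !== undefined) throw failure; return; }
      await new Promise<void>((r) => { notify = r; });
      notify = null;
    }
  } finally {
    sub.unsubscribe();
  }
}
```
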
CMCDragonkai commented 10 months ago

Some important terminology to avoid being confused here.

  1. Log - a log is just a record of something happening - they could be operational logs, they could be debug level, they could be info level - they are done via js-logger, and we expect to print them to STDERR in real time without buffering, and that output is supposed to be collected by an orchestrator for operational analytics. These do not need to represent a state change. Logs are useful for eventually forming traces, which can be used for operational observability, which is useful for debugging.
  2. Event - these are structured things that are supposed to be reacted to - we have programmatic events like js-events, which serve as the basis for #444 and the implementation of observables in the future. Usually these represent a state change, but not always.
  3. Metrics - these are statistical summarisations. The basic metrics are:

Now there are 2 kinds of things we want to observe:

  1. Operational Observability
  2. User/Action Observability

For this issue, the audit domain is focused on User/Action Observability. Not operational observability.

In that sense, we would want to:

  1. Watch for Events Representing State Change
  2. Record them into DB - thus representing the events as a log
  3. The audit domain only reacts to events by recording them - it doesn't do anything logically
  4. The audit domain can update metrics as new events come into play.

As for operational logs/metrics: again, logs are not kept around; they go to STDERR. However, metrics can be kept somewhere in a separate area. There is a discussion about this here: https://github.com/MatrixAI/js-logger/issues/15. It makes sense that something else should be maintaining the state of operational observability, not the application itself. That way a focused system can specialise in operational observability. Usually this means something open-telemetry based. Things like memory usage are a good place to start.

One question is whether something is operational or not. Consider tracking node connections: is this operational or is it a user/application event? It's hard to provide a clear distinction here. For a network monitoring app it would be part of auditing; for this it is less so. I think, though, that for the purposes of the testnet and mainnet dashboard, this is something we will need to track in the audit domain.

CMCDragonkai commented 10 months ago

I was thinking that one needs to be able to have a streaming query.

So in some cases you can have a fixed snapshot query, which is the default case when iterating over a rocksdb snapshot.

In other cases you would want an asynciterable over all existing records and any new records that have entered. In this case you have an infinite iterator that never ends, unless the client decides to stop reading (by destroying it somehow).

CMCDragonkai commented 10 months ago

In other cases you would want an asynciterable over all existing records and any new records that have entered. In this case you have an infinite iterator that never ends, unless the client decides to stop reading (by destroying it somehow).

To do this you may consider a cursor.
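
A rough sketch of that cursor pattern: iterate what currently exists, remember the last id, wait for a signal that new records arrived, then iterate again from the cursor (the RecordSource interface is a stand-in, not js-db's API):

```ts
// Stand-in source; a real implementation would use js-db iterators plus events.
type AuditRecord = { id: string; payload: unknown };
interface RecordSource {
  // Returns records with an id strictly greater than `after` (or all if omitted).
  read(after?: string): Promise<Array<AuditRecord>>;
  // Resolves once at least one new record has been written.
  onNewRecord(): Promise<void>;
}

// Infinite tail: the consumer stops it by calling .return() on the generator.
async function* tailRecords(source: RecordSource): AsyncGenerator<AuditRecord> {
  let cursor: string | undefined;
  while (true) {
    const batch = await source.read(cursor); // snapshot read from the cursor
    for (const record of batch) {
      cursor = record.id; // advance past what has been yielded
      yield record;
    }
    if (batch.length === 0) {
      await source.onNewRecord(); // block until more records exist
    }
  }
}
```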

CMCDragonkai commented 10 months ago


CMCDragonkai commented 10 months ago

audit domain plan

CMCDragonkai commented 10 months ago

We won't ever expect PK to have to include graphing libraries - definitely not in PK CLI - maybe in PK Desktop or PK Mobile - it'd have to be extremely lightweight though; we don't want to bloat it up.

But operational metrics will go to grafana.

tegefaulkes commented 10 months ago

My go-to for visualisation is https://d3js.org/; it's pretty lightweight (280 kB) and only needs a canvas or SVG to render.

amydevs commented 10 months ago

In other cases you would want an asynciterable over all existing records and any new records that have entered. In this case you have an infinite iterator that never ends, unless the client decides to stop reading (by destroying it somehow).

To do this you may consider a cursor.

@CMCDragonkai should the seeking with the cursor include the element with the id that you seeked?

CMCDragonkai commented 10 months ago

In other cases you would want an asynciterable over all existing records and any new records that have entered. In this case you have an infinite iterator that never ends, unless the client decides to stop reading (by destroying it somehow).

To do this you may consider a cursor.

@CMCDragonkai should the seeking with the cursor include the element with the id that you seeked?

Yes, it's always inclusive. And if there is an "ending seek", you always do inclusive then exclusive. It's the Pythonic way; actually, it's standard math notation for range selection.
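
In interval terms that is the half-open range [seek, seekEnd); a tiny sketch of the filter those bounds imply, assuming EventIds compare lexicographically:

```ts
// Inclusive start, exclusive end: [seek, seekEnd)
function inRange(id: string, seek?: string, seekEnd?: string): boolean {
  if (seek != null && id < seek) return false; // before the inclusive start
  if (seekEnd != null && id >= seekEnd) return false; // at or past the exclusive end
  return true;
}
```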

CMCDragonkai commented 10 months ago

@amydevs I've added related issue #179 because it was already there; you should always do a quick search on the board. You should review that too, and close it if you can incorporate its tasks/spec into here.

amydevs commented 10 months ago

getAuditEvents and getAuditEventsLongRunning need not be combined at this level. The reason is that getAuditEvents takes a parameterized transaction, whilst getAuditEventsLongRunning takes multiple transaction snapshots in order to support live data. Instead, this functionality will be combined in the associated RPC handler, where the transaction is abstracted away.

CMCDragonkai commented 10 months ago

If you are streaming the results live, shouldn't you just abstract it all in the handler and only need 1 handler?

amydevs commented 10 months ago

If you are streaming the results live, shouldn't you just abstract it all in the handler and only need 1 handler?

Yes, I've abstracted it so that only one handler is used. The one handler will appropriately switch between the long-running and normal versions of getAuditEvents.
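
A sketch of how that single handler could switch, assuming a hypothetical follow flag on the request and the getAuditEvents signatures specced earlier (not the actual Polykey handler code):

```ts
type EventId = string;
type AuditEvent = { id: EventId; [key: string]: unknown };
// Assumed surface of the audit domain, following the signatures specced above.
interface AuditLike {
  getAuditEvents(
    topicPath: Array<string>,
    options: { seek?: EventId; seekEnd?: EventId; order?: 'asc' | 'desc'; limit?: number },
  ): AsyncGenerator<AuditEvent>;
  getAuditEventsLongRunning(
    topicPath: Array<string>,
    options: { seek?: EventId; seekEnd?: EventId; limit?: number },
  ): AsyncGenerator<AuditEvent>;
}

// One server-streaming handler body: replay stored events, or keep tailing live.
async function* handleAuditEventsGet(
  audit: AuditLike,
  params: { topicPath: Array<string>; seek?: EventId; seekEnd?: EventId; limit?: number; follow?: boolean },
): AsyncGenerator<AuditEvent> {
  if (params.follow) {
    // Live mode: the long-running variant manages its own transaction snapshots.
    yield* audit.getAuditEventsLongRunning(params.topicPath, {
      seek: params.seek,
      seekEnd: params.seekEnd,
      limit: params.limit,
    });
  } else {
    // Snapshot mode: a single pass over what currently exists in the DB.
    yield* audit.getAuditEvents(params.topicPath, {
      seek: params.seek,
      seekEnd: params.seekEnd,
      limit: params.limit,
      order: 'asc',
    });
  }
}
```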

amydevs commented 10 months ago

I've got the js-rpc handlers working; I still need to write them into the spec.

The last thing left to do is to rework how metrics are captured with rolling averages, etc.

https://github.com/MatrixAI/Polykey-CLI/issues/40#issuecomment-1818090508

Otherwise, this should be ready to merge after a squash.