Closed: amydevs closed this issue 10 months ago
Some important terminology to avoid confusion here.
Now there are 2 kinds of things we want to observe:
For this issue, the audit domain is focused on user/action observability, not operational observability.
In that sense, we would want to:
As for operational logs/metrics: again, logs are not kept around; they go to STDERR. However, metrics can be kept somewhere in a separate area. There is a discussion about this here: https://github.com/MatrixAI/js-logger/issues/15. It makes sense that something else should maintain the state of operational observability, not the application itself. That way a focused system can specialise in operational observability. Usually this means something OpenTelemetry-based. Things like memory usage are a good place to start.
One question is whether something is operational or not. Consider tracking node connections: is this operational, or is it a user/application event? It's hard to draw a clear distinction here. For a network monitoring app it would be part of auditing; for us it is less so. I think, though, for the purposes of the testnet and mainnet dashboards, this is something we will need to track in the audit domain.
I was thinking that one needs to be able to have a streaming query.
So in some cases you can have a fixed snapshot query, which is the default case when going over a RocksDB snapshot.
In other cases you would want an AsyncIterable over all existing records and any new records that have entered. In this case you have an infinite iterator that never ends, unless the client decides to stop reading (by destroying it somehow).
To do this you may consider a cursor.
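The snapshot-then-live pattern with a cursor could be sketched as follows (a minimal sketch with illustrative names, using an in-memory store in place of RocksDB):

```typescript
// Sketch of a streaming query with a cursor (illustrative names, not
// Polykey's actual API). Existing records are yielded from a fixed
// snapshot; after that, the iterator waits indefinitely for new records,
// until the consumer stops it via `generator.return()`.

type AuditRecord = { id: number; data: string };

class RecordStore {
  protected records: Array<AuditRecord> = [];
  protected waiters: Array<() => void> = [];

  public add(data: string): AuditRecord {
    const record = { id: this.records.length + 1, data };
    this.records.push(record);
    // Wake any live iterators waiting on new records
    for (const wake of this.waiters.splice(0)) wake();
    return record;
  }

  // The cursor starts at `seek` (inclusive) and never ends on its own
  public async *stream(seek: number = 1): AsyncGenerator<AuditRecord> {
    let cursor = seek;
    while (true) {
      if (cursor <= this.records.length) {
        // Existing record (snapshot phase) or newly added record (live phase)
        yield this.records[cursor - 1];
        cursor++;
      } else {
        // No more records yet: suspend until `add` wakes us
        await new Promise<void>((resolve) => {
          this.waiters.push(resolve);
        });
      }
    }
  }
}
```

The consumer is responsible for ending the stream, e.g. by breaking out of a `for await` loop or calling `generator.return()`.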
We don't ever expect PK to have graphing libraries - definitely not in PK CLI; maybe in PK Desktop or PK Mobile, though it'd have to be extremely lightweight, as we don't want to bloat it up. But operational metrics will go to Grafana.
My go-to for visualisation is https://d3js.org/. It's pretty lightweight (280 kB) and only needs a canvas or SVG to render.
@CMCDragonkai should seeking with the cursor include the element with the id that was seeked to?
Yes, it's always inclusive. And if there is an "ending seek", you always do inclusive then exclusive. It's the Pythonic way; actually, it's standard math notation for range selection.
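As a concrete illustration of that convention (hypothetical `seek`/`seekEnd` values):

```typescript
// Range selection as described: the starting seek is inclusive and the
// ending seek is exclusive, i.e. the mathematical interval [start, end),
// matching Python's slice semantics.
const ids = [1, 2, 3, 4, 5];
const seek = 2; // inclusive: the seeked id itself is returned
const seekEnd = 4; // exclusive: id 4 is not returned
const selected = ids.filter((id) => id >= seek && id < seekEnd);
// selected is [2, 3]
```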
@amydevs I've added related issue #179 since it was already there; you should always do a quick search on the board. You should review that too, and close it if you can incorporate its tasks/spec into here.
getAuditEvents and getAuditEventsLongRunning need not be combined. The reason for this is that the transaction of getAuditEvents is parameterized, whilst getAuditEventsLongRunning takes multiple transaction snapshots to be able to support live data. Instead, this functionality will be combined in the associated RPC handler, where the transaction is abstracted away.
If you are streaming the results live, shouldn't you just abstract it all in the handler and only need 1 handler?
Yes, I've abstracted it so that only one handler is used. The one handler will appropriately switch between the long-running and normal versions of getAuditEvents.
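The switching logic could look roughly like this (a sketch with illustrative names; the real js-rpc handler API is more involved):

```typescript
// Sketch of the single-handler approach (illustrative; the real js-rpc
// handler API differs). One handler switches between the live and
// snapshot variants, so callers never deal with transactions directly.
type AuditLike = {
  getAuditEvents: (topicPath: Array<string>) => AsyncGenerator<string>;
  getAuditEventsLongRunning: (topicPath: Array<string>) => AsyncGenerator<string>;
};

function makeAuditEventsHandler(audit: AuditLike) {
  return async function* (params: {
    topicPath: Array<string>;
    live?: boolean;
  }): AsyncGenerator<string> {
    // Switch between the long-running and normal versions
    yield* params.live
      ? audit.getAuditEventsLongRunning(params.topicPath)
      : audit.getAuditEvents(params.topicPath);
  };
}
```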
I've got the js-rpc handlers working, still need to write them into the spec.
The last thing left to do is to rework how metrics are captured with rolling averages, etc.
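One hedged sketch of such a rolling average, assuming a simple exponentially weighted form (the actual rework may differ):

```typescript
// One way the rolling averages could be captured: an exponentially
// weighted moving average (EWMA), which needs only O(1) state per
// metric. This is a sketch, not the actual implementation.
class RollingAverage {
  protected average: number | null = null;
  public constructor(protected alpha: number = 0.5) {}
  public update(value: number): number {
    // The first sample initialises the average; later samples blend in
    // with weight `alpha`
    this.average =
      this.average == null
        ? value
        : this.alpha * value + (1 - this.alpha) * this.average;
    return this.average;
  }
}
```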
https://github.com/MatrixAI/Polykey-CLI/issues/40#issuecomment-1818090508
Otherwise, this should be ready to merge after a squash.
Specification

The audit domain of Polykey will be responsible for auditing the user behaviour of Polykey nodes. It should expose JS-RPC server-streaming handlers that yield audit events, and can also provide summary metrics of those audit events.

The subdomains of audit should be based on the domains of Polykey itself, and the audit domain should also contain Observable properties (see #444) derived from the events dispatched by each subdomain.

The JS-RPC API should be available via JS-WS, so that it is accessible from services like the mainnet/testnet status page (see #599). Furthermore, this JS-RPC API should replay all accumulated state for each metric upon initial opening of the server-streaming call, so that if any connected services were to restart, they would be able to get all the existing metric data, much like the rxjs shareReplay function.

Audit Events
Event Flow
Domain class instances will be injected as dependencies into the Audit domain. This means that the other domains will be able to expose any data they want to record via events, without any semantics regarding the Audit domain. The Audit domain can listen to these events and record them in the database.
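A minimal sketch of this injection-and-listen pattern, assuming EventTarget-style domains (all names are illustrative):

```typescript
// Sketch of the event flow (illustrative names; Polykey's actual event
// classes differ). A domain dispatches events with no knowledge of the
// Audit domain; Audit, holding the injected instance, listens and records.
class AuditEvent extends Event {
  public constructor(
    public topicPath: Array<string>,
    public data: unknown,
  ) {
    super('audit');
  }
}

class NodeConnectionDomain extends EventTarget {
  public connectForward(nodeId: string): void {
    // The domain only reports what happened; no Audit semantics here
    this.dispatchEvent(
      new AuditEvent(['node', 'connection', 'forward'], { nodeId }),
    );
  }
}

class Audit {
  public recorded: Array<{ topicPath: Array<string>; data: unknown }> = [];
  public constructor(domains: Array<EventTarget>) {
    // Domain instances are injected; Audit subscribes to their events
    for (const domain of domains) {
      domain.addEventListener('audit', (event) => {
        const { topicPath, data } = event as AuditEvent;
        // In Polykey this would be written to the database instead
        this.recorded.push({ topicPath, data });
      });
    }
  }
}
```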
Database Schema

Using js-db, there will be several levels for the Audit domain:

audit/ - the base level
audit/topic/{topicId} - the topics level
audit/events/{eventId} - the events level

eventIds will be made using IdSortable, so that they are completely monotonic. Furthermore, events can be accessed by iterating over a topic level (audit/topic/{topicId}), which yields multiple eventIds. These are used to reference the events stored in audit/events/{eventId}. By doing this, events are able to be a part of multiple topics as well.

Topics can be nested, meaning that querying the topic path ['node', 'connection'] will return all audit events from its children (['node', 'connection', 'reverse'] and ['node', 'connection', 'forward']).

API
The basic API will use an AsyncGenerator that yields events from a specified topic:
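The original code block was not preserved in this copy of the issue; a plausible shape for it, with an in-memory array standing in for a js-db transaction snapshot and all names illustrative, might be:

```typescript
// Illustrative sketch only; Polykey's real getAuditEvents signature may
// differ. The snapshot is modelled as a plain array instead of a js-db
// transaction.
type EventId = number;
type AuditEventRecord = { id: EventId; topicPath: Array<string>; data: unknown };

async function* getAuditEvents(
  snapshot: Array<AuditEventRecord>,
  topicPath: Array<string>,
  { seek = 0, limit = Infinity }: { seek?: EventId; limit?: number } = {},
): AsyncGenerator<AuditEventRecord> {
  let yielded = 0;
  for (const event of snapshot) {
    // A topic path matches itself and its children: ['node', 'connection']
    // also matches ['node', 'connection', 'forward']
    const matches = topicPath.every((part, i) => event.topicPath[i] === part);
    // `seek` is inclusive of the seeked EventId
    if (!matches || event.id < seek) continue;
    if (yielded >= limit) return;
    yield event;
    yielded += 1;
  }
}
```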
The options offer pagination, where the user can limit the number of audit events that the generator will yield, and then call the generator again with seek set to the EventId of the last element that was yielded.

The second API method yields events live as they are being dispatched:
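The elided code block for the live variant might look roughly like this (illustrative only; `pollMs` and the exclusive treatment of `seekEnd` are assumptions):

```typescript
// Illustrative sketch only; the real implementation differs. Rather than
// holding one transaction open, each pass takes a fresh snapshot (here, a
// re-read of an array) and yields anything newer than the last seen
// EventId, so new events are always delivered in ascending order.
type EventId = number;
type AuditEventRecord = { id: EventId; data: unknown };

async function* getAuditEventsLongRunning(
  readSnapshot: () => Array<AuditEventRecord>,
  { seekEnd = Infinity, pollMs = 10 }: { seekEnd?: EventId; pollMs?: number } = {},
): AsyncGenerator<AuditEventRecord> {
  let lastId = 0;
  while (true) {
    for (const event of readSnapshot()) {
      if (event.id <= lastId) continue;
      // `seekEnd` treated as exclusive ([start, end)); reaching it ends
      // the stream. Without it, this generator runs until the caller
      // invokes generator.return() or generator.throw().
      if (event.id >= seekEnd) return;
      lastId = event.id;
      yield event;
    }
    // Wait before taking the next snapshot
    await new Promise((resolve) => setTimeout(resolve, pollMs));
  }
}
```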
The reason why the parameters are different is that the iteration of new events, beyond what is currently stored within the DB, cannot be in any order other than chronologically ascending. Furthermore, as this method requires multiple DB transaction snapshots, there is no point in the caller passing in a transaction to perform on. Note that generator.return() or generator.throw() must be called on the returned AsyncGenerator when either seekEnd or limit is not specified, as the call will otherwise run indefinitely until throw or return is called, seekEnd or limit is reached, or audit.stop({ force: true }) is called.

Metrics
Metrics will need to be specced out further. Currently, metrics are indexed by a MetricPath, similar to AuditEvents. However, they are not stored in the DB, but rather derived from data within the DB.

API
The basic API returns a metric based on a topicPath, and allows for the input of seek and seekEnd to specify a timeframe for the metric results. Metrics will have to be implemented on a case-by-case basis.

Possibly Relevant Metrics
Some Specific Metrics Include:
Additional context
#237
#179 - an old issue about audit logging
Tasks
JS-RPC.