Joystream / atlas

Whitelabel consumer and publisher experience for Joystream
https://www.joystream.org
GNU General Public License v3.0

Idea: DAO Infrastructure Logging v2 #3980

Closed bedeho closed 1 year ago

bedeho commented 1 year ago

Background

Infrastructure logging, by which I mean logging failure events and metrics for interactions with DAO infrastructure, specifically Storage and Delivery nodes (not Orion), is currently very limited: there are in fact no metrics at all, and what is logged does not get distributed to the relevant DAO participants, specifically leads and/or the council. This means that when there is a problem, either system wide or with an individual provider, the DAO authorities have no automated way of detecting it.

Proposal

This proposal has end users use Orion v2 as a proxy for errors. This seems better for privacy, and it also means we can sidestep the need for complex authentication schemes for user requests on the logging side. I assume the overall volume of logging here will be quite low, so the cost of proxying will hopefully be a non-issue, at least for the medium term.
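
As a rough illustration (the endpoint path and every field name below are assumptions, not part of the proposal), the Atlas-side reporting through the Orion proxy could look something like this:

```typescript
// Hypothetical shape of a DAO-infrastructure error event reported by Atlas.
// The endpoint path and field names are placeholders, not an agreed schema.
interface InfraErrorEvent {
  timestamp: string // ISO 8601
  nodeType: 'storage' | 'distributor'
  nodeId: string // on-chain bucket/worker id
  kind: 'connection-failed' | 'timeout' | 'bad-response'
  httpStatus?: number
  assetId?: string
}

// Atlas fires this and forgets; Orion acts as the proxy, so no separate
// authentication scheme is needed on the logging side.
async function reportInfraError(orionUrl: string, event: InfraErrorEvent): Promise<void> {
  try {
    await fetch(`${orionUrl}/infra-log/errors`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(event),
    })
  } catch {
    // never let logging failures affect the user-facing app
  }
}
```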

Sentry

We already have a Sentry logging service that is for the benefit of the operator; my suggestion is that we leave it intact and separate for now. It can be used for application-experience-specific logging, not DAO infrastructure logging.

Steps

We should also probably ask current infra leads what the best way is for them to receive such logging data.

  1. Peer review from @attemka and @Lezek123
  2. Propose a new set of specific events
  3. Propose a new set of specific metrics
  4. Propose the Orion v2 API & Lead API. It is extremely important to specify this as a standalone OpenAPI or GraphQL schema that can serve as a shared standard; otherwise we risk leads building proprietary, undocumented APIs for their own private data silos and tools, which would be very bad. (See the sketch after this list.)
  5. Propose a configuration format
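
To illustrate what step 4 is asking for, here is a rough sketch in TypeScript types rather than a finished OpenAPI/GraphQL document; every event, metric and field name below is a placeholder, not a proposed standard:

```typescript
// Placeholder vocabulary only; the real standard would live in a standalone
// OpenAPI or GraphQL schema shared by Orion and the lead-side API.
type NodeRef = { nodeType: 'storage' | 'distributor'; nodeId: string }

type InfraEvent =
  | ({ kind: 'asset-fetch-failed'; assetId: string; httpStatus?: number } & NodeRef)
  | ({ kind: 'node-unreachable' } & NodeRef)

type InfraMetric =
  | ({ kind: 'time-to-first-byte-ms'; value: number } & NodeRef)
  | ({ kind: 'asset-download-time-ms'; value: number; assetSizeBytes: number } & NodeRef)

// One envelope for both APIs (app -> Orion, Orion -> lead), so no lead has
// a reason to invent a private, undocumented format.
interface LogEnvelope {
  reportedAt: string // ISO 8601
  payload: InfraEvent | InfraMetric
}
```
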
traumschule commented 1 year ago

SWG uses Prometheus, DWG the default Elastic config. Each Atlas reports to a different orionX instance, which is not accessible/known to leads. The DAO should provide their own logging server and ask GW ops to configure their atlas to report to the DAO ES instance. Pioneer needs a page to surface ES logs.

It is clear that in the future not all consumers will use atlas.

bedeho commented 1 year ago

The DAO should provide their own logging server and ask GW ops to configure their atlas to report to the DAO ES instance.

I have no opinion on whether it is ES specifically, but yes, ultimately logging in app backends (like Orion) must be pooled into some shared DAO data warehouse for analysis and review.

Pioneer needs a page to surface ES logs.

ES logs have no place in Pioneer; that is not what it's for.

It is clear that in the future not all consumers will use atlas.

Agree, but starting in Atlas makes sense: whatever API, events and metrics we start with there can be adopted by any other app that wants to, and Orion can support the same API for any other app that also uses Orion.

bedeho commented 1 year ago

It struck me that one way we could sidestep

  1. the possibility of write deadlocking between processor and logging, causing processing and logging delays
  2. the need to make new dashboards for the operator to view their local metrics and data
  3. the need to write new mappings in Orion

would be if logging did not actually go into Orion, but instead into a separate, operator-specific Elasticsearch backend. The ability to do this would depend on:

a. whether it would be possible for the Elasticsearch instance to only accept data from users authenticated by Orion, as we don't want anyone to be able to submit anything. This would actually not even be strictly needed for v1 of this scheme, but knowing it was feasible would be very useful.
b. whether it would be possible to run some process, either as part of the ES instance or external to it, which would relay the relevant information to the lead instance when appropriate (see the sketch below). This would be needed even if Orion v2 were used. It would not surprise me if there were tooling in the ES ecosystem for this sort of relay, but I don't know.
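
For point (b), a minimal sketch of such a relay, assuming a plain Elasticsearch REST endpoint, an `infra-logs` index and a lead ingest URL (all placeholders):

```typescript
// Minimal relay sketch: periodically pull recent documents from the
// operator's Elasticsearch index and forward them to the lead's logging
// endpoint. Index name, endpoint URL and (absent) auth are assumptions.
const ES_URL = 'http://localhost:9200'
const INDEX = 'infra-logs'
const LEAD_ENDPOINT = 'https://lead-logging.example.org/ingest'

async function relayRecentLogs(sinceIso: string): Promise<void> {
  // Standard Elasticsearch _search request over the REST API.
  const res = await fetch(`${ES_URL}/${INDEX}/_search`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      size: 1000,
      query: { range: { '@timestamp': { gt: sinceIso } } },
      sort: [{ '@timestamp': 'asc' }],
    }),
  })
  const { hits } = await res.json()
  const docs = hits.hits.map((h: { _source: unknown }) => h._source)
  if (docs.length === 0) return

  // Forward the batch to the lead instance.
  await fetch(LEAD_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(docs),
  })
}
```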

If we did it this way

What do you think?

attemka commented 1 year ago

@bedeho thank you for this, this is an interesting idea, as it removes the unnecessary "middle-man" Orion. We can set up a custom authorisation realm in ES, which means that communication between Orion and the distribution ES will be only for auth purposes, and the rest will happen directly between the storage\distribution bucket and the distribution ES. Sounds great, I'll discuss it with @Lezek123 tomorrow.
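
As an illustration of the "Orion only for auth" split described above (not necessarily via a custom realm), Orion could mint short-lived Elasticsearch API keys scoped to the logging index; a hedged sketch with placeholder index, role and TTL:

```typescript
// Assumes Orion holds admin credentials for the operator's ES instance and
// that clients then write to the logging index with the returned key.
// Index name, role name and expiration are placeholders.
async function mintLoggingApiKey(esUrl: string, adminBasicAuth: string): Promise<string> {
  const res = await fetch(`${esUrl}/_security/api_key`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Basic ${adminBasicAuth}`,
    },
    body: JSON.stringify({
      name: 'atlas-infra-logging',
      expiration: '1h',
      role_descriptors: {
        'infra-log-writer': {
          indices: [{ names: ['infra-logs'], privileges: ['create_doc'] }],
        },
      },
    }),
  })
  const { encoded } = await res.json()
  // The client then submits documents with: Authorization: ApiKey <encoded>
  return encoded
}
```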

bedeho commented 1 year ago

Awesome, and as I said, we only really need to know that this is possible in a natural way; we can ship without it, as user accounts will in any case take more time to go into production, since they require other work in design and in Orion and Atlas engineering. Also, at this early stage, no one has a reason to send fake data to app backends.

bedeho commented 1 year ago

It occurred to me that this logging probably should have attestation, so that the data can be used for making decisions about sanctions or rewards for different actors. This means all messages coming from an app operator and being sent to a lead logging service must be signed using some suitable operator key. It may be fine for the operator not to actually check the signatures during normal operation and just store everything, but later, for forensic purposes, they would be available.
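
A minimal sketch of such attestation, assuming Ed25519 via Node's built-in crypto (if operator keys are Substrate sr25519 keys, the signing primitive would differ); the message shape and names are illustrative:

```typescript
import { sign, verify, KeyObject } from 'node:crypto'

// Illustrative only: uses Node's built-in Ed25519 signing. A real setup
// would use whatever key type app operators actually hold.
interface SignedLogMessage {
  payload: string // canonical JSON of the log entry
  operatorId: string
  signature: string // base64
}

function signLogMessage(payload: object, operatorId: string, privateKey: KeyObject): SignedLogMessage {
  const canonical = JSON.stringify(payload)
  const signature = sign(null, Buffer.from(canonical), privateKey).toString('base64')
  return { payload: canonical, operatorId, signature }
}

// The lead tool can store messages as-is and only verify signatures later,
// when they are needed for forensic review of sanctions/rewards.
function verifyLogMessage(msg: SignedLogMessage, publicKey: KeyObject): boolean {
  return verify(null, Buffer.from(msg.payload), publicKey, Buffer.from(msg.signature, 'base64'))
}
```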

attemka commented 1 year ago

@bedeho this is how I see the logs architecture. The goal was to reduce the amount of info proxied through orion and exchange the logs directly between the atlas\buckets and distribution lead logging instance.

[Screenshot, 2023-04-17: proposed logs architecture diagram]

bedeho commented 1 year ago

Thank you for providing this, but unfortunately I don't think it covers most of what I'm looking to understand, and it has a lot of content which is not sufficiently clear just from the diagram.

  1. It is currently not specified here, although we have talked about this before: where is this data being added to Orion, and where is it being added to the "lead logging tool"? In the beginning we talked about putting logging data literally in the Orion dbase, but then we talked about the ELK stack. I don't know what was concluded on the lead side. Please state exactly what sort of infra this is for accepting messages and storing them.
  2. Can you write an explicit list of what messages Atlas sends to the Orion logger, and also explain how authentication works, if at all, when submitting such messages?
  3. Can you write an explicit list of what messages the Orion logger sends to the lead tool?
  4. Can you write exactly by what process Orion takes the messages that arrive in its dbase and generates messages for the lead logger? How does this work? It's this process that generates privacy, as it turns user messages into public messages that leads and others can see.
  5. There seems to be a lot of data in the diagram which just pertains to the sort of generic data nodes pass to each other, like files, like nodes with assets. Unless this information is relevant to explaining the logging architecture, I would leave it out; I don't know what its presence here means currently.
  6. What are these blacklists being mentioned?
  7. Why are you saying the lead logging tool sends blacklists to the Orion logger? That doesn't make sense; in fact I'm confused about why the lead logger is sending any data at all.
  8. Why are storage and distributor data being combined into one lead dbase? Shouldn't we at least delay worrying about storage at all, let alone combining it into one data store?

exchange the logs directly between the atlas\buckets and distribution lead logging instance.

I did not understand what this meant.

attemka commented 1 year ago

@bedeho

It is currently not specified here, although we have talked about this before: where is this data being added to Orion, and where is it being added to the "lead logging tool"? In the beginning we talked about putting logging data literally in the Orion dbase, but then we talked about the ELK stack. I don't know what was concluded on the lead side. Please state exactly what sort of infra this is for accepting messages and storing them.

I prefer using some tool on the Distribution lead's (DL further on) side; my first attempt will be to use ELK, because it provides flexibility for user auth, logs usage, etc.

Can you write an explicit list of what messages Atlas sends to the Orion logger, and also explain how authentication works, if at all, when submitting such messages?

Atlas will communicate with both Orion and ELK (or a replacement tool if there are implementation issues, but I'll call it ELK from here on for simplicity), but the data provided to these two instances will be different (as on the screenshot). On the Orion side we mostly need to receive data that helps Orion adjust the whitelist\blacklist of nodes for the user and properly generate the list of assets for the user. Most of the information about download times etc. will be gathered by ELK, in order to provide as much data as possible for the distribution lead; ELK will then share it with Orion if needed. Regarding authorisation: no extra layer is needed between Orion and Atlas, and all communication with ELK will be done through a custom relay implemented on the Orion side. The messages planned to be sent from Atlas to Orion are mentioned in the scheme: node-status, used to report an individual user's connection issues and generate a personal blacklist, and user-geolocation, used to get the user's location in order to pick the optimal provider (which doesn't mean the closest one will be picked).
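
For concreteness, a possible TypeScript shape for the two Atlas → Orion messages named above; only the two message kinds come from the comment, the field names are guesses:

```typescript
// Field names are guesses; only the two message kinds appear in the comment above.
interface NodeStatusReport {
  kind: 'node-status'
  nodeId: string
  reachable: boolean // false => candidate for the user's personal blacklist
  httpStatus?: number
  measuredAt: string // ISO 8601
}

interface UserGeolocationReport {
  kind: 'user-geolocation'
  countryCode: string // coarse location, used to pick a suitable provider
  region?: string
}

type AtlasToOrionMessage = NodeStatusReport | UserGeolocationReport
```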

Can you write an explicit list of what messages the Orion logger sends to the lead tool?

I'm not sure this will be the final version of the list, but for now those metrics are:

- bucket-id - which bucket was used
- time-to-interaction - how much time passed from the user's request to the actual start of the download (ping + bucket system load)
- asset-download-time - how long it took to download the asset; shows the connection\speed quality for an individual user
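
A possible shape for these three metrics as a single message; everything beyond the three listed names is an assumption:

```typescript
// One possible shape for the three metrics listed above; names beyond
// bucket-id / time-to-interaction / asset-download-time are assumptions.
interface DistributionMetric {
  bucketId: string // which bucket served the request
  timeToInteractionMs: number // user request -> download actually starts (ping + bucket load)
  assetDownloadTimeMs: number // full download time, reflects per-user connection quality
  assetId: string
  recordedAt: string // ISO 8601
}
```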

Can you write exactly by what process Orion takes the messages that arrive in its dbase and generates messages for the lead logger? How does this work? It's this process that generates privacy, as it turns user messages into public messages that leads and others can see.

If I got this correctly: Orion in this logic will only receive the individual user's statistics, e.g. if some users can't connect to a certain node. It'll then regenerate the node list for them and share the statistics with the DL node.

There seems to be a lot of data in the diagram which just pertains to the sort of generic data nodes pass to each other, like files, like nodes with assets. Unless this information is relevant to explaining the logging architecture, I would leave it out; I don't know what its presence here means currently.

Do you mean the diagram I posted above? If so, could you please point me to which data you are referring to?

What are these blacklists being mentioned?

If a node is unavailable for all users (e.g. shut down, or just returning 404\500 etc.), it'll appear in the DL's node blacklist. The DL should be able to track those nodes to resolve the issues, and these node ids will also be shared with Orion, in order to exclude these nodes from everyone's list. Also, there could be a situation when a certain user can't access some certain node (provider\DNS issues etc.) while others can. In this case, the nodeId will be passed from Atlas to Orion, so Orion will form an individual user's blacklist, which will contain both the nodes that nobody can access and the nodes that this individual user can't access.
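
A small sketch of the merge described above, i.e. the per-user node list excluding both the global blacklist and that user's personal blacklist (names are illustrative):

```typescript
// Sketch of the merge described above: the list handed to a user excludes
// both globally dead nodes and nodes only that user cannot reach.
function nodesForUser(
  allNodeIds: string[],
  globalBlacklist: Set<string>, // nodes nobody can reach (shared by the DL with Orion)
  userBlacklist: Set<string> // nodes this particular user reported as unreachable
): string[] {
  return allNodeIds.filter((id) => !globalBlacklist.has(id) && !userBlacklist.has(id))
}
```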

Why are you saying the lead logging tool sends blacklists to the Orion logger? That doesn't make sense; in fact I'm confused about why the lead logger is sending any data at all.

The goal was not to use Orion as a proxy hub for all the nodes, so most of the info\logs are shared directly between instances. The DL's node will still need to gather all the info from the buckets and will get the current status of the nodes. Otherwise, Orion would be used as a proxy hub between Atlas, the DL node and the buckets, which is also an option, but it would raise the request volume and the load.

Why are storage and distributor data being combined into one lead dbase? Shouldn't we at least delay worrying about storage at all, let alone combining it into one data store?

This can be postponed; however, most of the logic would be shared between the distribution buckets and the storage buckets, so rolling this out to all of them allows gathering statistics and info for all of them.

exchange the logs directly between the atlas\buckets and distribution lead logging instance.

See the blacklists answer

bedeho commented 1 year ago

On the Orion side we mostly need to receive data that helps Orion adjust the whitelist\blacklist of nodes for the user and

  1. Even if this is a good idea, isn't it better to skip it for now? The goal here was not to make Orion's asset resolution logic dynamic, which is its own complex problem; it was just to log data.
  2. Why send data to two places for this purpose? It seems more sensible to put all data in one place, and then have a distinct, independent mechanism for turning data into decisions about how to resolve assets. That logic does not itself need to live inside Orion, as that makes Orion more complex and less customisable when there is no direct need for it. One could even imagine some control information coming from the lead to inform this in the future, rather than it being based purely on local visitor information in one Orion instance.

Regarding authorisation: no extra layer is needed between Orion and Atlas, and all communication with ELK will be done through a custom relay implemented on the Orion side.

What does this mean? Does it mean all logging data first comes into Orion, and then Orion sends it into ELK if the token is valid? Does it mean some new API endpoint for logging is being added to Orion?

I'm not sure this will be the final version of the list, but for now those metrics are

I don't see anything in this list which obviously covers the nuances of the various errors, nor is it clear to me when these messages are sent. Obviously, for example, asset-download-time is not sent for every single asset ever downloaded? That would put a huge load on the server. It is also not clear to me which messages are only sent from app => app backend, and which are sent from app backend to lead backend.

This was one of the first things asked for in terms of what should be isolated; it is important and easy to define, easier than many of the other design decisions. Can we please expedite this before anything else?

I also think we should make it concrete what transport is being used here: are these distinct HTTP requests, or is this using websockets, or something else? This has implications for what kind of message policy is feasible to support without degrading performance.
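
One way to keep per-asset metrics from overloading the backend would be client-side sampling and batching over plain HTTP; the sample rate, batch size and endpoint below are arbitrary placeholders:

```typescript
// Client-side sampling + batching so asset-download-time is not sent for
// every single download. Sample rate, batch size and endpoint are arbitrary.
const SAMPLE_RATE = 0.05
const MAX_BATCH = 50
const buffer: object[] = []

function recordDownloadTime(metric: { assetId: string; bucketId: string; ms: number }): void {
  if (Math.random() > SAMPLE_RATE) return
  buffer.push({ ...metric, recordedAt: new Date().toISOString() })
  if (buffer.length >= MAX_BATCH) void flush()
}

async function flush(): Promise<void> {
  if (buffer.length === 0) return
  const batch = buffer.splice(0, buffer.length)
  // One HTTP request per batch; websockets would only pay off at much
  // higher message rates.
  await fetch('/infra-log/metrics', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(batch),
  }).catch(() => undefined)
}
```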

If I got this correctly: Orion in this logic will only receive the individual user's statistics, e.g. if some users can't connect to a certain node. It'll then regenerate the node list for them and share the statistics with the DL node.

No, the lead also needs pretty much the same data, otherwise the human Orion operator and the human lead will have to start talking to each other about reporting various distributors being slow or failing. Again, dynamic node list generation is actually 100% orthogonal to what this feature is about at its core, which is generating information for human operators. The key point here is that the lead cannot receive personal metrics, like "User 12 tried to watch a video about cooking from distributor 5, but it never loaded", because this leaks privacy. So the lead must have anonymised information only, and probably only aggregate metrics, not individual sessions. I am not sure we are in sync on this.
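
A sketch of the kind of anonymisation this implies: individual sessions are collapsed into per-distributor aggregates before anything is forwarded to the lead; all names are illustrative:

```typescript
// Collapse individual sessions into per-distributor aggregates so the lead
// sees "distributor 5 is slow/failing" without any per-user detail.
interface SessionSample { distributorId: string; downloadMs: number; failed: boolean }

interface DistributorAggregate {
  distributorId: string
  samples: number
  failureRate: number
  medianDownloadMs: number
}

function aggregate(samples: SessionSample[]): DistributorAggregate[] {
  const byNode = new Map<string, SessionSample[]>()
  for (const s of samples) {
    const list = byNode.get(s.distributorId) ?? []
    list.push(s)
    byNode.set(s.distributorId, list)
  }
  return [...byNode.entries()].map(([distributorId, list]) => {
    const times = list.map((s) => s.downloadMs).sort((a, b) => a - b)
    return {
      distributorId,
      samples: list.length,
      failureRate: list.filter((s) => s.failed).length / list.length,
      medianDownloadMs: times[Math.floor(times.length / 2)],
    }
  })
}
```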

Also, there could be a situation when a certain user can't access some certain node (provider\DNS issues etc.) while others can. In this case, the nodeId will be passed from Atlas to Orion, so Orion will form an individual user's blacklist, which will contain both the nodes that nobody can access and the nodes that this individual user can't access.

I don't think it is advisable to attempt this now; it is complex, both the global and the user-specific blacklist. I also believe it is redundant, at least the global one: there is an actual on-chain notion of marking distributors. If that is missing, we had better add it, because having information about the same distributor bifurcated between on-chain (metadata, host) and off-chain (blacklisted or not) is a bad idea. Someone looking directly at this data, e.g. for another app that does not use Orion, will be blind to it.

I don't know how you plan to maintain a user specific blacklist in an automated way?


We need to ship this before user accounts, hence the first implementation of this cannot depend on having user accounts, whether ephemeral or memberships, for it to work.

bedeho commented 1 year ago

We had a meeting on the topic, here is the Miro board: https://miro.com/app/board/uXjVMO5SYHE=/

traumschule commented 1 year ago

https://www.elastic.co/what-is/elk-stack