elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Tracking users in Kibana like a "black box" for retrospective analysis #33874

Closed. crayzeigh closed this issue 3 years ago.

crayzeigh commented 5 years ago

This comes from the email thread announcing "Tracking user interaction telemetry in Kibana", where I shared some ideas that started in a conversation with John Allspaw around dealing with Human Factors in incident response retrospectives in particular.

@cjcenizal mentioned that a GitHub issue would be a good place to track these ideas, so here it is.

Description of Features

Preserving the original conversation for posterity:

Originally, from @crayzeigh:

Tracking how a user interacts with the software is a big deal for event retrospectives as well. I had a good conversation with John Allspaw to this effect at the end of last year.

So for an observability use case, the more we know about how an engineer interacts with Kibana in order to troubleshoot an issue, the better we'd be able to recreate the events of an incident response in order to evaluate it retrospectively.

Cool stuff!

Reply from @cjcenizal:

Aaron, this is a really interesting idea. So you're saying it'd be useful to have a "troubleshooting log" of each action an engineer took when troubleshooting a problem. If we were to create these logs continuously, it would result in a glut of data, most of which we would never use, so it seems like it would be best if engineers had an option to start and then stop recording a troubleshooting log. Would you want these logs automatically uploaded somewhere? Or would you just want to give the engineer the option to export the log and then they could email it to someone?

Finally from @crayzeigh:

Oo, I didn't think of a start-stop way to handle the glut of data that would otherwise be generated... that's a nice idea.

I think uploading? It's hard to say precisely, as John has a lot more experience with it, and I'm not sure it matters, largely. The idea is to have a sort of Troubleshooting Black Box, the same way an airplane has a Black Box to record everything that occurs (both user input and instrument readings) during a flight in order to re-create the moments of trouble for retrospection. So even just what dashboards a user loads and what queries they run (including failed and mistyped queries), along with the results, would be saved in this Black Box log.

When reviewing events you'd now have a new event timeline with real timestamps and real actions (not relying on remembered actions, times, or motivations, which are almost always wrong) to lay alongside all the other events occurring during an outage. Really valuable.
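
To make the start/stop idea and the kinds of events worth capturing a little more concrete, here is a minimal TypeScript sketch. It is only an illustration of the proposal as discussed above; TroubleshootingRecorder, BlackBoxEntry, and all field names are hypothetical and do not correspond to any existing Kibana API.

```ts
// Hypothetical sketch only: models a start/stop "troubleshooting black box"
// that records dashboard loads and queries (including failures) with results.
type BlackBoxEntry =
  | { kind: 'dashboard_loaded'; timestamp: string; dashboardId: string }
  | {
      kind: 'query_run';
      timestamp: string;
      query: string;
      succeeded: boolean;
      hitCount?: number;
      error?: string;
    };

class TroubleshootingRecorder {
  private entries: BlackBoxEntry[] = [];
  private recording = false;

  start(): void {
    this.recording = true;
  }

  stop(): void {
    this.recording = false;
  }

  record(entry: BlackBoxEntry): void {
    // Only keep events while an engineer has explicitly started a session,
    // which addresses the "glut of data" concern raised above.
    if (this.recording) {
      this.entries.push(entry);
    }
  }

  // Export the session so it can be emailed or uploaded for a retrospective.
  export(): string {
    return JSON.stringify(this.entries, null, 2);
  }
}

// Example usage during an incident:
const recorder = new TroubleshootingRecorder();
recorder.start();
recorder.record({
  kind: 'dashboard_loaded',
  timestamp: new Date().toISOString(),
  dashboardId: 'ops-overview',
});
recorder.record({
  kind: 'query_run',
  timestamp: new Date().toISOString(),
  query: 'status:500 AND service:checkout',
  succeeded: true,
  hitCount: 42,
});
recorder.stop();
console.log(recorder.export());
```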

jallspaw commented 5 years ago

Thanks folks - yep this captures the rough gist of what I was mentioning. Certainly, the "glut" is a real concern and the turning on/off would be useful indeed. Alternatively, only keeping (configurable?) n days of the log is also a reasonable thing to do.

This (the dashboards a user loads, the queries they run, and the results) is likely to be enough to help reconstruct an event.

FWIW, I'm going to be pestering other telemetry/monitoring/etc projects to add something like this as well, because the data is (from a Human Factors perspective) critically valuable. :) Thanks!
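
As a hedged sketch of the configurable n-day retention idea above: the pruneOldEntries helper and retentionDays parameter below are assumptions made for illustration, not existing Kibana settings.

```ts
// Hypothetical retention helper: drop black-box entries older than n days.
interface TimestampedEntry {
  timestamp: string; // ISO 8601
}

function pruneOldEntries<T extends TimestampedEntry>(
  entries: T[],
  retentionDays: number
): T[] {
  const cutoff = Date.now() - retentionDays * 24 * 60 * 60 * 1000;
  return entries.filter((e) => Date.parse(e.timestamp) >= cutoff);
}

// Example: keep only the last 14 days of entries.
// const recent = pruneOldEntries(allEntries, 14);
```
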
elasticmachine commented 5 years ago

Pinging @elastic/kibana-platform

LeeDr commented 5 years ago

/cc @Bamieh

I'm adding "audit logging" as a possible search keyword.

skearns64 commented 5 years ago

To me, this feels like an extension of, or a use of, "full" audit logging in Kibana. I like thinking of it that way because it hits a number of important features on the way to something like this, and if we build it this way, we set ourselves up to address things like retention history, etc. in a consistent way, and this becomes just another thing we can build on top of it.

jeremyarose commented 5 years ago

Hey folks, I love the idea of doing whatever we can to improve the user experience. Please keep in mind though that any time we're talking about collecting/tracking/monitoring user actions, we need to at least notify the user of what we're doing, and ideally give them the opportunity to opt-out. This is crucial to our privacy by design UX. cc: @legalastic

carlspataro commented 5 years ago

We'd need a degree of precision as to: (i) why we are collecting it; (ii) the extent to which we would retain an ability to reference it back to the individual (i.e., can we or should we anonymize it); (iii) who has access to it; and (iv) how long we retain it.

In the black box analogy, the flight crew are employees of the airline and they know their information is being recorded for a specific (hopefully never needed) purpose. If we are looking at recording customer support calls, we'd need some pretty specific notice/consent; the situation is not quite the same. The customers don't work for us, and they would not know (in the absence of decent notice) why we are recording the info.

crayzeigh commented 5 years ago

In my use case (I think we've decided to go with the "audit log" moniker), it would be something that the customer enables or disables on their own, and all data stays private to that customer and is not sent anywhere else.

The intention is to use the data in a post-incident analysis.

9ASHIPU4Q0V commented 5 years ago

This is a knotty theme that is not likely to be easily resolved.

1) Current situation: Extensive and even aggressive logging of worker behavior is already taking place. Companies provide computational resources and require all sorts of worker agreement and acknowledgement of responsibilities and limitations in the use of those resources, as well as explicitly and implicitly indicating that all use of those resources is owned by and accessible to the company and its agents.

2) Terminology: The terms used in discussion are, to some degree, polarizing. Describing some commands as emerging from 'users', 'operators', or 'workers' evokes quite different responses. We are all 'users', and the discrepancy between our expectations of privacy as individuals and the way the internet world behaves is jarring. Describing the command maker as an 'operator' implies something different, as does 'worker' or 'employee'.

3) Existing range of monitoring: It will hardly come as a surprise that the degree to which individual command-emitter monitoring occurs varies a lot. Those working in banking or finance have a very different environment than those in, for example, social media. In some settings, notably national security, revealing that individual keystroke recording is going on doesn't raise an eyebrow.

4) Influences on behaviors: Existing approaches to monitoring and recording already have substantial influence on the architecture of systems and the behavior of those operating them. Companies have record retention policies that manifest their concerns here. Boundaries between who is 'inside' and who is 'outside' are seldom as crisp as security and legal agents would desire. More importantly, people already modify their behaviors in expectation of monitoring and recording. When we explore real incidents we find supposedly 'private' channels, sidebar conversations, telephone messaging, and modified language that reflects the desire to keep these communications out of view. Interestingly, this seems to be more common as we ascend the management chain. [We speculate that this is because communication by technical workers takes place mainly in order to fix the problem, while the senior managerial exchanges are directed towards other concerns.] The point is that people already incorporate some expectations about the visibility of their behaviors into their behaviors.

5) Limitations of analogs: The notion that a "black box" recording will resolve the issues around human and machine performance recurs across different domains. It has been seriously proposed, for example, for hospital operating rooms. Such recording devices are common in transportation settings and present in commercial aircraft, many trains, and, increasingly, individual vehicles. These are increasingly "computer systems that happen to move" (cf. Csete & Doyle, Reverse Engineering of Biological Complexity, Science, 2002). The value of these devices in reconstructing both control inputs from humans and the behavior of the computational/mechanical device itself is well demonstrated. It is important to keep in mind that this value depends on a thorough, well-grounded understanding of the detailed engineering characteristics present, and on an elaborate, expensive, and well-developed human and technical system capable of forensic examination of the recordings. This system depends on the relatively static character of the engineered system being monitored. Current business operations in even the most advanced internet-facing systems rarely (never?) have these qualities. The analogy between what is being proposed here and a "black box" is probably more misleading than illuminating.

6) Inference guaranteed! The absence of detailed recording of commands reaching a computer does not keep later observers from making inferences about the actions and intentions of the command emitter. Post-event evaluations of human performance are made even with very thin evidence. The weight of these evaluations increases with the significance of the event. Attention to performance is low when it doesn't matter, but failures of many internet-facing systems today have important consequences. Taking human performance seriously in these settings depends on the ability to understand the activities above and below the line of representation, and the need to do this is increasing.

yaronp68 commented 5 years ago

Hey folks, I love the idea of doing whatever we can to improve the user experience. Please keep in mind though that any time we're talking about collecting/tracking/monitoring user actions, we need to at least notify the user of what we're doing, and ideally give them the opportunity to opt-out. This is crucial to our privacy by design UX. cc: @legalastic

We provide a telemetry data preview when the user is asked to opt in. This was the telemetry behaviour during the entire 6.x line and didn't change moving forward.

Having said that, we need to check how much information the preview contains when a user looks at it.