[2014] Discussion: ChatOps

jfryman commented 10 years ago

Hi Everyone! I am super stoked to be here working with each of you. The goals of StackStorm are very much near and dear to my heart, and I am very much looking forward to helping make this an amazing and transformational product for many IT departments over the coming months/years.

To that end, I am jotting down some initial thoughts on the topics that I think will set us apart from any competition, and really usher in the Third Wave of Automation (tm) as Evan and Dmitri have been branding our solution. At the very least, it’ll be a way to share where my head is currently at while we talk about priorities and direction.

ChatOps

I :heart: ChatOps. I honestly cannot say enough amazing things about it. If ChatOps as a concept is relatively new to you, I encourage you to take a look at a few resources. If you’re into the video thing, there is a great chat about ChatOps done by my colleague Mark Imbriaco: https://www.youtube.com/watch?v=pCVvYCjvoZI. If you’re into the ‘read a wall of text game’, then take a look at this: http://puppetlabs.com/blog/really-building-data-driven-infrastructure

I have spent the last decade working and consulting in various capacities to bring automation practices to large and small companies alike. Without a doubt, the single biggest barrier to implementing any meaningful change is cultural in nature. This comes in two flavors:

- We do not trust the machines to do full-stack automation.
- There is too much to automate, and the mountain of work incapacitates any decision making.

There are also a ton of subtle undertones that exist there, everything from admins being afraid of losing their jobs to a process-heavy culture that feels unable to take advantage of automation for fear of disrupting delivery. However, what we are trying to

ChatOps becomes the great unifier. ChatOps is more than a technology. It’s a new way to communicate how work gets done in an organization. Using ChatOps in an organization exposes to users at all levels what is actually happening behind the scenes. This is important because the first barrier to a good automation implementation is exposing the myth that ‘computers are magical’ and making it meaningless. With ChatOps, Operations becomes significantly more transparent to all parties, and actual conversation can happen about business delivery as opposed to debating how the sausage is made.

The long-term benefit of ChatOps is that after some amount of time, the same commands that are executed over and over again begin to inspire trust. This is the critical inflection point that we are aiming for. Once this happens, these commands can start to be moved to the actual goal of StackStorm - fully automated and reactive operations.

I firmly believe that ChatOps is the bridge to the end-state. It is the tool that allows the real change that needs to happen - the cultural change. At the core, StackStorm is a great tool to be a part of the Third Wave, but not every company/IT Professional is ready for the end state. This is why ChatOps will set us apart. Not only are we helping enable the real end-game, we’re outlining a path for how to actually get there.

ChatOps is our bottom-up strategy. ChatOps excites technology workers by giving them tools to spread operational load to the people that need it the most. It becomes the vessel where the operator begins training the robot as opposed to being the person.

Another shameless plug: I have these conversations all the time with folks that are still unsure how DevOps will help them. I talk about all of these concepts in my recent talk in Berlin: http://vimeo.com/110484640. We have a huge battle to fight in the hearts and minds of IT Departments, because the hardest problem we face will not be the technology stack.

Targets to Attack

ChatOps needs to be a first-class citizen in our tool. To me, this means a few things:

Solid Bot integrations: We need to make the popular chat bots pretty much plug-and-play for new users. I think we can limit the scope a bit, but at the very least, this means:
- Hubot (CoffeeScript)
- Lita (Ruby)
- Err (Python)
Solid Chat Integrations: Part of having a good Bot workflow is having good awareness of where the conversations are happening. Each chat room has its own nuances of how to interface. Again, to me this is:
- HipChat
- Slack
- IRC
- Lync/Skype
ChatOps attributes built into every workflow:
- Toggle whether a workflow can be executed in ChatOps. Not every workflow should be ChatOps'd
- Ensure that ChatOps output is possible without too much effort. Workflows should be able to simply post information to Chat, as opposed to only responding to whoever executed them
- Limit where ChatOps can be run. Users often need to be able to assert that certain ChatOps commands are executed only in specific places. This keeps visibility high.
- RBAC. The hard part, but going to be necessary for many enterprise clients. (LDAP, OAuth, or other authentication integrations)
Surprise and Delight! The way we get traction here is to make the experience magical out of the box. We should have common stories for some basic operational tasks:
- App Deployment
- Monitoring Management (Nagios/Sensu/etc)
- Graph Viewing (Graphite/OpenTSDB/Librato)
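The per-workflow ChatOps attributes listed above (an on/off toggle, a limit on where commands may run, and role checks) could be modeled roughly as follows. This is only a minimal sketch: `ChatOpsPolicy`, `can_run_from_chat`, and the room and role names are hypothetical and are not part of StackStorm or any bot framework.

```python
from dataclasses import dataclass, field


@dataclass
class ChatOpsPolicy:
    """Hypothetical per-workflow ChatOps settings (illustrative only)."""
    enabled: bool = False                               # toggle: may this workflow run from chat at all?
    allowed_rooms: set = field(default_factory=set)     # empty set means "any room"
    required_roles: set = field(default_factory=set)    # empty set means "no role required"


def can_run_from_chat(policy, room, user_roles):
    """Return True only if the toggle, room limit, and role check all pass."""
    if not policy.enabled:
        return False
    if policy.allowed_rooms and room not in policy.allowed_rooms:
        return False
    if policy.required_roles and not (policy.required_roles & set(user_roles)):
        return False
    return True


# Example: deployments may only be triggered from #ops by someone with the "deployer" role.
deploy_policy = ChatOpsPolicy(enabled=True, allowed_rooms={"#ops"}, required_roles={"deployer"})
```

In a real deployment the roles would presumably come from an external directory (LDAP, OAuth, and so on) rather than being passed in directly.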

To reiterate: ChatOps is our bottom-up story. That means we have to be solid here to entice operators and developers to be excited about using our platform. I shared this with Evan and Dmitri at dinner the other night - for years, PuppetLabs sold their product with a single story… “Install Puppet and manage sudoers”. While simple, it got them in the door and on the road to bigger things.

I have another note in progress about a potential top-down story. More on that shortly.

manasdk commented 10 years ago

This is a great write-up. Thanks for sharing your thoughts.

I guess it is fair to say that we are all keen on enabling ChatOps via StackStorm. The video link you posted certainly convinces me that ChatOps is a critical point for StackStorm to weave itself into daily operations.

epowell101 commented 10 years ago

James

Great thoughts. Again, welcome!

It'll take a little bit to get you up to speed on all we've learned in the last year and on the difficult trade offs we have had to make about priorities thus far. We are aligned in general on opportunities and requirements otherwise we wouldn't be working together :)

More specifically, as Manas points out, I think we are all supportive of the power of ChatOps and keen to support it. And I also think we share a sense that we have lots more to do than we have resources and time to get it done right now; hello start-up reality! The trick is to make the tough trade offs with rigor and transparency.

Before getting to the brass tacks of prioritizing to-dos, a little context. Specifically - who do we think are our users and customers?

There are at least two axes one can use to answer that question: 1. Personas - who are these people? and 2. Firms - where do they work?

Personas. Who exactly are the users we are after? After some discussion, much of which is documented on Confluence, we've arrived at three focus personas: 1. The VP of Tech Ops. 2. The ops guy. This is a consumer of automation who wants their 2am pages to be a) worth waking up for and b) easier to resolve thanks to context. Note that we need to determine who is the equivalent of the ops guy in the CI/CD use case - is it the QA operator or just generically someone less technical than persona 3? 3. The SRE. This is basically you or Patrick or anyone else who has responsibility for building and managing an increasingly automated environment.

We have used these personas to craft messaging and our product direction. For example, when faced with the reality that in the time allotted before Paris we would not be able to address all personas, we decided to focus marketing slightly more broadly but to narrowly focus recent product development on SREs. This meant we left out certain capabilities that we thought mainly appealed to the VPs and the ops guys in favor of those that we thought could make StackStorm 0.5 appealing to SREs.

Firms: Crossing the chasm and mainstream DevOps adoption. The dynamics of crossing the chasm are profoundly important to understand the creation and focus of start-ups. Much hinges on how to choose the right early adopters in such a way that the product offering and overall company brand can, over time, be made more appealing to the mainstream while setting the stage for this chasm crossing through strategically sound positioning and messaging. By thinking about market dynamics, we add another dimension to the question of persona focus because now we are talking about early users being SREs at early adopters whereas eventually - perhaps in ~24 months - I believe we'll be scaling StackStorm by selling larger licenses to VPs of Ops in financials and other forward leaning enterprises of the much more mainstream global 2000.

So - what do we do now? When rank ordering next features, do we put RBAC for example ahead of ChatOps? Or do we put GUI ahead of YAML replacing JSON for instance?

The above context will help us make these decisions. Let's be crisp about who we are targeting w/ each feature - and also put each capability into a view of total product (shameless plug for reviewers to take a look at my total product check list on Confluence) while also incorporating potential bang for the buck (how hard is it to get something that delights users).

With this in mind, we would put a table together similar to the following (historic examples are available from the board decks on Confluence and all '+'s must be the result of everyone's input, this is just an example):

| Feature | SRE appeal | VP of Ops appeal | Ops appeal | Early early adopter appeal | Total product fit (one of the use cases we want to support end to end) | Cost / benefit (how easily could we do something cool) |
| --- | --- | --- | --- | --- | --- | --- |
| ChatOps | +++ | ++ | + | ++ | +++ | ++ |
| RBAC | ? | + | ? | + | ++ | ? |
| GUI 1.0 | ? | + | +++ | ? | ++ | +++ |
| MANY MORE HERE... | | | | | | |

And using the same approach, we could rank ISV integrations such as:

| Integration | SRE appeal | VP of Ops appeal | Ops appeal | Early early adopter appeal | Total product fit | Cost / benefit (for example, we think New Relic is friendly to us) |
| --- | --- | --- | --- | --- | --- | --- |
| New Relic | ++ | + | ++ | ++ | +++ (supports more than 1 IMO) | +++ |
| Electric Cloud | ? | + | ? | + | + | ? |
| MANY MORE HERE... | | | | | | |

While the above are just examples, my sense is that actual analysis will rank ChatOps at or very much towards the top of the list for next priorities as per my rough example.

However, I think the rigor and transparency of an approach to setting priorities substantially similar to the above will serve us well.

Happy Tuesday! Welcome to your 7th day as a Stormer.

Evan

m4dcoder commented 10 years ago

Excellent thoughts here and use of github for async discussions. Maybe we should add skype to the list of chat integrations above? http://venturebeat.com/2014/11/11/microsoft-will-replace-lync-with-skype-for-business-in-the-first-half-of-2015/

lakshmi-kannan commented 10 years ago

I liked the entire discussion and +1 to github style as opposed to Wiki.

I want to play devil's advocate and express my problems with chat ops for serious production debugging and remediation. I've worked in two distinct kinds of environments:

(1) A ticket-based system (think PagerDuty but better) where a primary on-call is assigned based on a schedule. This system is publicly available across the company. All information about the problem is consolidated in the ticket.

(2) Chat ops is the source of truth for production issues, with the ticketing system just there for tracking whether the problem was resolved or not.

My biggest problem with (2) is the amount of chaos and lack of ownership. This is not necessarily a bad thing for certain orgs but for some serious enterprise customers, it might be. Plus, establishing the timeline of events and what actions were taken becomes really, really hard because of the noise. (This problem can be solved if st2 is the only way to invoke actions.) The other bad thing is that people outside the engineering/DevOps teams cannot reasonably understand the status of resolution. For example, a project manager whom we worked with constantly came to one of us and it was on us to give her the exact words that needed to be sent out to customers. Many of these boil down to establishing a clear and concise timeline of events, observations and actions. In a ticketing system, the primary on-call is responsible for outlining these as well as solving the problem. She/he might seek help in chat. Chat ops for me is mostly confusing because people invoke actions they are not supposed to. This might be a cultural issue, and usually we enforce a two +1s rule while debugging production issues.

So I'd want us to keep in mind that a good chat ops integration doesn't just stop with invoking actions from chat. It has to cover much more to be the best.

@epowell101:

Or do we put GUI ahead of YAML replacing JSON for instance?

FWIW, we support both yaml and json now. This was fixed end of last week :).

However, I think the rigor and transparency of an approach to setting priorities substantially similar to the above will serve us well.

I'd like to see this. I can understand that intuition sometimes takes over from a set of guidelines, but let's please establish some guidelines. I am hoping you guys have more input from the Paris conference.

DoriftoShoes commented 10 years ago

ChatOps is amazing for the live context but does get messy when there are questions and/or irrelevant comments mixed in with event comments. This is where our history/audit comes into play. We give them a rich set of chronologically ordered events without random chat comments.

jfryman commented 10 years ago

Maybe we should add skype to the list of chat integrations above? http://venturebeat.com/2014/11/11/microsoft-will-replace-lync-with-skype-for-business-in-the-first-half-of-2015/

@m4dcoder Oh wow! TIL. Then yes, Skype too. Anecdotally, a good number of participants at OpenStack Paris said their chat platform was Skype. Seems like all signs are pointing that way. :+1:

dzimine commented 10 years ago

General Comments

+1 to epowell101 on specifics on discipline about priorities and selecting solutions.

I see ChatOps as a solution, maybe one of the most important, and one of the key operation patterns that we can lead with. It will likely rank high; let's see, as Evan suggested.

Some details on ChatOps

@jfryman how do you see the priorities within ChatOps? E.g., is Hubot enough for now? No? What shall trigger our going after Lita and Err?

ChatOps attributes built into every workflow .... Ensure that ChatOps output is possible without too much effort.

Good point, we need to think of "platform level support" for integrating workflow with ChatOps, so that some ChatOps output will be "turn-key" once ChatOps is enabled (better than manually adding hubot-say to every step).

Surprise and Delight! The way we get traction here is to make the experience magical out of the box.

I agree we want it, but how do you think we go about it? If we assume users already have app deployment/monitoring/graphing tools, will it be a set of integrations and patterns which they need to adjust to their environment?

jfryman commented 10 years ago

@jfryman how do you see the priorities within ChatOps? E.g., is Hubot enough for now? No? What shall trigger our going after Lita and Err?

I think it would suffice to have only Hubot as we prioritize. I like to think that we go after the other bots after we have Hubot integration down pat. I think Hubot is going to be our first :hammer:, and we're going to learn a lot about all the nuances of integration as we go down the road. This is assuming nothing pops out to us between now and then... dat #startuplife. :grin:

I agree we want it, but how do you think we go about it? If we assume users already have app deployment/monitoring/graphing tools, will it be a set of integrations and patterns which they need to adjust to their environment?

Yes. I envision introspection being a first driver. Part of the adoption of these tools is to start out with passive, non-threatening, non-destructive actions in the queue. I'd love to brainstorm about a limited set of tools that we go after in the first round (/cc @DoriftoShoes and anyone else interested in rapping about this). We might also be able to ask our first-round friends/customers what their monitoring/logging/graphing stack looks like, and tailor it to them. Get them excited about what's coming by providing value now.

The introspection story will be relatively simple, but we should come up with stories. I certainly have a set in my head, but again I'd love to collaborate a bit on that initial story set and see what might make most sense.

This will also converge with the CI/CD story, which should also be ChatOps'd. Power to deploy, power to inspect. Once these are done, we can start aligning more complex workflows with some of our customers. I think these stories will start to naturally highlight themselves as we get further down this road.

DoriftoShoes commented 10 years ago

The introspection workflow falls into my diagnostic vs remediation workflow model almost perfectly. Essentially giving them access to only diagnostic or data gathering workflows initially. This comes back around to a fundamental piece within the system that we need though...tagging. Without the ability to tag it is really hard to provide access controls at the action/workflow level.
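To make the tagging idea concrete, here is a small sketch of how tags could gate which actions a chat user sees. The action names, the `diagnostic` and `remediation` tags, and the `actions_allowed` helper are all hypothetical illustrations and do not describe an existing StackStorm feature.

```python
# Hypothetical action registry: each action carries a set of tags.
ACTIONS = {
    "check_disk": {"tags": {"diagnostic"}},
    "tail_logs": {"tags": {"diagnostic"}},
    "restart_service": {"tags": {"remediation"}},
}


def actions_allowed(actions, allowed_tags):
    """Return the sorted names of actions that carry at least one allowed tag."""
    return sorted(
        name for name, meta in actions.items()
        if meta["tags"] & set(allowed_tags)
    )


# A new team might start with read-only, data-gathering actions only.
starter_set = actions_allowed(ACTIONS, {"diagnostic"})
```

Widening access later would then be a matter of adding `remediation` to the allowed tags rather than listing actions one by one.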

Even earlier than that problem though, we need to address the usability issues around chatops (maybe hubot specific, I don't know). Our platform is quite powerful and exposes a lot of metadata about actions. We need an easy way to view and use that data from within chat. The current method of listing parameters in help and giving them values at run time is not a pleasant experience.

I actually feel that the chat clients should emit triggers into the system. We can then fine tune controls, and tweak outputs accordingly. We could also build a much cleaner 'help' from within the chat client, and even do chat service specific formatting server-side before sending the data back.
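One way to picture chat clients emitting triggers, with formatting done server-side, is sketched below. The trigger name `chatops.message`, the payload fields, and the per-service formatting choices are assumptions made for illustration; they do not reflect an actual StackStorm or chat-service API.

```python
def message_to_trigger(service, room, user, text):
    """Wrap a raw chat message in a generic trigger payload (hypothetical schema)."""
    return {
        "trigger": "chatops.message",
        "payload": {
            "service": service,
            "room": room,
            "user": user,
            "text": text.strip(),
        },
    }


def format_reply(service, title, body):
    """Format a reply on the server side, per chat service (illustrative styles only)."""
    if service == "slack":
        return f"*{title}*\n{body}"                      # asterisk-style bold
    if service == "hipchat":
        return f"<strong>{title}</strong><br>{body}"     # HTML-style message
    return f"{title}: {body}"                            # plain-text fallback, e.g. IRC


event = message_to_trigger("slack", "#ops", "jane", "  !help  ")
reply = format_reply(event["payload"]["service"], "help", "!disk <host>")
```

With this split, a bot integration only forwards messages and prints replies, which is what would keep new chat services cheap to add.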

jfryman commented 10 years ago

I actually feel that the chat clients should emit triggers into the system. We can then fine tune controls, and tweak outputs accordingly. We could also build a much cleaner 'help' from within the chat client, and even do chat service specific formatting server-side before sending the data back.

+:100:

The more comprehensive our back-end is for handling chatops, the simpler the bot/service integrations can be. Should make it easier to plug in new services as customer demands require us to do.

DoriftoShoes commented 10 years ago

My initial thoughts actually go the other way. Let's initially treat chat messages as their own triggers. No different than anything else posted to the webhook...but it is two fold.

STEP 1) I want the incoming messages from chatops treated like an 'event' that triggers actions. Right now they trigger actions directly. This way we can easily set up rules that allow certain actions but not others. Next, we write an action that returns the help info formatted specifically for the requesting chat client.

STEP 2) We improve our native notification mechanism. Whether it is chat, or email, or Jira, we should be able to send the output of each action in a workflow (or single action) out to an external source without having to call a separate action...but I do not view this as chatops specific. I think this should work for whatever platform the end user wants as their 'collaborative notification' tool.

EDIT: My basic premise is let's start by NOT treating chatops as its own separate entity and actually treat it like an event. Then we figure out how to natively integrate notification mechanisms in our actionrunners.

But I am pretty sure we will need tagging for all of this.
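The two steps above can be sketched roughly as follows: chat messages are treated as events evaluated against rules, and action output is fanned out to whatever notification sinks are configured. The rule format, the `!disk` and `!help` commands, and the `evaluate` and `notify` helpers are hypothetical and only illustrate the shape of the idea.

```python
import re

# Hypothetical rules: only messages matching a pattern map to an action.
RULES = [
    {"pattern": re.compile(r"^!disk (?P<host>\S+)$"), "action": "check_disk"},
    {"pattern": re.compile(r"^!help$"), "action": "show_help"},
]


def evaluate(event, rules=RULES):
    """Return (action, params) for the first matching rule, or None if no rule allows it."""
    for rule in rules:
        match = rule["pattern"].match(event["text"])
        if match:
            return rule["action"], match.groupdict()
    return None


def notify(sinks, action, output):
    """Send one action's output to every configured sink (chat, email, Jira, ...)."""
    for sink in sinks:
        sink(f"[{action}] {output}")


# Usage: a list stands in for a real chat or email sink.
sent = []
decision = evaluate({"text": "!disk web01"})
if decision is not None:
    notify([sent.append], decision[0], "42% used")
```

Because a message with no matching rule simply yields `None`, disallowed commands never reach an action, which is the control point the direct bot-to-action path lacks.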

DoriftoShoes commented 10 years ago

Yeah we are probably violently agreeing on this one...hahahha. We can do a Hangout to spitball this a bit more.
