StackStorm / community

Async conversation about ideas, planning, roadmap, issues, RFCs, etc around StackStorm
https://stackstorm.com/
Apache License 2.0

Proposal: Move away from JS-based st2chatops #8

Closed blag closed 11 months ago

blag commented 4 years ago

What

Identify a path for st2chatops that:

Why

The existing approach is:

How

The Purpose of this Issue

Discussion

Initial Post by @blag

Our ChatOps microservice is just...weird compared to all of our other microservices. It's written in asynchronous Javascript, so it can't share any code with the other normal Python-based microservices, and it must duplicate a lot of the code found in our Python st2client. This leads to more complicated development and release processes, since we have to keep st2client.js, hubot-stackstorm, and st2chatops all up-to-date and in-sync with the rest of the ST2 repositories.

I propose that we look into Python-based ChatOps bots as a base for a future st2chatops. Two that people have told me about are ErrBot and OpsDroid; both are Python projects.

I'd like to give a huuuuuge shoutout to @nzlosh, as almost all of this is based on conversations I've had with him, and he has done a lot of investigating into this subject on my behalf. Thank you for your work!

punkrokk commented 4 years ago

I would throw out there that I might desire to see discussion around the js client and how to automate builds of it. Keeping that around would be a benefit for the front end, right?

m4dcoder commented 4 years ago

@blag Given the current resource constraints and that the team is mostly Python-centric, let's go for it if this will continue to push the ChatOps feature forward in st2. I'd like to see RBAC, inquiries integration, and conversation/thread support in ChatOps. The only concern here is the license scheme for these projects. As for @punkrokk's feedback, let's delegate that to the st2web/st2flow UI discussion. I could be wrong, but I don't think st2chatops, hubot-stackstorm and st2web share the st2 client (at least from reviewing their package.json files).

nmaludy commented 4 years ago

I'm in support of this! Also concerned about GPL with ErrBot as @m4dcoder said.

Would love to get more python centric and make it easier to add new features to the bot framework!

arm4b commented 4 years ago

I wasn't a fan of changing the ChatOps engine before, as I care about full backwards compatibility for existing ChatOps users and about the consequences of switching to a new platform. For years we built a user experience around Hubot that's stable enough, and some of our users might be tightly integrated with it. I also believe that the Hubot community is overall richer and more popular compared to the others.

I don't think there is a problem with how the st2chatops dependencies are designed or work, or with code duplication or sharing. For example, I don't see how https://github.com/nzlosh/err-stackstorm re-uses the StackStorm Python st2client code. It re-implements ST2 API consumption as well, same as the current st2chatops + st2client.js does.

The root problem here is that we don't have anyone who could drive the javascript-based Hubot st2chatops development (previously it was @emedvedev). That's the only real issue.

Another important part is the migration path for users and the potential consequences. Would it be possible to switch platforms with little to no bad impact or incompatibility? How would the new thing fit the current st2 picture, how much work is required, and what might that diff eventually look like for our community?

If we're switching the ChatOps wheels, which is already a high-stakes move, I'd try to take the best we can from the next-gen OpsDroid and the NLP/NLU extensions blag was mentioning, on top of what we already have in ChatOps syntax. After all, in 2020 you expect more from bots. I'm wondering what that experience might look like in our context and how it would fit StackStorm. I'd suggest everyone go deeper and explore these 2 platforms more closely to understand the pros/cons and get a feel for them. After getting a better picture and trying the prototypes, I think it also makes sense to get more feedback from our community (how about a ChatOps user survey?).

nzlosh commented 4 years ago

I don't know the exact concerns @nmaludy and @m4dcoder have in regard to the GPL3 licence but I'll mention the concerns @blag and I had with switching to errbot (GPL3) with err-stackstorm (Apache 2.0).

Disclaimer: I'm not a lawyer, this is my own understanding of how the GPL3 licencing applies.

The two key issues identified were:

Bundling errbot in chatops package

Will distributing GPL3 licenced software in the same package as non-GPL3 licenced software risk having the other software tainted?

Bundling differently licenced software doesn't require their licences to be reconciled with each other, based on section 4, "Conveying Verbatim Copies":

You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program.

You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee.

errbot GPL3 licence transitivity on non-GPL3 errbot plugins

According to https://www.gnu.org/licenses/gpl-faq.en.html#GPLPlugins the GPL3 considers errbot plugins to be "combined work". Despite this clause, the errbot project has explicitly included an exception to allow plugins, scripts and addons not bundled with errbot to use any licence they want.

It is our opinion that this constitutes intent on the part of the errbot authors to allow plugins to not be considered "combined work" and would not be required to have the GPL3 licence imposed.

Errbot's explicit allowance of non-GPL3 licences

https://github.com/errbotio/errbot/blob/master/gplv3-exceptions.txt

As a special exception, the copyright holders of Errbot hereby grant permission for plug-ins, scripts or add-ons not bundled or distributed as part of Errbot itself and potentially licensed under a different license, to be used with Errbot, provided that you also meet the terms and conditions of the licenses of those plug-ins, scripts or add-ons.

To address the point @armab raised around err-stackstorm's lack of use of st2client, I'll explain why:

err-stackstorm did use st2client for many years, until st2client pinned a version of the requests module that conflicted with the version of requests required by the chat backends. Since err-stackstorm had no claim or weight in the StackStorm project, it was easier for me to remove st2client and call the St2 API directly rather than spend scarce time and energy lobbying to remove the requests constraint. This is historic; the conflict probably no longer exists, and migrating to st2client within err-stackstorm would be trivial.
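For illustration, calling the API directly can be as small as the sketch below. It is not err-stackstorm's code: it uses only the standard library instead of requests, and the `/api/v1/actionalias` path and `St2-Api-Key` header are assumptions that should be checked against the StackStorm API documentation for your version. The host name and key are placeholders.

```python
import json
import urllib.request


def build_alias_list_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request that lists action aliases.

    Assumptions: the API is exposed under /api/v1 and accepts an API key
    in the St2-Api-Key header.
    """
    url = base_url.rstrip("/") + "/api/v1/actionalias"
    headers = {"St2-Api-Key": api_key, "Accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_aliases(base_url: str, api_key: str):
    """Send the request and decode the JSON body (needs a reachable server)."""
    with urllib.request.urlopen(build_alias_list_request(base_url, api_key)) as resp:
        return json.load(resp)
```

Keeping request construction separate from sending makes the HTTP details easy to inspect without a running StackStorm instance.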

As a StackStorm user I have always considered the requirement of installing/maintaining nodejs with StackStorm aberrant. Being able to debug issues, contribute patches and understand how StackStorm functions is important to me. Context switching between Python and Javascript is inefficient when not working with both languages on a regular basis. Keeping focused and reducing maintenance effort by using Python only is what err-stackstorm was about for me.

I agree that the user experience will be impacted in transitioning away from hubot. This impact would be less significant than switching away from Mistral to Orquesta. Action-Aliases would remain unchanged, the only users that would be significantly impacted are those that wrote their own hubot plugins.

It wouldn't be too difficult to provide both hubot and python-based bot packages during a transition period to ease user migration pain, letting users decide which bot to use until hubot is phased out.

cognifloyd commented 4 years ago

I would like to see us move to a chat platform that allows for more interactive bot usage such as drop down or calendar selectors, text/numeric input fields, etc. Consider these slack features: https://api.slack.com/block-kit/interactivity. I was able to inject some parts by using extra and "attachments"[1] but that kind of messaging only goes so far. I'd really like to present a formatted inquiry where the bot can say "which client is this for?" with a dropdown list of clients that they can select from, or "which servers are you targeting?" with a multi-select drop down populated with the servers that are relevant for the context of the active workflow. Hubot doesn't provide the ability to offer that level of interaction.

Do ErrBot or OpsDroid offer more advanced interactivity/interface options?

[1] https://docs.stackstorm.com/chatops/aliases.html#passing-attachment-api-parameters-slack-mattermost-and-rocketchat-only

cognifloyd commented 4 years ago

Glancing through the ErrBot docs, I see a "cards" feature that allows for using Slack's attachments (like with hubot), but not something for more interactive inquiries.

edit to add relevant issue links

issues related to getting this with errbot:

Oh, and I filed this awhile back. Hmm. looks like I need to respond...

This issue only addresses adding support for blocks which is a pre-requisite to interactivity. But, interactive prompts would still be beyond hubot to support.

cognifloyd commented 4 years ago

It looks like OpsDroid supports slack's interactive Blocks: https://docs.opsdroid.dev/en/stable/connectors/slack.html#interactive-actions

So, unless ErrBot supports ~this~ blocks (including interactive action blocks), I would love to see us go with OpsDroid which DOES support interactive action blocks.

Plus the possibility of using NLP would be awesome!

nzlosh commented 4 years ago

Errbot's cards are not intended for interactive inquiries. Errbot has the notion of flows https://errbot.readthedocs.io/en/latest/user_guide/flow_development/concepts.html which may be able to provide interactive actions without being tightly bound to Slack.

While I think it'd be great to have more interactivity between the chat backend and StackStorm, I'm not a fan of tightly binding StackStorm functionality to Slack's feature set. I think there's room to review how chatops is done in StackStorm but drop down or calendar selectors, text/numeric input fields, etc. aren't bot features, they're chat backend features. Currently Errbot doesn't have Slack blocks support.

Errbot was built with the philosophy that when the bot provided a feature it should be available and behave the same way regardless of chat backend. This allows the same bot behaviour to be maintained between, xmpp, slack, mattermost, rocketchat, gitter, and discord. This can be seen in the templating support https://errbot.readthedocs.io/en/latest/user_guide/plugin_development/messaging.html#templating that will emulate tables on backends that don't natively support them. (the formatting can be ugly but it works).

cognifloyd commented 4 years ago

Re: supporting MS Teams in OpsDroid - The OpsDroid folks seem open to a PR to add support for it. One person started trying to add support using the BotFramework but found that BotFramework and OpsDroid conflicted (surprise surprise).

So, if this is something we would need to tackle, then the OpsDroid connector for Teams will need to use the underlying APIs (instead of BotFramework) described here: https://docs.microsoft.com/en-us/azure/bot-service/rest-api/bot-framework-rest-overview and there's a Swagger 2.0 spec for it, so we could generate a python client lib for it: https://github.com/microsoft/botframework-sdk/blob/master/specs/botframework-protocol/directline-3.0.json

edit: ~WIP~ merged Teams support for OpsDroid at https://github.com/opsdroid/opsdroid/pull/1679

cognifloyd commented 4 years ago

I'm not a fan of tightly binding StackStorm functionality to Slack's feature set.

Makes sense. Is there similar interactivity in other chat platforms? If there's something similar maybe we can find a good abstraction to enable more interactive inquiries via chatops.

Then again, maybe a new chatops sensor could allow for more interactive flows in some platforms? The interactions are just "events".

cognifloyd commented 4 years ago

One of my issues is that I have designated a slack channel for a particular st2 workflow, but specifying the input parameters is unwieldy because it requires the Tier 1 Tech Support person type the (case sensitive) target customer's name. It would be so much nicer to run some action alias and then have an inquiry where the options are presented in a drop down. For other chat platforms, that could be done by sending a message (an ephemeral message for platforms that support that) with a list of the valid options that the user could copy/paste.

If we could use ErrBot Flows to achieve that interactivity that would be cool, and then maybe the slack adapter could convert that into slack native action blocks where it makes sense.
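As a rough sketch of that idea, the function below renders the same prompt either as a Slack Block Kit `static_select` menu or as a plain numbered list for other backends. The function name, the `backend` label and the `action_id` value are hypothetical; nothing here is taken from ErrBot, OpsDroid or st2chatops.

```python
from __future__ import annotations

from typing import Any


def render_choice_prompt(question: str, options: list[str], backend: str) -> dict[str, Any]:
    """Render an inquiry prompt for a chat backend (hypothetical helper)."""
    if backend == "slack":
        # Slack Block Kit: a section block with a static_select accessory.
        return {
            "blocks": [
                {
                    "type": "section",
                    "text": {"type": "mrkdwn", "text": question},
                    "accessory": {
                        "type": "static_select",
                        "action_id": "inquiry_choice",  # hypothetical identifier
                        "placeholder": {"type": "plain_text", "text": "Select an option"},
                        "options": [
                            {"text": {"type": "plain_text", "text": opt}, "value": opt}
                            for opt in options
                        ],
                    },
                }
            ]
        }
    # Fallback for backends without menus: a numbered list to copy/paste from.
    lines = [question] + [f"{i}. {opt}" for i, opt in enumerate(options, start=1)]
    return {"text": "\n".join(lines)}
```

Handling the user's selection (an interaction event from Slack, or a typed reply elsewhere) is a separate problem that this sketch does not cover.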

cognifloyd commented 4 years ago

Quick comparison based on these docs (plus some googling):

| Chat Service | ST2chatops w/ Hubot | ErrBot | OpsDroid |
| --- | --- | --- | --- |
| Slack | :heavy_check_mark: Official | :heavy_check_mark: included in core | :heavy_check_mark: included in core |
| Microsoft Teams | :white_check_mark: Official (awkward use of BotFramework) | :grey_question: https://github.com/errbotio/errbot/issues/1239<br>:white_check_mark: via BotFramework | :white_check_mark: only uses internal API SDK from BotFramework (avoiding the rest of BotFramework)<br>https://github.com/opsdroid/opsdroid/pull/1679 was just merged. Not released yet. |
| Mattermost | :heavy_check_mark: Official (v5) | :white_check_mark: external backend | :heavy_check_mark: included in core |
| Rocket.Chat | :heavy_check_mark: Official | :white_check_mark: external backend | :heavy_check_mark: included in core |
| Webex Teams (Cisco Spark) | :heavy_check_mark: Official for Cisco Spark | :white_check_mark: external backend for Webex Teams | :heavy_check_mark: included in core |
| Flowdock | :white_check_mark: Provided but unsupported | :x: https://github.com/errbotio/errbot/issues/169 | :x: |
| XMPP | :white_check_mark: Provided but unsupported | :heavy_check_mark: included in core | :x: |
| IRC | :white_check_mark: Provided but unsupported | :heavy_check_mark: included in core | :x: |
| Hipchat | :x: Support dropped | :heavy_check_mark: included in core | :x: |
| Telegram | :x: | :heavy_check_mark: included in core | :heavy_check_mark: included in core |
| CampFire | :x: | :white_check_mark: external backend | :x: |
| Discord | :x: | :white_check_mark: external backend | :x: |
| Facebook Messenger | :x: | :x: https://github.com/errbotio/errbot/issues/715 | :heavy_check_mark: included in core |
| GitHub | :x: Not as a chat service | :x: | :heavy_check_mark: included in core |
| Gitter | :x: | :white_check_mark: external backend | :heavy_check_mark: included in core |
| Matrix | :x: | :white_check_mark: external backend | :heavy_check_mark: included in core |
| Skype | :x: | :white_check_mark: external backend | :white_check_mark: external https://github.com/koodaamo/opsdroid-skype |
| tox.chat | :x: | :white_check_mark: external backend | :x: |
| Vk | :x: | :white_check_mark: external backend | :x: |
| Zulip | :x: | :white_check_mark: external backend | :x: |

punkrokk commented 4 years ago

I am observing MSTeams gaining significant market traction, in two ways: vs Slack and vs. Zoom. FWIW

cognifloyd commented 4 years ago

These platforms have some kind of interactivity (menus seems unique to slack and mattermost, buttons is more common, teams and facebook allow iframes to load just about anything):

An interesting survey from Rocket.Chat on what platforms have interactive components as they look to add the common interactive components: https://github.com/WideChat/Rocket.Chat.Android/wiki/A-Survey-of-Rich-Messaging-in-Chatbots

These platforms would require textual replacements for any interactive components:

Dead platforms:

blag commented 4 years ago

The biggest roadblock with supporting MS Teams (at least when I looked into it) is that they only support webhooks for delivering messages to your bot, meaning that our users would have to open up a hole in their firewall, or do some networking magic, to securely use ChatOps with Teams. Literally every other chat provider that I'm aware of supports message delivery over websockets.

I was not impressed with the MS Teams developer experience. They seem very top-down managed, and I don't think that interfaces well with how we do ChatOps. We support a bottom-up approach to ChatOps, since we let our users define their own custom ChatOps aliases/commands, as limited as they may be. And that's not something that I've really seen anybody else try to support. MS Teams ChatOps seems to be based around an app market where "send all of your chat data to this third party to enable them to work their own natural language processing magic" is the normal way of things.

So when I was looking into it, MS Teams didn't integrate well. Due to that, and the fact that our current adapter is incredibly limited and just barely supported to begin with, I don't really care if we lose integration with MS Teams when we switch over to something else. I don't think we can reasonably consider it well supported by us at this point anyway. It is "officially supported", but it's not "well supported".

blag commented 4 years ago

I would throw out there that I might desire to see discussion around the js client and how to automate builds of it. Keeping that around would be a benefit for the front end, right?

@punkrokk No, I don't think it's worth it to automate that release workflow, for a few reasons:

  1. I don't think st2web or st2flow use st2client.js, because...
  2. st2client.js is marked as deprecated and unsupported (and it is, even if we do actually keep it in barely "working" status)
  3. I don't think st2client.js releases require much automation: npm test and npm publish, and the last one requires our NPM credentials.
  4. The releases for st2client.js, and even hubot-stackstorm, don't have release cadences that follow the rest of StackStorm, and they don't follow StackStorm version numbers either.
blag commented 4 years ago

I would also point out that MS Teams might be growing their userbase, but Slack and Webex already have large userbases, so comparing growth rates may not be a valid approach when considering this.

blag commented 4 years ago

Regarding chat provider support, it looks like for the five that we actively care about:

All of those are more-or-less supported to the same extent by hubot, Errbot, and Opsdroid. All of the other chat providers are nice to have, but shouldn't sway us in one direction or the other.

blag commented 4 years ago

Regarding feature support, particularly support for rich or interactive cards or whatever, we will still have to build support for that into our adapter no matter what, as hubot-stackstorm does not explicitly support that yet either.

cognifloyd commented 4 years ago

hubot-stackstorm does not explicitly support [rich or interactive cards] yet either.

Yes. And getting support into hubot is a much bigger reach than in Errbot and OpsDroid. Once we've switched to another chat framework, the development effort in the st2chatops layer should decrease compared to adding the same feature with Hubot.

As far as features go that st2chatops uses and needs right now, a key feature in st2-hubot is keeping the list of aliases up-to-date. What err-stackstorm does is nicer in this regard as it doesn't have to poll ST2. Here is a brief comparison of implementations (or possible implementations for OpsDroid):

For the NLP/AI features of OpsDroid to make a difference we have to be able to pass our action-alias formats into OpsDroid somehow so that it can run relevant matchers. Maybe we could do something like this gist to pass all of the st2 action alias formats into OpsDroid.

Continuing that line of thought, with both err-stackstorm and OpsDroid, it would be interesting to move some part (all?) of the action-alias recognition into the bot itself to reduce complexity.
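To make bot-side recognition concrete, here is a minimal sketch that turns alias format strings into regular expressions and matches a chat message against them. It is an illustration only, not how st2, err-stackstorm or OpsDroid actually match aliases: it handles `{{name}}` and `{{name=default}}` placeholders but ignores defaults, and both function names are hypothetical.

```python
from __future__ import annotations

import re

# Matches "{{name}}" or "{{name=default}}" placeholders in a format string.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*(?:=\s*(.*?))?\s*\}\}")


def compile_alias_format(fmt: str) -> re.Pattern:
    """Turn a format like 'run {{cmd}} on {{hosts}}' into an anchored regex."""
    parts = []
    pos = 0
    for m in _PLACEHOLDER.finditer(fmt):
        parts.append(re.escape(fmt[pos:m.start()]))  # literal text between placeholders
        parts.append(f"(?P<{m.group(1)}>.+?)")       # named group per parameter
        pos = m.end()
    parts.append(re.escape(fmt[pos:]))
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)


def match_alias(formats: list[str], message: str):
    """Return (format, params) for the first matching format, else None."""
    for fmt in formats:
        m = compile_alias_format(fmt).match(message.strip())
        if m:
            return fmt, m.groupdict()
    return None
```

For example, `match_alias(["run {{cmd}} on {{hosts}}"], "run date on web1,web2")` yields the format together with `cmd` and `hosts` extracted from the message.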

edit: to add additional links about OpsDroid as I find them

nzlosh commented 4 years ago

As I understand things, @Kami wanted to keep ChatOps logic in the st2 core so that bots could be kept as simple as possible (https://github.com/StackStorm/st2/issues/3770#issuecomment-332848202). This was the reason err-stackstorm uses the st2 api as much as possible without having a lot of decisional processing happening bot side. This is great for keeping StackStorm ChatOps features bot agnostic as much as possible but I think it comes at the cost of stifling ChatOps rate of evolution because the bar for entry to the St2 core is high.

The only exception to the rule is err-stackstorm's chatops authentication mechanism that was developed to allow chat users to authenticate with their StackStorm credentials to run action-alias as an actual st2 user and not as the bot. (https://err-stackstorm.readthedocs.io/en/latest/authn.html#authentication)

Errbot also comes with native ACL support. Opsdroid doesn't appear to have this functionality. If I remember correctly, hubot has a plugin to add ACL support. Errbot's ACL features allow fine control over which user can execute which command in which channel. https://err-stackstorm.readthedocs.io/en/latest/authz.html#errbot-access-control-list

I haven't seen a lot of demand for NLP in the Slack community channels and while errbot doesn't offer this out of the box, errbot does have command filters (https://errbot.readthedocs.io/en/latest/user_guide/administration.html?highlight=filter#command-filters) that are developed as plugins. The effort to add NLP support to errbot would be as trivial as adding a pack to StackStorm IMO.

arm4b commented 4 years ago

One of the old-era ChatOps limitations of Hubot is that it's not possible to run it in HA mode. This is a missing brick in the StackStorm HA story; it was requested several times in the community and is one of the ChatOps adoption difficulties for larger orgs.

It's one of the features I'm also looking for when considering new platforms. I couldn't find any evidence of HA in either Errbot or OpsDroid so far.

There is a K8s Helm chart for OpsDroid (https://github.com/opsdroid/helm-chart); however, it relies on a single hardcoded replica, which shows the lack of HA capabilities.

I'll keep the research going, but if anyone has more context about HA capabilities/workarounds/potential around Errbot and OpsDroid, - please share your findings.

nzlosh commented 4 years ago

I asked the question around HA in the errbot community in early 2019 and there were a few suggestions (more or less what opsdroid's done for HA). Unfortunately, I didn't file an issue on GitHub:

Carlos @nzlosh Jan 30 2019 11:37 Where are things at with errbot supporting High Availability?

Andrew Herrington @andrewthetechie Jan 30 2019 15:26 @nzlosh that is an exercise left to the end user.

We deploy our errbot via kubernetes and let it handle availability for us. I started on a plugin, using cmdfilters and an external zookeeper to do HA but it was buggy and not a great way to go about it.

Carlos @nzlosh Jan 30 2019 15:36 @andrewthetechie are you saying that if you can stand up two instances of errbot, it's just a matter of putting some sort of load balancer in front to select an "active" instance and the rest is handled by errbot?

Sijis Aviles @sijis Jan 30 2019 17:15 If you have shared config storage (https://github.com/errbotio/err-storage-sql, https://github.com/sijis/err-storage-redis, etc) and shared plugin directory (NFS, EFS, etc), you should be able to do HA and have 2 errbot instances behind an ELB. I can't see a reason why not. This should work from a webhooks perspective. My only concern would be a chat reply. Would a person get 2 replies or 1? My speculation they will get 2 responses. The reason is that each instance will be connected to the Slack (for example) backend and will both see the incoming message and thus reply.

Kirk Bater @iamkirkbater Jan 30 2019 17:51 Yeah, the socket wouldn't send to a load balancer. I'm thinking about building a queue system where all commands get put into the queue and then have separate workers to do the jobs for those commands. So there would still be one Errbot instance itself, but all the workers would be separate. Still not "HA" but better than nothing.

Kirk Bater @iamkirkbater Jan 30 2019 17:58 I guess in that case though you could use like the message ID or something and use that as a unique ID in the queue so you can't queue more than one job from the same command.

Sijis Aviles @sijis Jan 30 2019 18:05 ya. i was thinking that too. using the message Id as the identifier. Does it warrant the effort and complexity in adding this functionality? I'm not necessarily against it. I personally (and at work) do use errbot but if its unable to process a command, its absolutely annoying , but its not critical. I'm sure that's not the same in other orgs or maybe not?

Kirk Bater @iamkirkbater Jan 30 2019 18:13 Yeah, in ours we use it for self-service, so "everyone" in our org can do things like spin up infrastructure for new clients or for developers to test their changes, from other bots that run commands on this bot to spin up infrastructure to run tests on, etc. So 5 minutes of downtime isn't terrible, but we've run into complications in the past where we had to do something like rotate a slack token and we didn't communicate beforehand that there would be downtime so some people got upset. So even for us where this is "kind-pretty-critical" having a full HA setup hasn't been worth the effort so far, but it is something I want to do more and more because then we can even do things like Blue/Green deployments, etc and just have fully automated CI/CD pipelines with 0 downtime.

Sijis Aviles @sijis Jan 30 2019 18:25 ya, that understood. I wonder if could have a plugin that just captures the user's request and then does a requests.post(errbot_url_webook,....) to the plugin's webhook. In theory, that would only have 1 instance resond. ohh, it wouldn't cuz both instances would had picked up the initial request.

Carlos @nzlosh Jan 30 2019 21:09 Interesting comments, thanks for sharing. I maintain the err-stackstorm plugin. StackStorm being an event driven platform for automating infrastructure. The use cases for ChatOps tend to orient around infrastructure operations and CI/CD sort of tasks. As it stands today, the plugin err-stackstorm is a SPoF. I'd like to work on improving that, but there's little point in doing that if errbot itself is a SPoF. It sounds like you can get resilience to a single bot failure but there are functional trade-offs that may not be ideal as there's no way to guarantee the called tasks will be idempotent (for err-stackstorm's use cases at least). I'm just trying to get an idea of where the community/developers are at with the notion of built in HA and what priority/urgency there is around implementing it.

Andrew Herrington @andrewthetechie Jan 31 2019 00:40 @nzlosh sorry, not really. You would need to figure out your "HA" solution.

Our concept of HA is allowing Kubernetes to make sure we always have a running errbot instance. Minor downtime isnt that big of a deal.

There's been some good discussion today of possibilities. I'd suggest filing an issue requesting a HA or clustering feature for errbot. It might be something that someone else is passionate about and takes up.
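The message-ID idea from the chat log above can be sketched as a queue that accepts each message ID only once. This is a single-process illustration with hypothetical names, not Errbot code; with several bot instances the "seen" set would have to live in a shared store that offers an atomic set-if-absent operation.

```python
import queue
import threading


class DedupQueue:
    """Job queue that accepts each chat message ID at most once."""

    def __init__(self):
        self._seen = set()             # message IDs already queued
        self._lock = threading.Lock()  # guards the check-and-add on _seen
        self._jobs = queue.Queue()

    def submit(self, message_id: str, command: str) -> bool:
        """Queue the command; return False if this message was already queued."""
        with self._lock:
            if message_id in self._seen:
                return False
            self._seen.add(message_id)
        self._jobs.put((message_id, command))
        return True

    def next_job(self):
        """Return the next (message_id, command) pair; raises queue.Empty if none."""
        return self._jobs.get_nowait()
```

Two listeners that both see message `m1` would each call `submit("m1", ...)`, but only the first call queues a job, so a worker runs the command once.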

cognifloyd commented 4 years ago

So far, in the matrix chat for opsdroid, the consensus is that connectors (things that connect with websockets to the chat services) make the HA story difficult. Once the message is received, it would be fairly simple to introduce HA for skills:

lwhalen 13:39 I would think that it'd be someone straightforward to make a Skill HA. Opsdroid itself is pretty stateless. Have your Skill connect to a common Redis instance (or whatever) and have your multiple opsdroid instances handle events off the redis state. Maybe?

Jacob Tomlinson (Slack) 13:48 Sure. The event system would be quite straightforward. My concern is the connectors. Many of them work by polling APIs or by making websocket connections. These are harder to make HA.

lwhalen 13:52 if you containerize opsdroid and deploy it as a small job on a pre-existing cluster with an instance of '1', such that it restarts if/when the container dies, does it need to be HA? you'll have a few seconds of downtime when the container goes away and is re-deployed, but depending on your use-case, is it that terrible?

Jacob Tomlinson (Slack) That's how I run opsdroid.

edit: add this OpsDroid response

Jacob Tomlinson (Slack) 15:21 Yeah I don't think we have any intention of supporting ha at this time.

cognifloyd commented 4 years ago

An interesting post from lyft on Slack and HA:

For example, when creating a simple bot, the easiest choice is the RTM API, because getting started with it is quick, and it doesn’t require accepting web requests from the outside world. Unfortunately, if the bot needs slash command support or interactive components, it’ll be necessary to accept web requests from Slack. Also, if the bot needs to be highly available (HA), the RTM API won’t work, because only a single instance can be connected at a time to the RTM API. Changing from the RTM API to some of the other APIs is nearly a full rewrite, making this a painful early mistake.

https://eng.lyft.com/announcing-omnibot-a-slack-proxy-and-slack-bot-framework-d4e32dd85ee4

Lyft's solution for HA with Slack (yeah - slack only :frowning_face:) was creating https://github.com/lyft/omnibot.

So with errbot or OpsDroid or whatever, the piece that connects with the chat service would need to say whether or not it supports HA and implement the HA connection to the chat service in whatever idiosyncratic method is required for that service.

punkrokk commented 4 years ago

What is the technical limitation to realizing HA? Is this something that could be resolved with Network hot-cold, service-discovery or anything like that? Is it just a core design decision of all the different options?

cognifloyd commented 4 years ago

As far as I understand, the issue with implementing HA is chat service API limitations. ie - there can only be one consumer for a websocket connection, so anything that uses a websocket instead of webhooks (where webhooks could easily be load balanced) is inherently a SPoF.

You could have multiple instances each with a websocket, but then (at least for slack) you'd need to have separate slack tokens for each to be able to have them connected at once, so they'd end up being separate bots.

Assuming you can get around the multiple bots connected at once issue, multiple bots listening for the same thing will both get the chat messages that will trigger commands/skills => st2 actions. Both ErrBot and OpsDroid say that they don't have a good solution for how to make sure that each message is only handled once.

So, whichever piece connects to the chat service often ends up being a "singleton" of sorts. I think some kind of hot/cold solution would make the most sense for this. But, then we need to ask: what is the speed difference between (a) starting a new chatops container if a container "dies", and (b) in a hot/cold scenario how long does it take an existing container to connect to the chat services?

I was surfing through the OpsDroid code today, and there's a method to reload all connectors and skills in a running OpsDroid instance. We could start 2 (or more) containers in hot/cold, where the config for the cold one has all connectors disabled, and then as soon as the hot container goes down a reload gets issued to the cold one to get it to load the hot config. To achieve that right now, I think we'd need two config files, one hot and one cold, and then a symlink pointing to the active one. Changing the symlink is quick, and then trigger the reload. Everything but the chat connections would already be loaded thus minimizing the down time.

Doing this wouldn't require any special support in OpsDroid, and it would probably be minor enough that we could contribute it back.

Also, if there's a reload option we might be able to do something similar in ErrBot. Or even, start the cold container with the chatops service stopped, only starting it when necessary (like when the primary chat container dies).

I'm not familiar enough with service-discovery to comment on how that might fit with the existing implementations of ErrBot and OpsDroid.

nzlosh commented 4 years ago

Just to confirm, Errbot does have the means to enable/disable plugins. It will reload the backend if it detects a disconnection. So all the methods are in the framework that would allow managing backend/plugin state based on the notion of an "active" bot.

I'm in favour of a software solution using some sort of clustering protocol implemented in the core of the bot rather than imposing containers as an HA solution. The choice of containers is an architecture decision that should be left as a choice by the end user. I was thinking the raft consensus protocol could be a good fit to help manage the election of the active bot. The active bot would bring up plugins/backends and tear them down when the bot was no longer active leader. A bot and by extension code executed by the bot that maintained state would need to use a common store. Some sort of clustered storage like consul/etcd/zookeeper for example. As far as I know, redis is a single instance server so wouldn't be a good fit for HA.
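As a rough illustration of the "active bot" idea, here is a sketch that uses a simple time-limited lease rather than a full Raft implementation. The lease store is an in-memory stand-in for a clustered store, and `InMemoryLeaseStore`, `Bot`, `tick`, `activate`, and `deactivate` are all hypothetical names, not Errbot APIs.

```python
import time


class InMemoryLeaseStore:
    """Hypothetical stand-in for a clustered store holding one leader lease."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._holder = None
        self._expires_at = 0.0

    def acquire_or_renew(self, bot_id, ttl):
        """Grant the lease if it is free, expired, or already held by bot_id."""
        now = self._clock()
        if self._holder in (None, bot_id) or now >= self._expires_at:
            self._holder = bot_id
            self._expires_at = now + ttl
            return True
        return False


class Bot:
    def __init__(self, bot_id, store, ttl=10.0):
        self.bot_id = bot_id
        self.store = store
        self.ttl = ttl
        self.active = False

    def tick(self):
        """Call periodically; switches role when leadership changes."""
        is_leader = self.store.acquire_or_renew(self.bot_id, self.ttl)
        if is_leader and not self.active:
            self.activate()
        elif not is_leader and self.active:
            self.deactivate()
        return self.active

    def activate(self):
        # Placeholder: bring up chat backends and plugins here.
        self.active = True

    def deactivate(self):
        # Placeholder: tear down chat backends and plugins here.
        self.active = False
```

Each bot would call `tick()` on an interval shorter than the lease TTL. A standby takes over only after the leader stops renewing, so failover time is bounded by the TTL. A real implementation would need the store itself to make the acquire step atomic across processes.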

arm4b commented 4 years ago

https://github.com/opsdroid/opsdroid/issues/299 is the clustering design discussion around OpsDroid. Healthcheck endpoints (https://github.com/opsdroid/opsdroid/issues/193) are also an important part of making that possible.

I think the overall HA state behind both OpsDroid and errbot is clear. There are no HA primitives in any of them, same as hubot.

arm4b commented 4 years ago

Getting back to this discussion as a follow-up.

Thanks to everyone for their research and insight! I think we have enough useful technical information here about the new ChatOps.

What would be helpful next is preparing and launching a ChatOps User Survey to learn more from our users: who they are, how they actually consume chatops, what functionality they rely on, what blocks their adoption, what they need, and so on. With that, we can raise the question of the proposed new st2chatops framework.

This will make our future ChatOps decisions better informed and our planning better aligned with user data.

cognifloyd commented 4 years ago

One more technical thought: The chatops framework has a similar issue with HA as sensors. In a way, chatops is a specialized sensor. So, whatever we come up with to solve StackStorm/st2#4301 might be reusable for either errbot or OpsDroid (they are python after all).

On the ChatOps User Survey, I remember doing a StackStorm survey some time ago. What was the process for creating that? What is the next step? Draft survey questions?

arm4b commented 4 years ago

We had yearly StackStorm User Survey here: https://stackstorm.com/2019/01/30/2018-year-in-review-2019-stackstorm-user-survey/ (https://www.surveymonkey.com/r/st2-2019-user-survey)

Drafting ChatOps questions would be a good first step. After that it's best to work with @blag around this effort.

blag commented 4 years ago

I'm not convinced that a ChatOps user survey would generate much actionable information for us when deciding on this particular proposal. Previous user surveys have largely focused on what features users would like to see, and this proposal is focused on what should be considered an implementation detail for our users. As such, I don't think it would be highly useful for us to create a ChatOps user survey to ask our users if/how they want ChatOps reimplemented. And if we're going to ask them about what features they would like to see, we already have the answer from the previous user survey: the most highly requested feature is RBAC for ChatOps [1]. Other feature requests are support for inquiries [2] and conversation/thread support [3].

So for this proposal, we should investigate whether any proposed alternatives can reasonably support the current features of hubot-stackstorm, and whether any of them can reasonably be extended to support those additional three features. Any and all other feature requests beyond what hubot-stackstorm currently supports are outside the scope of this discussion for now, in my opinion.

The bulk of my proposal boils down to two points:

  1. It is difficult to continue supporting, troubleshooting, and developing a Node.js-based st2chatops, as the entire TSC is primarily Python based.
  2. st2chatops is increasingly difficult to develop as a non-integrated microservice, since the way we have decided to implement some highly requested features (like ChatOps RBAC and inquiries) would be much easier to write if ChatOps was a normal ST2 microservice with full access to shared state and ST2 configuration.

All other considerations (wider support for chat providers, improved support for user interactivity, etc.) are orthogonal to this proposal.

One additional concern that I haven't yet addressed is our users who package their own ChatOps bots and use hubot-stackstorm directly. Even if we adopt my proposal to switch 100% to a Python-based bot framework, the existing st2client.js, hubot-stackstorm, and st2chatops repositories will remain open source. I believe that we can end further development of them immediately and put those projects into "maintenance mode", where we only accept bugfix pull requests. Those users can then switch to our new ChatOps functionality (whatever that is) when they reach end-of-support status, or they can fork and continue development themselves.

[1] I have proposed a mechanism for ChatOps RBAC based on the implementation in err-stackstorm, but the decision was to use something more aligned with configuration-as-code and more tightly integrated with the RBAC plugin. ChatOps RBAC really boils down to ChatOps authentication + extending StackStorm's RBAC backend to handle ChatOps-specific roles for the chat user. ChatOps RBAC without ST2 RBAC would be exceptionally difficult to implement, so this feature would be restricted to EWC customers only.

[2] The code for handling inquiries already exists in hubot-stackstorm, but it's in a very...early...state (read: it technically works, but not well, and it would be difficult to get it to work well). Additionally, we don't have end-to-end tests for it, or comprehensive documentation either, so it's a mostly dormant feature. If I was reimplementing this in another bot, I would definitely try to implement it in a way that didn't make HA more difficult. If st2chatops was packaged alongside the rest of the ST2 microservices, and especially if it could reuse st2client or write to Mongo or Redis directly, this would be a lot simpler to implement.

[3] Conversation/thread support is not well supported by all chat providers, although this situation is also changing, or has possibly already changed. This makes the "abstraction layer" to the various chat providers that st2chatops is attempting to be a very "leaky" abstraction. So either we need to improve support for this in StackStorm itself (this could be as simple as disabling that option in StackStorm if a non-compliant chat provider is configured), or we need to emulate this feature for non-compliant chat provider adapters.

arm4b commented 4 years ago

I don't think we decided yet whether we'll rely on old hubot-stackstorm, full-feature ErrBot or next-gen OpsDroid.

If we make a big decision like changing the engine behind the chatops framework, I think it's best to have a better picture of the current chatops community: who they are, how they actually consume chatops, what functionality they rely on, what the blockers and biggest pain points in their adoption are, what's missing, and what they need, along with random ideas, opinions and so on. This will also generate more traffic and points of view in this thread. Every User Survey brings some surprising data we never thought of.

I'm OK with chatops requiring a change, and I'd read it as a full restart for the chatops project. However, I'd like to make sure we make educated decisions based on sufficient data and with a good understanding of the entire chatops domain. This will not just help in making better decisions, but also help with how to plan, what to focus on in the future and where a more careful approach is required.

blag commented 4 years ago

> I don't think we decided yet whether we'll rely on old hubot-stackstorm, full-feature ErrBot or next-gen OpsDroid.

True, we haven't reached a conclusion yet, sorry if I wasn't clear on this point. I'm just trying to constrain this discussion to two questions:

  1. Can/should we switch ChatOps to a more tightly integrated, Python-based service?
  2. Which bot framework should we switch to?

Any discussions about what future features we'd like to add are largely tangential to those two subjects, and I think it would be appropriate to section that off into a separate discussion/proposal.

> If we make a big decision like changing the engine behind the chatops framework, I think it's best to have a better picture of the current chatops community: who they are, how they actually consume chatops, what functionality they rely on, what the blockers and biggest pain points in their adoption are, what's missing, and what they need, along with random ideas, opinions and so on. This will also generate more traffic and points of view in this thread. Every User Survey brings some surprising data we never thought of.

I think the vast majority of our existing ChatOps users use ST2-flavored ChatOps via the aliases mechanism. I think we have very few - if any, at this point - users rolling their own st2chatops using our hubot-stackstorm plugin. And given how few outside patches we've received for hubot-stackstorm, I'm not feeling particularly charitable to anybody who wants to complain that we're changing too much.

> This will not just help in making better decisions, but also help with how to plan, what to focus on in the future and where a more careful approach is required.

This approach is perfectly normal for a paid-for product, but since StackStorm has transitioned to the LF, it's now less of a product and more of a project. And with open source projects, it isn't the customers who drive change - because there are no paying customers - it's the developers. And if users want to change the project, or prevent a specific change from happening, they have to get themselves involved in the development process. As such, I would expect to see those people giving their opinion in this thread, and so far both @nzlosh and @cognifloyd have been supportive of switching. But if we have users who don't want this change to happen, they should already be involved in the project, and this discussion. I'm skeptical of any users who think they care about this implementation detail but aren't already involved in this discussion.

Transitioning to using a different ChatOps backend largely shouldn't change how users' ChatOps aliases function, so that abstraction layer should still work the same as it always has. For more "off the beaten path" users who are rolling their own ChatOps, if they want to see ST2 ChatOps grow, develop, and evolve with the current Node.js implementation, then they should have already been involved. The fact that we really aren't seeing any pushback from those users tells me that they don't actually care about this, or they at least don't care as much as we seem to think they do.

Either way, it's ST2 developers, maintainers, and contributors who get to chart the course of the project. End users who want a say in an open source project can get involved in those exact ways, but I don't think they get to dictate how this project grows if they aren't actively involved in the development or at least funding the project on some level.

I have proposed questions for the ChatOps User Survey in matrix-org/matrix-spec-proposals#19. Let's discuss that there, run the survey, and then discuss back here once we have the results from that.

arm4b commented 4 years ago

> And with open source projects, it isn't the customers who drive change - because there are no paying customers - it's the developers.

Right. However, engineers driving decisions doesn't mean they should think only about the code and not about users. It means taking the users' perspective into account: feature parity, consistency, usability, the overall experience, and how it all fits with the whole platform. I think that's how StackStorm engineers approached it before, and it wasn't because of the customers.

> And if users want to change the project, or prevent a specific change from happening, they have to get themselves involved in the development process.

That's the dream. The reality is that more pain points in the software lead to people simply no longer using it. I think we maintainers, as the people who drive the project, should be in sync with the community: ask questions, work with them, know their needs, issues and blockers. You make your users more successful and the project becomes more popular. Everyone wins. Otherwise it'll lead to weird bloatware that solves problems for just the few people contributing to it.

arm4b commented 4 years ago

Overall I'd like to make sure that we know and research the ChatOps domain in full before 3 engineers initiate the platform reset (I tend to think about it this way). And I believe this is just the start, so how can you lead it if you don't see the big picture?

I believe that more data, research, and a better understanding of the community, use cases, pain points, and the chatops experience in general will drive better decisions, including on transition and implementation. Otherwise we're navigating blind. You may also find something that you don't know yet.

And yes, thanks for starting the discussion with the Survey Brainstorming at matrix-org/matrix-spec-proposals#19! Absolutely awesome stuff! :+1:


Adding to the point about navigating blind: how do we know whether the decisions we make are good if we're not even using chatops ourselves at StackStorm today?

@blag @nzlosh @cognifloyd Is this the right time to resurrect the st2 ChatOps instance in the Slack community?

cognifloyd commented 4 years ago

> Is this the right time to resurrect the st2 ChatOps instance in the Slack community?

There was an instance in the community? What things could it do? @nmaludy's beertab? :smile:

I'm :+1: on having a community st2 ChatOps bot. It should be named "stanley" :wink:

jacobtomlinson commented 4 years ago

I just wanted to drop in and say hi, I'm the creator of Opsdroid and I really appreciate you considering using it. This has been a really interesting read.

We are very open to input and discussions about how we can improve the framework. One of the main things we struggle with is that there are so many things we want to do on the project, and being an open source project we have limited time and don't have the visibility into our users' behaviour to prioritise accurately.

So dropping into our issues or Matrix chat to give us a nudge in a certain direction is much appreciated. Contributions are also very welcome.

We have four maintainers who look after the project in our spare time (although I just had a second child so am a little less active than normal and probably will be for the next few months).

I also just updated our roadmap to show that "Advanced messaging client events like user login should be available" is checked. We support non-chat events in a number of connectors and are keen to add more.

Finally, I wondered if I could pinch the comparison matrix that @cognifloyd put together for our documentation. It's really helpful for us and our users to see a comparison like that. It also gives us some help with prioritisation to ensure we are comparable to the alternatives.

cognifloyd commented 4 years ago

@jacobtomlinson go for it :) Note that the leftmost column is specifically about ST2 + Hubot, not hubot alone. That's relevant for the discussion here, but probably not in the OpsDroid docs. If you want to include Hubot, then I would go evaluate hubot on its own and replace that column.

jacobtomlinson commented 4 years ago

Thanks will do!

It would also be interesting to create a comparison matrix of other features not related to chat integrations. I'm thinking of things mentioned in this issue, like support for NLP/NLU services, non-chat events such as users joining and leaving rooms/channels, and support for interactive elements like Slack blocks.

As users external to the frameworks you are discussing I would like to hear your thoughts on what those features could be and how valuable they are to you?

arm4b commented 3 years ago

StackStorm ChatOps Plans Follow-up

The ChatOps User Survey (https://github.com/StackStorm/discussions/issues/48) uncovered some inspiring data and answers, as well as a lot of supporters from the community willing to help move the platform to the new Python rails.

@blag it would be great to plan and organize a StackStorm ChatOps open meeting under your leadership and invite everyone interested (@StackStorm/tsc, @StackStorm/contributors, and others who offered to help) to discuss the direction, efforts and plans on the StackStorm ChatOps side, similar to what we do with the TSC Meetings.

Let me know how I can help :wink:

cognifloyd commented 3 years ago

Following up, OpsDroid has a (WIP) functional connector that adds support for MS Teams: https://github.com/opsdroid/opsdroid/pull/1679

It does use BotFramework, but only to grab its internal API SDK. It doesn't have the rest of BotFramework's baggage.

cognifloyd commented 3 years ago

I just scheduled a get-to-know-you meeting for OpsDroid and StackStorm devs. Please join us on March 11th.

Meeting Link: https://meet.jit.si/stackstorm-opsdroid
Date: 11 March 2021
Time:

Agenda Topics (a general/flexible outline of possible topics based on what attendees want to cover):

cognifloyd commented 3 years ago

I just edited my chart above: Basic Teams Support was just merged into OpsDroid https://github.com/opsdroid/opsdroid/pull/1679 Progress!

cognifloyd commented 2 years ago

OK. Here is my initial plan for integrating OpsDroid and StackStorm. I put it in Google Docs and opened it up so anyone can comment on the doc. Feedback welcome!

https://docs.google.com/document/d/1ycV05-sUwd0Q27ZVjPQAq5nLuUJM_gzaBWOynwdXDJ8/edit?usp=sharing