SpineEventEngine / config

Dependencies and build configurations shared among subprojects
https://spine.io
Apache License 2.0

Escalate failed `master` builds via Google Chat #403

Open armiol opened 4 years ago

armiol commented 4 years ago

Currently, there is no escalation mechanism for the case when a CI build fails on the `master` branch.

In order to notify the team members of such events, a Google Chat bot should be implemented.

Supported functionality

  1. Every hour, the bot should review the status of the latest builds in the master branches of the Spine libraries:
     1. If the latest master branch build has failed, the bot should create a new thread in the "Spine Developers" room in Google Chat. The message should list
     2. If the build is not fixed within an hour, the bot sends the corresponding message to the same thread again.

  2. Once the build is fixed, the bot notifies the group via the same thread that the issue is now resolved.
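
The hourly flow above can be read as a small state machine per repository. The following is only a rough sketch of that reading, not the bot's actual design: the `Action` names and both functions are hypothetical, and the real bot would obtain build statuses from the CI service and post to Google Chat instead of returning values.

```python
from enum import Enum
from typing import Iterable, List


class Action(Enum):
    """What the bot does after one hourly check (hypothetical names)."""
    NONE = "none"                # build is green and no thread is open
    OPEN_THREAD = "open_thread"  # first detected failure: start a new thread
    REMIND = "remind"            # still failing an hour later: post to the same thread
    RESOLVE = "resolve"          # build fixed: announce it in the same thread


def next_action(thread_open: bool, build_failed: bool) -> Action:
    """Decide the action for a single hourly check of one master branch."""
    if build_failed:
        return Action.REMIND if thread_open else Action.OPEN_THREAD
    return Action.RESOLVE if thread_open else Action.NONE


def run_checks(hourly_failures: Iterable[bool]) -> List[Action]:
    """Replay a sequence of hourly results (True means the build failed)."""
    thread_open = False
    actions = []
    for failed in hourly_failures:
        action = next_action(thread_open, failed)
        actions.append(action)
        # A thread stays open from the first failure until the fix is announced.
        if action is Action.OPEN_THREAD:
            thread_open = True
        elif action is Action.RESOLVE:
            thread_open = False
    return actions
```

Keeping the decision in a pure function makes the "same thread" rule easy to check without touching the chat or CI APIs.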

Hard reset

As an optional step, the bot may be configured to hard-reset `master` to the latest commit that built successfully. Such a drastic measure may be taken if the build is not fixed within 12 hours of the initial discovery of the failure.

However, such an operation is risky, as commit history may be lost. The hard reset option therefore needs further discussion, perhaps including a repository backup procedure (i.e. an upload of the whole repository, along with its history, to Google Storage).
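
To make the discussion more concrete, here is a minimal sketch of how the 12-hour rule and a backup-before-reset order could look. It only builds the command lines and does not run them. The function names, the bundle path, and the bucket URL are all hypothetical. Using `git bundle` for the backup and `gsutil cp` for the upload is an assumption, not something decided in this issue.

```python
from datetime import datetime, timedelta
from typing import List

# Threshold taken from the issue text: 12 hours since the failure was first seen.
RESET_AFTER = timedelta(hours=12)


def hard_reset_due(first_failure_at: datetime, now: datetime, enabled: bool) -> bool:
    """True if the optional hard reset is enabled and the build stayed broken long enough."""
    return enabled and (now - first_failure_at) >= RESET_AFTER


def reset_plan(last_green_sha: str, bundle_path: str, bucket_url: str) -> List[List[str]]:
    """Build the commands in a safe order: back up first, rewrite history last.

    Assumes the commands run inside a full local clone of the repository.
    """
    return [
        # Pack all refs and their history into a single bundle file.
        ["git", "bundle", "create", bundle_path, "--all"],
        # Upload the bundle to a storage bucket (hypothetical destination).
        ["gsutil", "cp", bundle_path, bucket_url],
        # Force the remote master to point at the last successfully built commit.
        ["git", "push", "--force", "origin", f"{last_green_sha}:refs/heads/master"],
    ]
```

Returning the plan as data lets a human review or log it before anything destructive is executed, which seems fitting for an operation this risky.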

Implementation

The chat bot application should be written in Spine. Once done, it will be hosted on Google Cloud.

The codebase for the chat bot should reside in a separate SpineEventEngine/chat-bot repository.

alexander-yevsyukov commented 4 years ago

@armiol, are you happy with the ChatBot implementation? Can we close this?

alexander-yevsyukov commented 2 years ago

@armiol, I haven't heard from ChatBot recently. Are we that accurate with builds, or is ChatBot dead?

armiol commented 2 years ago

@alexander-yevsyukov That's because we don't have any broken builds on Travis. We are pretty accurate with that.

We'll have to discuss how to act in case one of the GH Actions runs fails. For instance, our Windows-based builds aren't very stable, and what we really need for them is to restart the run first, and to report only if it fails again.
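
A rough sketch of the "restart first, then report" idea, just to anchor the discussion. The `rerun` and `report` callables are hypothetical stand-ins for whatever would restart a GH Actions run and post to the chat. Nothing here reflects an existing ChatBot API.

```python
from typing import Callable


def handle_failed_run(
    rerun: Callable[[], bool],
    report: Callable[[str], None],
    max_restarts: int = 1,
) -> bool:
    """Restart a failed run before escalating.

    `rerun` restarts the run and returns True if it passed.
    `report` is called only when every restart has failed as well.
    Returns True if some restart passed, so no report was needed.
    """
    for _ in range(max_restarts):
        if rerun():
            return True
    report("Build is still failing after restart")
    return False
```

With `max_restarts=1`, a flaky Windows run that passes on the second attempt stays silent, while a run that fails twice gets reported.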