[BUG] Slack Terraform CI automation timeouts

smoya commented 5 months ago

Describe the bug.

@Shurtu-gal did an excellent job with automating the creation and maintainability of AsyncAPI Slack channels and user groups. See https://github.com/asyncapi/community/issues/1072

However, we faced a blocker issue that makes the Terraform manifest to fail due to timeouts requesting Slack API.

The TF provider is not optimized at all. I have the feeling this code is being executed per each managed Usergroup whenever TF wants to refresh its state: https://github.com/pablovarela/terraform-provider-slack/blob/master/slack/resource_usergroup.go#L108-L128, so potentially we are calling the usergroups.list API method on each usergroup we have.

Expected behavior

I believe we could do some work on the provider repo (it's written in go, seems easy to read and understand at a glance) so we can implement some caching or whatever mechanism we decide. But in short term, I can't see how to fix it.

Screenshots

iTerm2_f9He0j7z

How to Reproduce

terraform apply with the proper Slack Token configured (ask @derberg or me)

🥦 Browser

None

👀 Have you checked for similar open issues?

[X] I checked and didn't find similar issue

🏢 Have you read the Contributing Guidelines?

[X] I have read the Contributing Guidelines

Are you willing to work on this issue ?

None

derberg commented 5 months ago

probably more issues are there - we just merged PR with new WG

https://github.com/asyncapi/community/actions/runs/9446779045/job/26017154335

smoya commented 5 months ago

https://github.com/asyncapi/community/actions/runs/9446779045/job/26017154335

I don't see if this is the issue but what I see is that both the channel and the group have the same handle wg_marketing and that's completely incompatible as mentioned in the header comment of the file:

The handle should be unique and not in use by a member, channel, or another group.

Shurtu-gal commented 5 months ago

The issue is with invalid yaml. If a string has colon it needs to be double quoted. Can be seen here :https://github.com/asyncapi/community/blob/c9ebbc2655ee7ee6117a73cfc40d0c0c1eccaedc/WORKING_GROUPS.yaml#L31

cc: @smoya @derberg

smoya commented 5 months ago

The issue is with invalid yaml. If a string has colon it needs to be double quoted. Can be seen here :

https://github.com/asyncapi/community/blob/c9ebbc2655ee7ee6117a73cfc40d0c0c1eccaedc/WORKING_GROUPS.yaml#L31

cc: @smoya @derberg

Yup, fix is here https://github.com/asyncapi/community/pull/1251

derberg commented 5 months ago

oh thanks, I suggest we need a workflow like https://github.com/asyncapi/community/blob/master/.github/workflows/validate-maintainers.yml#L12-L43 with json schema that we validate against - as these issues will pop up regularly, and with JSON schema you can do lots of validation cases, even pattern validation.

derberg commented 5 months ago

regarding timeouts

workflows have option to react of failure? are we able to parse error in such step, figure it is timeout and retry?

other than that, minimum we can do is drop error in slack, that someone needs to rerun the job, we support such things already - we can have custom message that tags certain people for example

smoya commented 5 months ago

workflows have option to react of failure? are we able to parse error in such step, figure it is timeout and retry?

You can retry as many times you want that it will keep failing. As stated in the description of the issue:

so potentially we are calling the usergroups.list API method on each usergroup we have.

We have more user groups than the API rate limit allows per minute (20 calls). See Google Chrome_OioCRhAF

The issue is that the TF provider seems to be doing one call to such API per group instead of just one for getting all of them (pending to be confirmed but 95% convinced)

derberg commented 5 months ago

You can retry as many times you want that it will keep failing. As stated in the description of the issue:

sorry, that wasn't clear for me. So basically it means automation will always fail atm?

btw - it fails for different reason here https://github.com/asyncapi/community/actions/runs/9464298763/job/26071337562

and what about GitHub teams automation?

smoya commented 5 months ago

btw - it fails for different reason here https://github.com/asyncapi/community/actions/runs/9464298763/job/26071337562

I don't understand such an error. In fact I can't reproduce the same state as in our CI even though the tfstate file is the same. That's weird... @Shurtu-gal any idea? I expect terraform plan in master branch to have the same plan as in the link @derberg shared, but it's not the case in my local env

Examples of things my terraform plan says:

Terraform planned the following actions, but then encountered a problem:

  # module.channels.slack_conversation.channels["01_introductions"] will be updated in-place
  ~ resource "slack_conversation" "channels" {
      + action_on_destroy                  = "archive"
      + action_on_update_permanent_members = "none"
      + adopt_existing_channel             = true
        id                                 = "C023GJWH33K"
        name                               = "01_introductions"
        # (10 unchanged attributes hidden)
    }

  # module.channels.slack_conversation.channels["02_general"] will be updated in-place
  ~ resource "slack_conversation" "channels" {
      + action_on_destroy                  = "archive"
      + action_on_update_permanent_members = "none"
      + adopt_existing_channel             = true
        id                                 = "C34F2JV0U"
        name                               = "02_general"
        # (10 unchanged attributes hidden)
    }

smoya commented 5 months ago

bounty/candidate

Shurtu-gal commented 5 months ago

@smoya checked for various stuff:

Bot account may not be in channel anymore: It is there in 02_general so cannot be the reason.
ID may have changed: That's also not the case.
Finally maybe the app has been uninstalled or deactivated. https://stackoverflow.com/questions/75432078/error-an-api-error-occurred-account-inactive

@derberg you would need to check both the bot-token in secret as well as the app maybe.

smoya commented 2 months ago

It seems there is a fork of the terraform provider that handles timeouts when creating groups. See https://github.com/pablovarela/terraform-provider-slack/pull/223#issuecomment-2179579860

asyncapi / community