department-of-veterans-affairs / va.gov-cms

Editor-centered management for Veteran-centered content.
https://prod.cms.va.gov
GNU General Public License v2.0

Replicate CMS & Next Build communication on CMS side #16851

Open timcosgrove opened 10 months ago

timcosgrove commented 10 months ago

Description

When certain content types are saved in the CMS, they trigger a Content Build content release, regardless of the time of day or day of the week. This is to allow certain timely content to reach va.gov quickly outside of the normal content release schedule.

Next Build should also follow this pattern so that timely content updates are posted quickly.

Details

This functionality is handled by the CMS Drupal module va_gov_content_release: https://github.com/department-of-veterans-affairs/va.gov-cms/tree/main/docroot/modules/custom/va_gov_content_release

This will need to be augmented to handle Next Build content release in addition to Content Build.

Requirements

We want the CMS to manage triggering of Next Build content releases, so that the mechanism matches Content Build.

Acceptance criteria

  - [ ] https://github.com/department-of-veterans-affairs/va.gov-cms/issues/17022
  - [ ] How can we verify the requirements have been met factually/quantifiably?

Background & implementation details

The current Content Build implementation of this is managed by these custom Drupal modules:

The rough outline of what happens is:

  1. The CMS watches for a state variable to be in the state ready
  2. If the state is ready, it is changed to requested and an API request is made to GitHub to start the workflow.
  3. Early on in the workflow, the workflow in turn makes an API request back to the CMS to confirm it is started, and the state becomes in progress.
  4. When the workflow completes, success or failure is reported back to the CMS with an API call.
  5. This is the trigger for the CMS to move the state to ready, which triggers the process again.
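
The cycle in the list above can be sketched as a small transition table. This is only an illustration in Python, not the module's code (the CMS side is Drupal/PHP); the `ReleaseState` enum, the event names, and `next_state` are hypothetical names chosen for this sketch.

```python
from enum import Enum


class ReleaseState(Enum):
    READY = "ready"
    REQUESTED = "requested"
    IN_PROGRESS = "in progress"


# Allowed transitions, keyed by (current state, event). The event names are
# hypothetical labels for the API calls described in the list above.
TRANSITIONS = {
    (ReleaseState.READY, "cms_requests_release"): ReleaseState.REQUESTED,
    (ReleaseState.REQUESTED, "workflow_confirms_start"): ReleaseState.IN_PROGRESS,
    (ReleaseState.IN_PROGRESS, "workflow_reports_success"): ReleaseState.READY,
    (ReleaseState.IN_PROGRESS, "workflow_reports_failure"): ReleaseState.READY,
}


def next_state(state, event):
    """Return the state after `event`, or raise if the event is not valid here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.value!r}") from None
```

Both success and failure lead back to ready, which is what makes the process circular: every completed run re-arms the next request.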

The implementation of this circular release management should be replicated or extended so that Next Build Content Releases are managed independently of Content Build Content Releases.


Proposed resolution by @alexfinnarn:

Background

While reviewing the current communication between the CMS and the content-build GH workflow, I determined that the calls back to Drupal from GitHub are unnecessary. The queue and state machine in Drupal are not needed either. Coupled with the work done in https://github.com/department-of-veterans-affairs/va.gov-cms/issues/17209, this means that the queue and state machine code can be removed entirely, simplifying the codebase and making it easier to maintain.

Instead of a back and forth between the GH workflow building content and Drupal, the communication can be one-way from Drupal to GH. The continuous build can be triggered from Drupal or GH, and the "out-of-band releases" can be triggered from a Drupal form.

Pros:

Cons:

How will this work?

Continuous Build

Script code: https://github.com/department-of-veterans-affairs/va.gov-cms/pull/17991/files#diff-25983c2c72210bfdda46b5543864aa3bf89c459a52f2c93d98240e70a0c1d93f

  1. A script runs in the background to check if a request should be made to GH to start a build.
  2. The script first checks if it is during business hours, code taken from the trait RunsDuringBusinessHours::isCurrentlyDuringBusinessHours().
  3. If outside of business hours, the script exits.
  4. If within business hours, the script pings GH to see if a build is running.
  5. If a build is running the script exits.
  6. If a build isn't running, ping GH to start a build.
  7. Logs are sent wherever they need to be sent based on the checks.
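
The decision the script makes in the steps above can be sketched as follows. This is a minimal Python illustration, not the script from the PR; `is_business_hours` and `next_action` are hypothetical names, and the weekday 8am to 8pm window is the one this thread attributes to RunsDuringBusinessHours. The caller is assumed to pass a datetime that is already in America/New_York time.

```python
from datetime import datetime


def is_business_hours(local_now: datetime) -> bool:
    """Weekday from 08:00 up to 20:00; `local_now` must already be Eastern time."""
    return local_now.weekday() < 5 and 8 <= local_now.hour < 20


def next_action(local_now: datetime, build_running: bool) -> str:
    """Mirror the script's checks and return what it would do next."""
    if not is_business_hours(local_now):
        return "exit: outside business hours"
    if build_running:
        return "exit: build already running"
    return "start build"


# Monday afternoon with no build running.
print(next_action(datetime(2024, 4, 22, 13, 48), build_running=False))  # start build
```

In a real script `build_running` would come from a GitHub API query and the returned string would be logged before acting on it.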

Note: The process can live entirely on GH in a workflow as an alternative to Drupal making API calls. See https://github.com/department-of-veterans-affairs/va.gov-cms/issues/16851#issuecomment-2077804382 for how that works.

From docs:

Two known error states exist and are handled:

  1. The CMS doesn't receive any status notifications from the GHA workflow. In this case, the state is considered stale after 40 minutes and will be reset so that another release can be kicked off.
  2. The CMS is notified by GHA that a release failed. In this case, the release state in the CMS is reset and a new release will be requested as a retry.
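
As a rough illustration of the two documented error states, the reset condition could look like this in Python. `should_reset` and `STALE_AFTER` are hypothetical names, and treating exactly 40 minutes as stale is an assumption of this sketch, not something the docs specify.

```python
from datetime import datetime, timedelta

# Timeout taken from the docs quoted above.
STALE_AFTER = timedelta(minutes=40)


def should_reset(last_status_at: datetime, now: datetime, failure_reported: bool) -> bool:
    """Reset when GHA reported a failure or nothing was heard for 40 minutes."""
    return failure_reported or (now - last_status_at) >= STALE_AFTER
```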

It might be useful to add a specific GH workflow for these errors rather than solely relying on the CMS to trigger things. In my tests, the continuous GH content build workflow kicks off within 10 minutes of the last one completing.

OOB release

Form code: https://github.com/department-of-veterans-affairs/va.gov-cms/pull/17991/files#diff-3d3bee748cdbebe5f5a0a8256fb4dcdfa5e486d77d64bf85c6cd7256e4649e68

  1. User goes to /admin/content/deploy/check-status (or a better sounding route) to check the status of builds.
  2. User sees build information like in the screenshot below.
  3. If it is outside of business hours and a build isn't running, the user will have the choice to fill in the form and submit.
  4. User checks the agreement checkbox understanding what will be published.
  5. An API call is sent to GH to start the build.
  6. User refreshes the page (can make auto-refresh if wanted) to see the build has kicked off.
  7. User checks back later to see if the build reports "completed" and checks the live site to see content is published.

Screenshot 2024-04-22 at 1 48 27 PM

As far as I know, this covers the current functionality in code and in the docs. The content release workflow should also have the ability to be workflow_dispatch called within GH for any dev, but that is already in the next-build workflow.

Notes about current docs

Content releases can be requested in one of four ways:

  1. Automatically via a self-managed schedule in the CMS
  2. Automatically when some types of content are edited
  3. Manually when an editor requests a content release
  4. Manually when a member of the devops or release tools team requests a build

That is from the cms-content-release.md readme, but "Automatically when some types of content are edited" never really happens. There's always an item in the content-release queue from the continuous build making any queue item for an individual node pointless...plus, it looks like the...

Ugg, I'm not following this code around anymore...you spin me round, round, baby, wrong round...The docs are outdated for the EntityEventSubscriber function, and I stopped after finding ContentReleaseTriggerTrait in the va_gov_content_types module. With several modules dedicated to the content release, why put code in that module? So there could be some code that does trigger an OOB release, but good luck tracing through the code to figure it out.

Remaining Questions?

  1. Can Datadog GH integration be used? There are metrics calculated that Datadog could simply ingest vs. needing to calculate them in a GH workflow run.
  2. Does any content update trigger a release after hours? That is not accounted for in this plan, but would only be important after hours...and if someone is making the change, can't they simply submit the content release form at the same time they update content? Probably so.
  3. Should the cms-content-release.md readme be updated? It seems out of date mentioning a class that has been refactored and moved to a different module.
  4. Can the continuous build portion take place entirely on GH with two Workflows? One is the content-release.yml workflow and the other workflow runs checks and calls the release workflow. GH allows for checks every five minutes but realistically this takes 5 - 10 minutes for the runners to kick off. API calls or manual intervention can enable/disable the workflows.
  5. Would adding the "environment" to the build request form help? It could be locked on prod to prod or locked for certain user roles, but this reduces the need for a build type to be in settings, e.g. $settings['va_gov_frontend_build_type'] and $settings['github_actions_deploy_env'].

alexfinnarn commented 7 months ago

My first question with this ticket is: who are the CMS status notices meant for?

The build could happen in the background without people using the CMS needing to be aware of the build. So, I think the CMS knowing the state of the content build is only for someone wanting to click the button to schedule a release, and that is only important outside of business hours since otherwise, the build is continuous.

I think the actual workflow for building the current content is located here: https://github.com/department-of-veterans-affairs/content-build/blob/main/.github/workflows/content-release.yml

In the va.gov-cms codebase, I see https://github.com/department-of-veterans-affairs/va.gov-cms/blob/main/docroot/modules/custom/va_gov_github/src/Commands/ApiClientCommands.php#L238 as where the code might make a dispatch request to that workflow, but I can't find where that drush command is being used, if it is at all.

My take is to use the GH Workflow as the source of truth and have the CMS ping the workflow run to determine status rather than trying to make the CMS the source of truth.

I know from looking at the queues before that the queue payload is not used and therefore I don't think the queue is necessary. Furthermore, the state machine always progressed from ready back to ready without any other meaningful state transitions. To me, it looks like the status is always set back to ready and then moved onto complete or back to ready. So, it's like a boolean isBuilding flag more than a state machine...but I could be missing something.

I have to look at the code more, but relying on GitHub to update the CMS could end with the build stuck in pending, whereas pinging a certain workflow run should always return something even if it is not a 200. Also, there is a decent bit of code that pings the CMS with authenticated requests that I don't think is necessary, since the CMS can ping GitHub to figure out the status of things. Once again, I could be missing things, but I think the CMS pinging GH to start a run or check the status will be easier to maintain than having both the CMS and GH pinging back and forth.

Here's how this would go, following along with the way I see it currently done in the content-release workflow.

  1. The CMS has to know the status of any current GH Workflows building content when a user goes to a page where they can request a build or view the status of the build. This is also true for the continuous build. Whether some state variable is ready or not doesn't really matter IMHO. GH is the source of truth. I think the status of any content-release workflow could be obtained by filtering the list of current workflow runs by workflow ID: https://docs.github.com/en/rest/actions/workflow-runs?apiVersion=2022-11-28#list-workflow-runs-for-a-repository
  2. Based on the workflow runs and status, the build can be triggered or it can't and something is logged that the build started or the build is still running. If a user is on a screen trying to trigger a build, loading the page pings the API to detect the status.
  3. Perhaps keep a variable that is a buildrequest boolean so users can click to release content and have something happen vs. checking back when a build isn't running. Currently, when a user clicks the button it adds a queue item but realistically there's always other items in the queue so I guess it only matters after hours.
  4. The UI for releasing content can ping the API when it loads but it should also ping periodically so the user can keep the page open and get updates without refreshing.
  5. Build time variables are not needed since they can be derived from the GH API as long as the build is split into multiple jobs.
  6. I think step metrics should be derived from the GH API by splitting them up into jobs. Also, Datadog has a GH Actions integration that ingests logs, so there is no need to send that data over if the integration can be used: https://www.datadoghq.com/blog/datadog-github-actions-ci-visibility/
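
To make the first two steps concrete, here is a hedged Python sketch of how the parsed "list workflow runs" response could be filtered by workflow ID. It works on an already-decoded JSON payload and makes no network call; `active_runs`, `can_trigger_build`, and the sample IDs are made up for illustration, while `workflow_runs`, `workflow_id`, `status`, and `conclusion` are fields of the REST response.

```python
def active_runs(payload, workflow_id):
    """Return runs of one workflow that have not completed yet."""
    return [
        run
        for run in payload.get("workflow_runs", [])
        if run.get("workflow_id") == workflow_id and run.get("status") != "completed"
    ]


def can_trigger_build(payload, workflow_id):
    """A new release may start only when no run of that workflow is active."""
    return not active_runs(payload, workflow_id)


# Trimmed, made-up payload in the shape of the REST response.
sample = {
    "total_count": 2,
    "workflow_runs": [
        {"id": 2, "workflow_id": 111, "status": "in_progress", "conclusion": None},
        {"id": 1, "workflow_id": 111, "status": "completed", "conclusion": "success"},
    ],
}
print(can_trigger_build(sample, 111))  # False
```

Treating every status other than "completed" as active is a deliberate choice here, so that queued runs also block a second request.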

So, I will now look into the first step of how the CMS knows if a content release workflow is running, and if it is, how the status of the run is reported. If no build is running and it is outside of the continuous release hours (or really just some canUserTriggerBuild logic), then the user can hit the button to trigger a build.

I'm not even sure using the va_gov_github module will be useful since I see it wrapping Github\Client with a bunch of abstraction but no extra features that I can tell...at least it will keep the complexity lower to keep that part out for now. I will use the form on "/admin/content/deploy/simple" to test getting GH Workflows info using Github\Client.

alexfinnarn commented 7 months ago

and I'm guessing this is the next-build workflow that should be referenced in any API calls: https://github.com/department-of-veterans-affairs/next-build/blob/main/.github/workflows/content-release.yml

alexfinnarn commented 7 months ago

I was thinking through how to have the content-release workflow run on demand as well as on a schedule using as much of the GitHub API as possible, and the workflow_dispatch + enable/disable workflow endpoint seems to cover the use case.

Workflows can be enabled and disabled via an endpoint: https://docs.github.com/en/rest/actions/workflows?apiVersion=2022-11-28#disable-a-workflow This can be used to turn off continuous building by having a workflow that runs during the scheduled business hours enabled or disabled with an API call.

When set to enabled this workflow would call the content-release workflow whenever the content-release workflow completes. You can do different things if the workflow completes or fails: https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#running-a-workflow-based-on-the-conclusion-of-another-workflow

on:
  workflow_run:
    workflows: [Content Release]
    types: [completed]

jobs:
  on-success:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    env:
      # The gh CLI is preinstalled on GitHub-hosted runners but needs a token.
      GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
      # Without a checkout step, gh must be told which repo to target.
      - run: gh workflow run content-release.yml --repo ${{ github.repository }}
  on-failure:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    steps:
      # Placeholder script; it would need a checkout step to exist on the runner.
      - run: ./log-stuff.sh

Another workflow could accept a workflow_dispatch event and then call the content-release workflow. This would be the on-demand after-hours deployment request.

All API calls to GitHub could originate in the CMS either from user input or from a background script. However, I only think it makes sense to have on-demand and disable/enable workflow calls coming from the Drupal CMS. The continuous building shouldn't require any communication from the CMS, at least I don't see why it would be necessary.
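
The two kinds of calls that would still come from the CMS (on-demand dispatch and enable/disable) map onto three REST endpoints. The sketch below builds those requests in Python without sending them; it is an illustration only, since the CMS itself would do this in PHP. The helper names, the owner/repo values, and the `main` ref are assumptions, while the endpoint paths and headers follow the GitHub REST documentation linked in this thread.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def github_request(method, path, token, body=None):
    """Build (but do not send) an authenticated GitHub REST API request."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        f"{API_ROOT}{path}",
        data=data,
        method=method,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "X-GitHub-Api-Version": "2022-11-28",
        },
    )


def dispatch_request(owner, repo, workflow_file, token, ref="main"):
    """On-demand release: POST a workflow_dispatch event for the workflow."""
    path = f"/repos/{owner}/{repo}/actions/workflows/{workflow_file}/dispatches"
    return github_request("POST", path, token, {"ref": ref})


def set_enabled_request(owner, repo, workflow_file, token, enabled):
    """Turn a workflow on or off: PUT .../enable or .../disable."""
    action = "enable" if enabled else "disable"
    path = f"/repos/{owner}/{repo}/actions/workflows/{workflow_file}/{action}"
    return github_request("PUT", path, token)
```

Passing a returned request to `urllib.request.urlopen` would actually send it; keeping construction separate from sending makes the calls easy to inspect and test against a personal repo first.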

Since some of the "workflows calling workflows" code needs to have things on a main/default branch, I think I will use my personal repos to test this out as I've never attempted much more than very basic GH Workflows. Also, I'm not really testing the workflows, I'm simply trying to test how to call, check status, disable/enable, and re-use workflows. I'll still make a branch for the Drupal code, but it will start by making API calls to test repos I control so as to not disturb VA devs.

alexfinnarn commented 7 months ago

I updated the content status release form with details from the GH workflow. This is only targeting the production content release details, but the QA/testing/Tugboat content release uses a log to show what the status is. I think the QA/testing/Tugboat release code should be updated to use commit flags and thus remove all needs for the state machine and content release queues.

Screenshot 2024-04-22 at 1 48 27 PM

The status block shows:

  1. The previous build status from the GH workflow, which should be "completed".
  2. The current build status from the GH workflow, which should be "in_progress".
  3. Last run start time, which could be better formatted.
  4. Duration of the current build workflow.
  5. Whether the current time is within business hours.

If the time is within business hours, the content release request form is disabled, since the build is continuous during that period. Otherwise, someone can check the acknowledgment checkbox and submit the form to make a request to run the GH Workflow.

That is the basic outline of the UI changes I'm proposing and have committed to code. I will now look into implementing the GH Workflow code I mentioned previously that will trigger the content release workflow.

alexfinnarn commented 7 months ago

I'm testing this out on a personal repo: https://github.com/alexfinnarn/moz/actions with two workflows: one to run something and the other to watch for when it completes and re-run the workflow. This should work, but I'm running into a token error.

Run gh workflow run content_release.yml
  gh workflow run content_release.yml
  shell: /usr/bin/bash -e {0}
  env:
    GH_TOKEN: ***
could not create workflow dispatch event: HTTP 403 Resource not accessible by integration

Angry threads about this:

I think this can be taken care of with a PAT, but it would make more sense for the CLI to just work. Some kind of concern from GH about recursive workflows or something...

alexfinnarn commented 7 months ago

Screenshot 2024-04-23 at 3 25 36 PM

Total BS. You should be able to continuously run within the workflow if there are restrictions with calling workflows from other workflows. I guess I will try using the GH token over REST endpoint now, but that could get me the same error as with the CLI command.

alexfinnarn commented 7 months ago

As I'm learning about GH Workflows and how to make them more dynamic and re-usable, here is an example of taking current code and making it more dynamic.

The current content-build content release workflow has a line about some debug boolean variable: https://github.com/department-of-veterans-affairs/content-build/blob/main/.github/workflows/content-release.yml#L22 To update this variable, you would have to edit via GH admin UI or via an API call.

However, you can also declare input variables, via workflow_call for a reusable workflow or via workflow_dispatch for a dispatch call, so that the caller provides the debug value, like this:

on:
  workflow_call:
    inputs:
      debug:
        required: true
        type: string

env:
  ACTIONS_RUNNER_DEBUG: ${{ inputs.debug }}

I bet passing in variables as inputs to workflows could help in several places.

alexfinnarn commented 7 months ago

Based on the answer from https://stackoverflow.com/a/75250838 I was able to go to my settings and change the permissions to read and write instead of only read:

Screenshot 2024-04-24 at 1 47 49 PM

And this did allow for the flow: Content Release -> Continuous Release -> Content Release

However, it stopped after triggering the content release once. On https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_run there is a note:

You can't use workflow_run to chain together more than three levels of workflows. For example, if you attempt to trigger five workflows (named B to F) to run sequentially after an initial workflow A has run (that is: A → B → C → D → E → F), workflows E and F will not be run.

I only count two chains in my example, but solely using GH Workflows to trigger other workflows in a loop might be impossible.

So at this point, it's probably smartest to use a schedule that checks every five minutes since that is the shortest interval of time allowed. Check if it is during business hours and if a content release workflow is running. If not, then call the content release workflow.

alexfinnarn commented 7 months ago

I figured out how to get a workflow to run continuously solely on GH via Workflows. I will post the complete workflow file since the workflow will never live in this CMS repo and it would need to be added to next-build. I tested this on a personal repo.

name: Continuous Release

on:
  workflow_dispatch:
  schedule:
    # Run every five minutes.
    - cron: '*/5 * * * *'

jobs:
  trigger_release:
    env:
      GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          persist-credentials: false

      - name: During business hours?
        # Check to see if it is a weekday between 8am and 8pm in "America/New_York" timezone.
        # Times taken from RunsDuringBusinessHours.php::isCurrentlyDuringBusinessHours()
        run: |
            export TZ="America/New_York"
            echo "Current time: $(date)"
            if [ $(date +%u) -lt 6 ] && [ $(date +%H) -ge 8 ] && [ $(date +%H) -lt 20 ]; then
              echo "It is during business hours."
              echo "BUSINESS_HOURS=true" >> $GITHUB_ENV
            else
              echo "It is not during business hours."
              echo "BUSINESS_HOURS=false" >> $GITHUB_ENV
            fi

      - name: Content Release running?
        run: |
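            # Treat queued runs as running too, so a pending release is not dispatched twice.
            RUNNING_WORKFLOWS=$(gh run list --workflow "Content Release" --json status --jq '.[] | select(.status == "in_progress" or .status == "queued")')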
            if [ -n "$RUNNING_WORKFLOWS" ]; then
                echo "Content Release is already running."
                echo "RELEASE_WORKFLOW_RUNNING=true" >> $GITHUB_ENV
            else
                echo "Content Release is not running."
                echo "RELEASE_WORKFLOW_RUNNING=false" >> $GITHUB_ENV
            fi

      - name: Run Content Release
        if: env.BUSINESS_HOURS == 'true' && env.RELEASE_WORKFLOW_RUNNING == 'false'
        run: |
          gh workflow run content_release.yml

The content_release.yml workflow is within the same repo so it can easily be run with the GH token. One check looks to see if the time is during business hours, and the other check determines if the content build is already running. If it is within business hours and there is no current build running, the content release/build workflow gets kicked off.

The GH runners don't operate exactly every five minutes but seemed to run within 10 minutes. So, this isn't entirely continuous per se, but it does keep the content release going during business hours without needing the CMS to function. Granted, the content build can't succeed without the CMS so it is a moot point, but at least the sample code provides an option to do things this way.

I will now look into doing the same thing via a script like the queue_runner.sh that runs continuously on Tugboat. Theoretically, a script could run the same code as in the workflow file I pasted and then make an API call to dispatch a content build workflow run.

The benefit is checking more often for when the content build is not running to trigger it. This code can go in this repo in the branch I am adding sample code to.
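
A background script of that kind boils down to a polling loop. The Python sketch below shows the shape of such a loop with the checks injected as callables, so it can be exercised without GitHub or the CMS; `run_release_loop`, its parameters, and the 60-second default interval are assumptions for illustration, not the queue_runner.sh approach itself.

```python
def run_release_loop(is_business_hours, build_is_running, start_build, sleep,
                     interval_seconds=60, max_checks=None):
    """Poll on a fixed interval and start a build whenever none is running.

    All four callables are supplied by the caller; `max_checks=None` means
    loop forever. Returns how many builds were started.
    """
    started = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        # Same two checks as the workflow file above, just run more often.
        if is_business_hours() and not build_is_running():
            start_build()
            started += 1
        sleep(interval_seconds)
    return started
```

In production `sleep` would be `time.sleep`, `build_is_running` would query the workflow runs endpoint, and `start_build` would send the dispatch request.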