dflook / terraform-github-actions

GitHub actions for terraform

Support non-ephemeral runners #104

Closed toast-gear closed 2 years ago

toast-gear commented 3 years ago

If Docker changes any files or folders on disk, it changes their owner to root:root. This causes problems for runners that are not ephemeral: a subsequent run of a workflow will fail because the checkout action is unable to clean the folder, due to permission errors on the .terraform folder.
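
A minimal sketch of how the leftover root-owned files could be spotted on a runner host before the next checkout trips over them. The function name `list_foreign_owned` is hypothetical; it only relies on `find` and `id`, and on the `GITHUB_WORKSPACE` variable that the runner sets.

```shell
# List every entry under a directory that is NOT owned by the current user.
# On a non-ephemeral runner, anything printed here is a candidate for a
# "Permission denied" failure when the next checkout cleans the workspace.
list_foreign_owned() {
  dir="$1"
  # `! -user <uid>` matches entries whose owner differs from the given uid.
  find "$dir" ! -user "$(id -u)" -print
}

# Only scan automatically when running inside a GitHub Actions job.
if [ -n "${GITHUB_WORKSPACE:-}" ]; then
  list_foreign_owned "$GITHUB_WORKSPACE"
fi
```

Run as the user that owns the runner service, an empty result suggests the workspace can be cleaned without permission errors.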

This can be worked around via peter-murray/reset-workspace-ownership-action:

      - name: Get Actions user id
        id: get_uid
        run: |
          actions_user_id=`id -u $USER`
          echo $actions_user_id
          echo ::set-output name=uid::$actions_user_id
      - name: Correct Ownership in GITHUB_WORKSPACE directory
        uses: peter-murray/reset-workspace-ownership-action@v1
        with:
          user_id: ${{ steps.get_uid.outputs.uid }}

This is a faff, however, and it would be nicer if the issue could be resolved natively without needing yet another action. I've raised a PR against another terraform action that has the same problem: https://github.com/bridgecrewio/checkov-action/pull/59. Would a similar solution be workable here too?

dflook commented 3 years ago

GitHub insists on running docker based actions as root, so this will affect any docker based action that writes to disk.

Can you confirm there is a permission problem with the .terraform directory? This should not be written to the workspace. I can imagine this is an issue for the plan outputs of dflook/terraform-plan that were added in v1.16.0 though.

I have to admit, I don't understand the linked PR. As far as I can tell, the '--user' flag is for docker - but the args are passed to the entrypoint? Or does the entrypoint have to be updated to use the --user flag?

toast-gear commented 3 years ago

> I can imagine this is an issue for the plan outputs of dflook/terraform-plan that were added in v1.16.0 though.

Yes, it will affect any action that writes to disk on a mounted volume; not all terraform commands write to .terraform/.

> Can you confirm there is a permission problem with the .terraform directory? This should not be written to the workspace.

Let me get screenshots tomorrow. We have peter-murray/reset-workspace-ownership-action in basically every terraform pipeline and so I'll need to make a custom one to demonstrate the issue.

> I have to admit, I don't understand the linked PR. As far as I can tell, the '--user' flag is for docker - but the args are passed to the entrypoint? Or does the entrypoint have to be updated to use the --user flag?

In the linked PR, the change makes it so that when the docker://bridgecrew/checkov:2.0.469 image is run, it runs under the specified user ID, if one is provided. The ID provided is that of the user running the runner service; it doesn't need to exist in the container. All it does is set the docker --user flag so that the container runs under a user ID that matches the user running the runner service. As a result, when the checkout action does a git clean on the subsequent run, there is no permission conflict.
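
As a rough illustration of the --user idea (not the actual code from the linked PR), the sketch below assembles a `docker run` command that uses the caller's uid and gid. The function name `docker_run_as_runner_user` and the `/github/workspace` mount target are assumptions made for this example; `--rm`, `--user`, `--volume` and `--workdir` are standard `docker run` flags.

```shell
# Build a `docker run` command line that runs the container as the user who
# owns the runner's workspace, so files written to the mount keep that owner.
docker_run_as_runner_user() {
  image="$1"
  workspace="$2"
  shift 2
  # Print the command instead of executing it, so the sketch is safe to run
  # on a machine without Docker; drop the `echo` to execute it for real.
  echo docker run --rm \
    --user "$(id -u):$(id -g)" \
    --volume "$workspace:/github/workspace" \
    --workdir /github/workspace \
    "$image" "$@"
}

docker_run_as_runner_user bridgecrew/checkov:2.0.469 "$PWD"
```

Because the uid only has to match on the host side of the bind mount, no matching account needs to exist inside the image.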

If we were using ephemeral runners this entire problem would be circumvented; however, my company isn't, and it is going to take a while to migrate away from static runners.

dflook commented 3 years ago

v1.17.1 was just released, which fixes the ownership of any files created in runner mounted directories like the workspace

toast-gear commented 3 years ago

Cheers pal, should hopefully get around to testing it on my end in the next few days, looks solid though so I'm sure the fix works a treat.

toast-gear commented 3 years ago

I think this is causing problems in our environment. dflook/terraform-validate is bombing out.

Failing jobs are using: danielflook/terraform-github-actions@sha256:7340e0fda478b550b89feaa389a4397946e29a841f86ac39397a771ba205e06e

Successful jobs are using: danielflook/terraform-github-actions@sha256:07cd443fbd4fc64bddf6901cfb1e6daff9f4b3935e68324bf40d395fb2ad6a7f

The pipeline step that is failing:

      - name: Terraform Validate
        uses: dflook/terraform-validate@v1
        with:
          path: ${{ matrix.submodule }}
          label: ${{ matrix.submodule }}
        env:
          TERRAFORM_SSH_KEY: ${{ secrets.SSH_KEY }}

and our strategy is:

    strategy:
      fail-fast: false
      matrix:
        include:
          # tf contained under the child folders
          - submodule: folder-at-root-of-repo/folder
          - submodule: folder-at-root-of-repo/folder

The errors we're seeing are:

  Error: Failed to install provider from shared cache

  Error while importing hashicorp/kubernetes v2.5.0 from the shared cache
  directory: provider binary not found: could not find executable file starting
  with terraform-provider-kubernetes.

...

 6 problems:

- Failed to instantiate provider "registry.terraform.io/hashicorp/aws" to
obtain schema: unknown provider "registry.terraform.io/hashicorp/aws"
- Failed to instantiate provider "registry.terraform.io/hashicorp/kubernetes"
to obtain schema: unknown provider
"registry.terraform.io/hashicorp/kubernetes"
- Failed to instantiate provider "registry.terraform.io/hashicorp/local" to
obtain schema: unknown provider "registry.terraform.io/hashicorp/local"
- Failed to instantiate provider "registry.terraform.io/hashicorp/null" to
obtain schema: unknown provider "registry.terraform.io/hashicorp/null"
- Failed to instantiate provider "registry.terraform.io/hashicorp/random" to
obtain schema: unknown provider "registry.terraform.io/hashicorp/random"
- Failed to instantiate provider "registry.terraform.io/hashicorp/template" to
obtain schema: unknown provider "registry.terraform.io/hashicorp/template"

I'm getting the teams to pin back to the previous release to see if that fixes it.

dflook commented 3 years ago

I'd like to be able to reproduce this. What can you tell me about how your runners are set up? Could you enable debug logging by setting the ACTIONS_STEP_DEBUG secret to true? Do you use any other docker based actions that use terraform?

toast-gear commented 3 years ago

> I'd like to be able to reproduce this. What can you tell me about how your runners are set up?

static runners unfortunately :(

> Could you enable debug logging by setting the ACTIONS_STEP_DEBUG secret to true?

this produces quite a lot of output so I'll try picking out the bits that look useful:

terraform binary selection:

  ##[debug]ls -lad /root/.terraform.versions:lrwxrwxrwx 1 root root 63 Oct  7 15:57 /root/.terraform.versions -> /github/home/.dflook-terraform-github-actions/terraform-bin-dir
  ##[debug]ls -lad /github/home/.dflook-terraform-github-actions/terraform-bin-dir:drwxr-xr-x 3 51982 51982 4096 Oct  7 15:57 /github/home/.dflook-terraform-github-actions/terraform-bin-dir
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:total 80828
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:drwxr-xr-x 3 51982 51982     4096 Oct  7 15:57 .
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:drwxr-xr-x 3 51982 51982     4096 Oct  4 15:04 ..
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:drwxr-xr-x 2 51982 51982     4096 Oct  6 17:16 .terraform.versions.default
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:-rw-r--r-- 1 51982 51982        8 Oct  7 15:57 RECENT
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:-rwxr-xr-x 1 51982 51982 82749972 Oct  7 15:57 terraform_0.14.11
  ##[debug]tfswitch --version:
  ##[debug]tfswitch --version:Version: 0.8.832
  Reading required version from terraform file, constraint: ~> 0.14.0
  Switched terraform to version "0.14.11" 
  ##[debug]ls -la /usr/local/bin/terraform:lrwxrwxrwx 1 root root 43 Oct  7 15:57 /usr/local/bin/terraform -> /root/.terraform.versions/terraform_0.14.11
  ##[debug] Terraform version major 0 minor 14 patch 11
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:total 80828
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:drwxr-xr-x 3 51982 51982     4096 Oct  7 15:57 .
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:drwxr-xr-x 3 51982 51982     4096 Oct  4 15:04 ..
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:drwxr-xr-x 2 51982 51982     4096 Oct  6 17:16 .terraform.versions.default
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:-rw-r--r-- 1 51982 51982        8 Oct  7 15:57 RECENT
  ##[debug]ls -la /github/home/.dflook-terraform-github-actions/terraform-bin-dir:-rwxr-xr-x 1 51982 51982 82749972 Oct  7 15:57 terraform_0.14.11
  ::endgroup::

/github/home permissions

##[debug]ls -la /github/home:total 1400
##[debug]ls -la /github/home:drwxr-xr-x 346 51982 51982 20480 Oct  7 11:42 .
##[debug]ls -la /github/home:drwxr-xr-x   6 root  root   4096 Oct  7 15:02 ..
##[debug]ls -la /github/home:drwxr-xr-x   3 root  root   4096 Sep 30 20:16 .cache
##[debug]ls -la /github/home:drwxr-xr-x   2 root  root   4096 Sep 30 14:33 .dflook-terraform-bin-dir
##[debug]ls -la /github/home:drwxr-xr-x   4 root  root   4096 Oct  4 08:59 .dflook-terraform-data-dir
##[debug]ls -la /github/home:drwxr-xr-x   3 51982 51982  4096 Oct  4 15:04 .dflook-terraform-github-actions
##[debug]ls -la /github/home:drwxr-xr-x   3 51982 51982  4096 Sep 30 14:33 .terraform.d
##[debug]ls -la /github/home:drwxr-xr-x   2 root  root   4096 Oct  6 19:48 .yor_plugins
##[debug]ls -la /github/home:drwxr-xr-x   2 root  root   4096 Sep 30 14:33 1291442970-krygefxy
...

the above is in contrast to the workspace where the user running the runner service owns everything:

/github/workspace permissions

##[debug]pwd:/github/workspace
##[debug]ls -la:drwxr-xr-x 17 51982 51982 4096 Oct  7 15:02 .
##[debug]ls -la:drwxr-xr-x  6 root  root  4096 Oct  7 15:02 ..
##[debug]ls -la:drwxr-xr-x  8 51982 51982 4096 Oct  7 15:02 .git
##[debug]ls -la:drwxr-xr-x  3 51982 51982 4096 Sep 30 20:10 .github
##[debug]ls -la:-rw-r--r--  1 51982 51982   30 Sep 30 20:10 .gitignore
##[debug]ls -la:-rw-r--r--  1 51982 51982  342 Sep 30 20:10 README.md
...

terraform init output below

  Downloading cloudposse/label/null 0.24.1 for web-node-group.label...
  - web-node-group.label in /tmp/terraform-data-dir/modules/web-node-group.label
  Downloading cloudposse/label/null 0.24.1 for web-node-group.this...
  - web-node-group.this in /tmp/terraform-data-dir/modules/web-node-group.this

  Initializing provider plugins...
  - terraform.io/builtin/terraform is built in to Terraform
  - Finding hashicorp/aws versions matching ">= 2.0.0, >= 3.0.0"...
  - Finding hashicorp/kubernetes versions matching ">= 1.0.0"...
  - Finding hashicorp/tls versions matching ">= 2.2.0"...
  - Finding hashicorp/template versions matching ">= 2.0.0"...
  - Finding hashicorp/null versions matching ">= 2.0.0"...
  - Finding hashicorp/local versions matching ">= 1.3.0"...
  - Finding hashicorp/random versions matching ">= 2.0.0"...
  - Using hashicorp/kubernetes v2.5.0 from the shared cache directory
  - Using hashicorp/tls v3.1.0 from the shared cache directory
  - Using hashicorp/template v2.2.0 from the shared cache directory
  - Using hashicorp/null v3.1.0 from the shared cache directory
  - Using hashicorp/local v2.1.0 from the shared cache directory

  Error: Failed to install provider from shared cache

  Error while importing hashicorp/kubernetes v2.5.0 from the shared cache
  directory: provider binary not found: could not find executable file starting
  with terraform-provider-kubernetes.

...

schema problems?

7 problems:

- Failed to instantiate provider "registry.terraform.io/hashicorp/aws" to
obtain schema: unknown provider "registry.terraform.io/hashicorp/aws"
- Failed to instantiate provider "registry.terraform.io/hashicorp/kubernetes"
to obtain schema: unknown provider
...

Could it be that the ownership fix needs to be done as a post-entrypoint script process?
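
One way to narrow down the "provider binary not found" error above would be to check whether the cached provider file exists and is executable for the current user. This is only a diagnostic sketch: `check_provider_binary` is a hypothetical helper, and the cache directory has to be supplied by the caller, since the thread does not say where the shared cache lives. `TF_PLUGIN_CACHE_DIR` is Terraform's standard variable for a plugin cache, but whether it is set in this environment is an assumption.

```shell
# Report whether a provider binary below a cache directory is executable.
# Returns 0 if at least one executable terraform-provider-<name>* file exists.
check_provider_binary() {
  cache_dir="$1"
  name="$2"
  status=1
  while IFS= read -r f; do
    # The here-document yields one empty line when find prints nothing.
    [ -n "$f" ] || continue
    if [ -x "$f" ]; then
      echo "ok: $f"
      status=0
    else
      echo "not executable: $f"
    fi
  done <<EOF
$(find "$cache_dir" -type f -name "terraform-provider-${name}*")
EOF
  if [ "$status" -ne 0 ]; then
    echo "no executable terraform-provider-${name} found under $cache_dir"
  fi
  return "$status"
}

# Only runs when a plugin cache location is actually configured.
if [ -n "${TF_PLUGIN_CACHE_DIR:-}" ]; then
  check_provider_binary "$TF_PLUGIN_CACHE_DIR" kubernetes
fi
```

A "not executable" line would point at a mode or ownership problem in the cache, while no match at all would suggest the cache entry is incomplete.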

sanitised pipeline:

on: [pull_request]

jobs:
  Terraform-plan:
    runs-on: self-hosted
    env:
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    strategy:
      fail-fast: false
      matrix:
        include:
          # There is terraform under the child folders, backend.tf, providers.tf, locals.tf, main.tf (which sources a terraform module from git)
          - submodule: parentFolder/child-1
          - submodule: parentFolder/child-2
    steps:

      # I am aiming to have this action removed through this issue, isn't removed yet as a double chown shouldn't matter
      - name: Get Actions user id
        id: get_uid
        run: |
          actions_user_id=`id -u $USER`
          echo $actions_user_id
          echo ::set-output name=uid::$actions_user_id
      - name: Correct Ownership in GITHUB_WORKSPACE directory
        uses: peter-murray/reset-workspace-ownership-action@v1
        with:
          user_id: ${{ steps.get_uid.outputs.uid }}

      - name: Checkout from GitHub
        uses: actions/checkout@v2

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          role-to-assume: arn:aws:iam::***************:role/my-amazing-role
          aws-region: ***************
          role-duration-seconds: 900

      - name: Terraform Linting (base)
        uses: dflook/terraform-fmt@v1
        with:
          path: ${{ matrix.submodule }}
          label: ${{ matrix.submodule }}
        env:
          TERRAFORM_SSH_KEY: ${{ secrets.SSH_KEY }}

      # atm it blows up here before we get to the plan
      - name: Terraform Validate
        uses: dflook/terraform-validate@v1
        with:
          path: ${{ matrix.submodule }}
          label: ${{ matrix.submodule }}
        env:
          TERRAFORM_SSH_KEY: ${{ secrets.SSH_KEY }}

      - name: Terraform Plan
        uses: dflook/terraform-plan@v1
        with:
          path: ${{ matrix.submodule }}
          label: ${{ matrix.submodule }}
        env:
          TERRAFORM_SSH_KEY: ${{ secrets.SSH_KEY }}

toast-gear commented 3 years ago

I don't think I can provide much more detail: because this all happens inside the container, there isn't a way for me to get any output about the state of the files on disk outside of the mounted volumes. Where should the terraform init artefacts end up on disk?

dflook commented 2 years ago

.terraform is a temporary directory in the container, so every step starts with it empty. Do you have multiple runner processes running on the same host, and do they share the runner.temp directory? How many times did it fail this way?

toast-gear commented 2 years ago

I refreshed our runner instances and it solved that weird init issue. However, I then put our actions back to using the v1 tag and we got the errors below from the checkout action trying to clean the repo:

Cleaning the repository:
  /usr/bin/git clean -ffdx
  warning: failed to remove .dflook-terraform-github-actions/hlgwsdhe/plan.txt: Permission denied
  warning: failed to remove .dflook-terraform-github-actions/hlgwsdhe/plan.json: Permission denied
  warning: failed to remove .dflook-terraform/token-cache/as78d568sf568ds5f6d7s5f67ds5fd7s5fd67s5f67ds5f7ds5f7ds65f7d6s5f67ds: Permission denied
  Removing parentFolder/child-1/.terraform.lock.hcl
Warning: Unable to clean or reset the repository. The repository will be recreated instead.
Deleting the contents of '/actions-runner/_work/repo/repo'
Error: Command failed: rm -rf "/actions-runner/_work/repo/repo/.dflook-terraform"

I think it's just more files that need the ownership fix applied to them.

From the looks of it, most of them are generated by this action in all cases. In the case of .terraform.lock.hcl, we should be checking it into source control, but we don't on all repos, so the action should probably assume it may be generating the file and apply the ownership fix if it does.

dflook commented 2 years ago

It looks like those files were left behind by an old version of these actions from before the ownership fix. Can you manually delete the workspace and try again?

toast-gear commented 2 years ago

sure, let me refresh now, will respond within 10 mins

toast-gear commented 2 years ago

Kicked it off. However, from the diff at https://github.com/dflook/terraform-github-actions/compare/v1.17.0...v1.17.1 I thought some of those files would still have the issue, e.g. the plan.* files and .dflook-terraform/?

dflook commented 2 years ago

Everything in 1.17.1 is now inside .dflook-terraform-github-actions, which gets the ownership changed recursively.
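
The recursive ownership change can be pictured roughly as below. This is a sketch of the general technique rather than the code shipped in v1.17.1: `match_workspace_owner` is a hypothetical name, and `stat -c` assumes GNU coreutils, as found in typical Linux container images.

```shell
# Give a directory tree the same owner and group as the workspace it lives in.
match_workspace_owner() {
  workspace="$1"
  target="$2"
  # GNU stat: %u and %g are the numeric owner and group of the workspace.
  owner="$(stat -c '%u:%g' "$workspace")"
  chown -R "$owner" "$target"
}

# Only runs inside a job, and only if the action's directory exists.
if [ -n "${GITHUB_WORKSPACE:-}" ] && [ -d "$GITHUB_WORKSPACE/.dflook-terraform-github-actions" ]; then
  match_workspace_owner "$GITHUB_WORKSPACE" "$GITHUB_WORKSPACE/.dflook-terraform-github-actions"
fi
```

Copying the owner from the workspace itself avoids having to pass the runner's uid into the container.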

toast-gear commented 2 years ago

Yeh you're right. After a refresh, I monitored the workflow run on disk and it worked as expected. Running the workflow twice didn't result in any clean errors. Issue is resolved from my perspective and can be closed.

Thanks for looking into this as quickly as you did, much appreciated pal.

dflook commented 2 years ago

Great, glad it's working now!