GalleyBytes / terraform-operator

A Kubernetes CRD to handle terraform operations
http://tf.galleybytes.com
Apache License 2.0

Bad owner or permissions on ssh config when using private GIT repo #116

Open odise opened 2 years ago

odise commented 2 years ago

I noticed some odd behaviour when the Terraform code being deployed depends on private Git repositories. The terraform init step fails with the following error:

Connecting to raw.githubusercontent.com (185.199.109.133:443)
saving to '/home/tfo-runner/generations/6/init.sh'
init.sh              100% |********************************|  2762  0:00:00 ETA
'/home/tfo-runner/generations/6/init.sh' saved
Initializing modules...
Downloading git@github.com:xxx/lqt-gitops.git?ref=DOC-1171 for db...

│ Error: Failed to download module
│
│ Could not download module "db" (main.tf:29) source code from
│ "git@github.com:xxx/gitops.git?ref=DOC-1171": error downloading
│ 'ssh://git@github.com/xxx/gitops.git?ref=DOC-1171': /usr/bin/git
│ exited with 128: Cloning into '.terraform/modules/db'...
│ Bad owner or permissions on /home/tfo-runner/.ssh/config
│ fatal: Could not read from remote repository.
│
│ Please make sure you have the correct access rights
│ and the repository exists.

Interestingly, the setup step runs successfully even though it depends on an internal Git repository. The error above results from terraform init trying to download further modules from the same Git repo.

Here is my Terraform manifest:

apiVersion: tf.isaaguilar.com/v1alpha2
kind: Terraform
metadata:
  name: example
  namespace: lqt-srv-1
spec:
  terraformVersion: 1.0.0
  # Pull this module to execute
  terraformModule:
    source: "git@github.com:xxx/gitops.git//some/module?ref=BRANCH"

  # Use kubernetes as a backend which is available for terraform >= v0.13
  backend: |-
    terraform {
      backend "s3" {
        region         = "eu-central-1"
        bucket         = "tf-state"
        key            = "terraform.tfstate"
      }
    }
  ignoreDelete: false

  # Create a tfvar env for the terraform to use
  taskOptions:
  - for:
    - '*' # The following config affects all task pods
    env:
    - name: TF_VAR_vpc_name
      value: blah
    - name: TF_VAR_client_id
      value: acme
    - name: TF_VAR_environment
      value: test

  scmAuthMethods:
    - git:
        ssh:
          sshKeySecretRef:
            key: github-ssh-key
            name: github-ssh-key-aws-managed-key
            namespace: lqt-srv-1
      host: github.com

  keepCompletedPods: true
  keepLatestPodsOnly: true
  serviceAccount: tf-operator-service-account

I defined a preinit step to inspect the /home/tfo-runner/.ssh directory and found this:

bash-5.1$ ls -al /home/tfo-runner/.ssh/
total 16
drwxrwsr-x    2 tfo-runn 2000          4096 Oct  7 12:40 .
drwxrwsr-x    6 root     2000          4096 Oct  6 18:14 ..
-rw-rw----    1 tfo-runn 2000           126 Oct  7 12:40 config
-rw-rw----    1 tfo-runn 2000           399 Oct  7 12:40 github.com

All attempts to change the permissions within preinit had no effect: init still failed with the same error.

I'm using the Helm chart v0.2.15 from https://galleybytes.github.io/helm-charts which installs terraform-operator:v0.9.0-pre3.

isaaguilar commented 2 years ago

I was able to replicate the issue on the arm64v8 architecture, but not on amd64. I'll have to dig to see why the different builds have different results for ssh keys.

I ran a test as similar as possible to yours. This is the config that works for amd and not for arm:

```yaml
apiVersion: tf.isaaguilar.com/v1alpha2
kind: Terraform
metadata:
  name: example
  namespace: default
spec:
  terraformVersion: 1.0.0
  # Pull this module to execute
  terraformModule:
    source: "git@github.com:isaaguilar/simple-aws-tf-modules.git//private-github-module"

  # Use kubernetes as a backend which is available for terraform >= v0.13
  backend: |-
    terraform {
      backend "s3" {
        region = "us-east-1"
        bucket = "my-terraform-state-bucket"
        key    = "terraform-operator/my/awesome/example.tfstate"
      }
    }
  ignoreDelete: true

  # Create a tfvar env for the terraform to use
  taskOptions:
  - for:
    - '*' # The following config affects all task pods
    env:
    - name: TF_VAR_vpc_name
      value: blah
    - name: TF_VAR_client_id
      value: acme
    - name: TF_VAR_environment
      value: test
    envFrom:
    - secretRef:
        name: aws-session-credentials # temp creds for my bucket

  scmAuthMethods:
  - git:
      ssh:
        sshKeySecretRef:
          key: key
          name: gitsshkey
          namespace: default
    host: github.com

  keepCompletedPods: true
  keepLatestPodsOnly: true
  serviceAccount: tf-operator-service-account
```

The main module is simple:

```hcl
# https://github.com/isaaguilar/simple-aws-tf-modules/blob/master/private-github-module/main.tf

output "static" {
  value = "static"
}

module "private" {
  // This source is private
  source = "git@github.com:isaaguilar/terraform-do-something-awesome.git?ref=main"
}

terraform {
  required_version = "> 0.12"
}
```

And indeed my logs match yours:

```
Connecting to raw.githubusercontent.com (185.199.111.133:443)
saving to '/home/tfo-runner/generations/1/init.sh'
init.sh              100% |********************************|  2762  0:00:00 ETA
'/home/tfo-runner/generations/1/init.sh' saved
Initializing modules...
Downloading git@github.com:isaaguilar/terraform-do-something-awesome.git?ref=main for private...
╷
│ Error: Failed to download module
│
│ Could not download module "private" (main.tf:7) source code from
│ "git@github.com:isaaguilar/terraform-do-something-awesome.git?ref=main":
│ error downloading
│ 'ssh://git@github.com/isaaguilar/terraform-do-something-awesome.git?ref=main':
│ /usr/bin/git exited with 128: Cloning into '.terraform/modules/private'...
│ error: cannot run ssh: No such file or directory
│ fatal: unable to fork
│
╵
```

Any help determining why the arm build isn't working is greatly appreciated.

isaaguilar commented 2 years ago

Found the issue. In the arm build I don't have ssh installed.

~/generations/2/main$ ssh
bash: ssh: command not found

Fix should be relatively easy, but the tftask pods usually take a long time to build for all the versions. Hopefully I'll have them all updated for next week.
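A quick way to catch this kind of gap before terraform init runs is to check that the tools git relies on are actually on the PATH. The following is only a sketch: the function name `missing_commands` is hypothetical and is not part of the task scripts.

```shell
#!/bin/sh
# Hypothetical helper (not part of terraform-operator-tasks):
# print the names of any given commands that are not found on PATH,
# separated by spaces. Prints an empty line when nothing is missing.
missing_commands() {
    missing=""
    for cmd in "$@"; do
        # `command -v` is the POSIX way to test whether a command resolves
        if ! command -v "$cmd" >/dev/null 2>&1; then
            missing="$missing $cmd"
        fi
    done
    # Strip the leading space added by the loop
    printf '%s\n' "${missing# }"
}

# Example use inside a task pod (not executed here):
#   missing_commands git ssh
```

On an image without an ssh client, `missing_commands git ssh` would print `ssh`.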

isaaguilar commented 2 years ago

Oh, but in the error you posted, it has to do with Bad owner or permissions on /home/tfo-runner/.ssh/config. Then I haven't replicated that issue yet. Ok still looking then. :(

isaaguilar commented 2 years ago

@odise I made some changes in the task scripts to attempt to fix the .ssh dir. I hope this fixes the issue you're having.

https://github.com/GalleyBytes/terraform-operator-tasks/pull/9

To make use of the changes, the following additions to the spec may be used:

spec:
  # ... 
  taskOptions:
  - for: [ init, plan, apply, init-delete, plan-delete, apply-delete ]
    script:
      source: https://raw.githubusercontent.com/GalleyBytes/terraform-operator-tasks/always-attempt-to-fix-ssh/tf.sh
  - for: [ setup ]
    script:
      source: https://raw.githubusercontent.com/GalleyBytes/terraform-operator-tasks/always-attempt-to-fix-ssh/setup.sh

odise commented 2 years ago

@isaaguilar this seems to fix the issue. Just to satisfy my curiosity: I think I tried to achieve exactly the same with a preinit step. Why didn't it take effect though?

isaaguilar commented 2 years ago

The preinit should have worked. Perhaps the chmod in the preinit used 660 in order to produce -rw-rw---- permissions:

-rw-rw----    1 tfo-runn 2000           399 Oct  7 12:40 github.com

and the fix uses 600 which produces -rw------- permissions:

-rw------- 1 tfo-runner 2000 1.7K Oct 10 20:17 /home/tfo-runner/.ssh/github.com
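
For background on why 660 fails while 600 works: to my understanding, OpenSSH rejects a per-user config file with "Bad owner or permissions" when it is owned by another non-root user or is writable by group or others. The sketch below checks only the mode part of that rule. The function name `is_ssh_config_mode_ok` is hypothetical, and it assumes a `stat` that supports `-c '%a'` (GNU coreutils or BusyBox).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the file has no group-write or
# other-write bits set (octal mask 022). Ownership is NOT checked here.
# Assumes `stat -c '%a'` is available (GNU coreutils or BusyBox).
is_ssh_config_mode_ok() {
    mode=$(stat -c '%a' "$1") || return 2
    # Prefix with 0 so the shell reads the mode as an octal constant
    [ $(( 0$mode & 022 )) -eq 0 ]
}

# Example use inside a task pod (not executed here):
#   is_ssh_config_mode_ok /home/tfo-runner/.ssh/config || chmod 600 /home/tfo-runner/.ssh/config
```

With this check, a file at 660 (-rw-rw----) is flagged because the group-write bit is set, while 600 (-rw-------) passes.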