IBM / packer-plugin-ibmcloud

Automate customizing Image Template
Apache License 2.0
24 stars 24 forks source link

Invalid error handling after timeout #39

Closed jowko closed 3 years ago

jowko commented 3 years ago

I tried to build more advanced image using ansible which requires a lot of time to complete.

Environment: Ubuntu 20.04.2 LTS (Focal Fossa) installed on WSL2 in Windows 10 Pro 19042. I have the newest version of IBM Cloud plugin.

My settings:

source "ibmcloud-vpc" "centos" {
  api_key = "${var.ibm_api_key}"
  region  = "eu-de"

  subnet_id = "my-subnet"
  resource_group_id = "my-resource"

  vsi_base_image_name = "ibm-centos-7-9-minimal-amd64-3"

  vsi_profile   = "bx2-8x32"
  vsi_interface = "public"

  image_name = "${var.product_id}-${local.timestamp}"

  communicator = "ssh"
  ssh_username = "root"
  ssh_port     = 22
  ssh_timeout  = "15m"

  timeout = "30m"
}

build {
  sources = [
    "source.amazon-ebs.amazon",
    "source.ibmcloud-vpc.centos"
  ]

  provisioner "shell-local" {
    environment_vars = ["INSTANCE_IP=${build.Host}"]
    inline           = ["echo $${INSTANCE_IP}", "echo '${build.SSHPrivateKey}'  > /tmp/packer-session.pem"]
  }

  provisioner "ansible" {
    playbook_file = "./playbook.yml"
    # verbose logging
    extra_arguments = ["-vvvv"]
  }

}

I run:

packer build -only=ibmcloud-vpc.centos packer.pkr.hcl

It took about 45 minutes to execute all steps from Ansible playbook. Then VM in IBM Cloud was stopped and custom image appeared in state pending. IBM Cloud plugin waited for custom image:

==> ibmcloud-vpc.centos: Waiting for the Image to become AVAILABLE...
==> ibmcloud-vpc.centos: .
==> ibmcloud-vpc.centos: .
==> ibmcloud-vpc.centos: .
==> ibmcloud-vpc.centos: .
==> ibmcloud-vpc.centos: .
==> ibmcloud-vpc.centos: Waiting time: 1 minutes
...
==> ibmcloud-vpc.centos: Waiting time: 14 minutes
Build 'ibmcloud-vpc.centos' errored after 1 hour 6 seconds: unexpected EOF

There is no crash log available. Also resources such like stopped VM instance or ssh keys are still there, so there was no cleanup phase. @juanpinzon I suspect that ssh_timeout option could cause this error because it was set to 15 minutes as in examples. Custom image became Available about 50 minutes after import which is not surprising since it has more than 40 GB.

Is such behaviour expected? Also, could we improve logging and error handling here?

juanpinzon commented 3 years ago

Could you please follow the example provided in the examples folder. There is a file called build.vpc.centos-ansible.pkr.hcl which you can follow as template to run Ansible. I have run 5 tests today and did not replicate your issue nor have any other problem.

ssh_timeout is used by packer as the time to wait for SSH to become available. https://www.packer.io/docs/communicators/ssh#ssh_timeout. So it is another thing.

timeout is the one you need to adjust according with your requirements. This is something each user have to set. Remember, we only provide a template but each scenario is different and need to be customized by the user.

About the logging, it is something handle directly by HashiCorp Packer. Packer Plugin developers cannot control logging, we only add new components to Packer, such as builders. https://www.packer.io/docs/plugins.

We handle the errors the best we could, however we understand there is always room to improve. Thus, since this is an open source project, any contribution is appreciated and welcome.

jowko commented 3 years ago

I run examples before and they work correctly. But the problem with these examples is that they are very simple and finish in a few minutes. I tried to build VM with real application inside and it takes much more time than these examples. I increased ssh_timeout and timeout to 90 minutes, but the same error occurs after 60 minutes:

19:13:07  ==> ibmcloud-vpc.centos: Waiting time: 24 minutes
19:13:17  ==> ibmcloud-vpc.centos: .
19:13:30  ==> ibmcloud-vpc.centos: .
19:13:40  ==> ibmcloud-vpc.centos: .
19:13:50  ==> ibmcloud-vpc.centos: .
19:13:50  Build 'ibmcloud-vpc.centos' errored after 1 hour 4 seconds: unexpected EOF

So it looks like there is some thing in this plugin or packer itself, which causes error exactly after 1h.

juanpinzon commented 3 years ago

Hi @jowko, could you please share your playbook, so I can test it on my end. Thanks

jowko commented 3 years ago

I will share part of my playbook, but I cannot share it fully because there are steps where we install our application using docker images (about 9GB of images from private repositories with code of our application). This phase is the most time consuming, so you need to add some additional time consuming steps to reach more than 1 hour of build time (maybe download a lot of stuff from github or dockerhub?). ansible.zip

juanpinzon commented 3 years ago

Hi @jowko I run some tests and I did not get any errors.

I think the problem is related with the SSH_keys from issue #38. Please check your packerlog/packerlog.txt and you will see that even though the playbook seems to be running, there is no SSH connection between Ansible and the VSI. When you increase ssh_timeout you only increase the time for this connection to become available, however it never will happen, and eventually the connection will throw an error. Without SSH connection Ansible is

Please be aware that in IBM infra SSH Keys MUST be located/config at

export PRIVATE_KEY="$HOME/.ssh/id_rsa"
export PUBLIC_KEY="$HOME/.ssh/id_rsa.pub"

So, at your .env file set this variables as stated above.

Likewise, please follow the sample to config Ansible on a Packer Template

  provisioner "ansible" {
    playbook_file = "provisioner/centos-playbook.yml"
    inventory_file = "${var.ansible_inventory_file}"
    ssh_host_key_file = "${var.private_key_file}"
    ssh_authorized_key_file = "${var.public_key_file}"
    extra_arguments = [
      "-vvvv",
      "--extra-vars",
      "ansible_user=root --private-key=${var.private_key_file}"
    ]
  }

At this point I am not able to replicate your error.

Please adjust your end with the information I am providing you and run packer one more time.

Best Regards, Juan

jowko commented 3 years ago

Hi @juanpinzon Your playbook run for more than 1 hour? Because my example is not sufficient as I explained in previous comments, because the most time consuming parts of our playbook relates to our internal infrastructure which will not work outside our network.

Please be aware that in IBM infra SSH Keys MUST be located/config at

As I stated in this task and in #38 it is not true in my case.

I performed tests using build.vpc.centos-ansible.pkr.hcl. I removed almost all code from ansible provisioner except playbook_file. SSH env variables were not pointing to the location you described but to the empty files in the same directory. Custom image is built successfully. When I run this custom image in my cloud and connect to it using ssh, I verified that both nginx binaries and service is present (it is installed in example playbook).

When I originally run my own built, ansible executed correctly all steps, so there was no problem with ssh connection. Here is full log from CI/CD build: packer.txt After some time custom image was available and VM instance created from it was fully working and contained changes made by Ansible. I run such build a few times and it always ends unexpectedly after 1 hour - even if previous steps executed a little faster. So it suggest that there is some timeout for total build time and not for specific step.

When I configured it as you suggested above in my linux CI environment:

  provisioner "ansible" {
    playbook_file           = "./playbook.yml"
    inventory_file          = "${var.ansible_inventory_file}"
    ssh_host_key_file       = "${var.private_key_file}"
    ssh_authorized_key_file = "${var.public_key_file}"
    extra_arguments = [
      "-vvvv",
      "--extra-vars",
      "ansible_user=root --private-key=${var.private_key_file}"
    ]
  }

with envs:

        PRIVATE_KEY="$HOME/.ssh/id_rsa"
        PUBLIC_KEY="$HOME/.ssh/id_rsa.pub"
        ANSIBLE_INVENTORY_FILE="provisioner/hosts"
        PACKER_LOG="1"
        PACKER_LOG_PATH="packerlog.txt"

I got errors related to ssh connection: errors.txt

jowko commented 3 years ago

I run again my own build (same as in first post, without additional environment variables listed by you) but with adding extra PACKER_LOG and PACKER_LOG_PATH environment variables. In packerlog there is log for this error:

packer-plugin-ibmcloud_v2.0.1_x5.0_linux_amd64 plugin: 2021/08/24 11:57:06 Checking resource state... (attempt: 144)
packer-plugin-ibmcloud_v2.0.1_x5.0_linux_amd64 plugin: 2021/08/24 11:57:06 Response Status -  401
packer-plugin-ibmcloud_v2.0.1_x5.0_linux_amd64 plugin: 2021/08/24 11:57:06 Response Body from ibmcloud-  {"errors":[{"code":"token_invalid","message":"The service token was expired or invalid. (Token is expired)","more_info":"http://cloud.ibm.com/docs/vpc-on-classic?topic=vpc-on-classic-rias-error-messages#token_invalid","target":{"name":"","type":""}}],"trace":"e6537865577993fbbd8fef93130d38e1"}
packer-plugin-ibmcloud_v2.0.1_x5.0_linux_amd64 plugin: panic: interface conversion: interface {} is nil, not string

So probably service token needs to be automatically refreshed in long builds because if expires exactly in one hour. Full log containing both data printed to console and from packerlog: full-log.txt

jowko commented 3 years ago

I was able to prepare artificial playbook which recreates this error. playbook.zip It just downloads a lot of different stuff which takes time and increases the size of result image. If your build will last less than 1 hour, just add additional heavy entries to download. Full logs: artificial-build-error.txt Of course, after these errors, nothing is cleaned from IBM, so VM, SG, ssh keys etc needs to be removed manually.

juanpinzon commented 3 years ago

@jowko Good catch, definitely every day I learn something new. Thanks for finding the bug. Could you please share to me the exact packer Template you are using (without additional environment variables listed by me), so I can recreate same environment. Thanks

jowko commented 3 years ago

I executed both packers in Jenkins pipeline which runs on linux agents. Original packer files: original.zip This will not reproduce issue since I removed a lot of operations which will not work outside of our infrastructure. But this can show that the locations of ssh keys don't need to be in the location provided by you as I described in issue #38

Second, artificial packer data was prepared by me to show that this issue is reproducible: artificial.zip You can run it on own IBM Cloud. But if network in your cluster will be faster and build finishes below 1 hour, just add another download to playbook to extend it's build time.

I replaced my ibm resources ids with placeholders.

juanpinzon commented 3 years ago

@jowko check new release v2.0.2 that solves your issue

jowko commented 3 years ago

@juanpinzon I verified that this works correctly. Thanks.