null_resource.etcd_secrets: Still creating...

rothgar commented 6 years ago

Is this a BUG REPORT or FEATURE REQUEST?

Choose one: BUG REPORT

Versions

Tectonic version: 1.7.5-tectonic.1 (1cccbee)
Terraform version (terraform version):
Platform (aws|azure|openstack|metal): metal

What happened?

While going through the provisioning steps, the final step shows the message: null_resource.etcd_secrets: Still creating... It has been repeating for 20 minutes and never appears to finish.

What you expected to happen?

Provisioning steps would continue.

How to reproduce it (as minimally and precisely as possible)?

So far I've gone through the installer twice with the same results.

Anything else we need to know?

I'm trying to provision 1 master and 1 worker

Log file attached: tectonic-home.log

coresolve commented 6 years ago

Hey @rothgar. With the bare metal install, we get to this step when Terraform is attempting to communicate with the masters.

During this step we copy in the TLS assets needed to form the TLS-secured etcd cluster. Since these are secrets, we don't place them on matchbox. We expect Terraform to be able to scp the secrets to the masters directly using keys in your ssh-agent.

Usually, if this is timing out at this stage, it's either because the machines themselves didn't boot successfully or because the SSH key you specified in the terraform.tfvars file is not loaded in your agent.

To validate that the key is loaded in your ssh-agent, you can run:

ssh-add -L
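
A slightly fuller check can rule out both causes at once. The following is only a sketch, not something the installer runs: `ssh-add -L` lists the public keys the agent holds, and `ssh -o BatchMode=yes` fails instead of prompting for a password. The function names are made up for illustration, and the hostname in the usage comment is a placeholder for your own master.

```shell
#!/bin/sh
# Sketch only. Function names are hypothetical; "core" is the login user
# mentioned in this thread.

# Succeeds if ssh-agent is reachable and holds at least one identity.
# ssh-add -L exits non-zero when the agent is empty or cannot be contacted.
agent_has_keys() {
    ssh-add -L >/dev/null 2>&1
}

# Succeeds if key-based login to the given host works without any prompt.
# BatchMode=yes makes ssh fail rather than ask for a password; it can also
# fail if the host key is not yet in known_hosts.
can_ssh_with_key() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "core@$1" true
}

# Usage (not run automatically; replace the hostname with your master):
#   agent_has_keys || echo "no keys loaded; run ssh-add with your private key"
#   can_ssh_with_key controller-1.mydomain.local || echo "key login failed"
```

If the first check fails, the key is not in the agent. If only the second fails, the machine is either not up yet or does not accept that key.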

rothgar commented 6 years ago

I hadn't provisioned the master yet. Is it expected that the master has PXE booted and is waiting to install, or should it already have CoreOS installed? I may have missed where that was stated in the installer.

coresolve commented 6 years ago

As part of the provisioning process we configure matchbox, which has what's needed to install and configure Container Linux on your servers. More info here:

https://coreos.com/tectonic/docs/latest/install/bare-metal/

ahjohannessen commented 6 years ago

Got the same problem. matchbox / dnsmasq work and I did ssh-add -L. Seems like something is borked with remote.tf, as /etc/ssl/etcd is never created on the master and kubeconfig is never copied to the worker.

mcamen commented 6 years ago

Same here. Any chance I can get more logs to see what is going on? ./tectonic-installer/linux/installer -log-level debug -logtostderr does not give any hints.

ahjohannessen commented 6 years ago

@mcamen Logs from terraform do not report any error. Seems like there is some condition on count or similar that prevents those assets from being copied via ssh. The code around that seems to be defined in remote.tf.

mcamen commented 6 years ago

Ok, at least for me everything works fine now. There was a configuration issue with the DHCP server which prevented a successful CoreOS installation. Sorry for the noise. :(

mwthink commented 6 years ago

@mcamen What was your DHCP issue? I'm experiencing this same issue running the tectonic_1.8.4-tectonic.2 installer for Darwin (OSX).

Matchbox logs indicate machines booting and installing successfully. On their initial boot, I'm unable to ssh to the controller via ssh core@controller-1.mydomain.local. I connect fine, but it's asking for a password rather than using the SSH key for some reason. null_resource.etcd_secrets is just sitting at "Still creating... (27m50s elapsed)"

What platform are y'all experiencing this on? Would like to know if it's an issue with Darwin and/or Linux. Looking through some of the terraform files to better diagnose the issue.

mcamen commented 6 years ago

@mwthink As far as I can remember, the DHCP server returned the wrong search domain and DNS servers. In the end the /etc/resolv.conf of the CoreOS boxes was wrong.
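
To check for this on a booted node, something along these lines shows what DHCP actually handed out. It is a minimal sketch; `resolv_summary` is a hypothetical helper name, and the values you should expect are specific to your network.

```shell
#!/bin/sh
# Sketch only: print the nameserver and search/domain lines of a
# resolv.conf-style file, defaulting to /etc/resolv.conf.
resolv_summary() {
    grep -E '^(nameserver|search|domain)[[:space:]]' "${1:-/etc/resolv.conf}"
}

# Usage (run on the CoreOS box, then compare against what your DHCP
# server is supposed to hand out):
#   resolv_summary
```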

levensailor commented 6 years ago

I'm having this issue. 5th try.

I have dnsmasq and matchbox running on the same server and I have the installer running on my local MacBook. I generated the keys using the script and passed them in the installer GUI. I created a public key, attached it to ssh-agent, and I can ssh to the controller with ssh core@controller.mydomain.com with no password needed.

Can someone explain exactly what the installer is doing? I'm thinking it's using my SSH public key to securely scp to the controller and drop in the private keys from the earlier step. Is that correct? Can I manually scp the keys over somehow? Do I need to keep recreating the VMs to PXE boot, or can I leave them as is to pick up where the installer left off?

bsctl commented 6 years ago

@levensailor @mwthink I'm having the same issue here. Both for Darwin (Tectonic GUI) and CentOS Linux (Terraform CLI). Any advice to solve it? Thanks

ashak commented 5 years ago

A bit late to the party, but I've just run into this too.

For me, the issue was that the machine on which I was running the installer was not talking to dnsmasq for DNS, therefore it couldn't resolve my hosts to ssh to them to complete the work. I hacked them into /etc/hosts to test and a few seconds later the installer continued.

I was running the installer on my laptop. The rest of the machines are within their own subnet which my laptop can communicate with but wasn't on directly.
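
For anyone who wants to check this before waiting on the installer, here is a rough sketch of the test and the /etc/hosts workaround. It assumes `getent` is available (typical on Linux, not on macOS). The function names, the `worker-1` hostname and the 192.0.2.x addresses are placeholders, not values from my setup.

```shell
#!/bin/sh
# Sketch only. Function names, hostnames and addresses are hypothetical.

# Succeeds if the name resolves through the system resolver
# (DNS or /etc/hosts). Requires getent.
resolves() {
    getent hosts "$1" >/dev/null 2>&1
}

# Print an /etc/hosts-style line for an address and a name.
hosts_line() {
    printf '%s\t%s\n' "$1" "$2"
}

# Usage (run on the machine that runs the installer):
#   for h in controller-1.mydomain.local worker-1.mydomain.local; do
#       resolves "$h" || echo "cannot resolve $h"
#   done
#   hosts_line 192.0.2.10 controller-1.mydomain.local | sudo tee -a /etc/hosts
```

Pointing the installer machine at dnsmasq for DNS is the cleaner fix. The /etc/hosts entries are only a quick way to confirm the diagnosis.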
