Deploy of ES 7.8.1 cluster in AWS cluster bootstrap fails and other issues

AAber commented 4 years ago

Hi guys, I've been using your work for a few years now and it works well for me, so thanks for the good job. I'm testing the deploy of ES latest 7.8.1 in AWS, 3 masters, 2 clients, 3 datas. The bootstrap of the cluster runs in an endless loop and never completes. Below I attached the log files of the bootstrap server, and one of the master nodes. I noticed that packer packs an AMI that is based on Ubuntu image that is meant for EKS, is this intentional? This image has Kubelet and other K8s stuff on it and it doesn't have vi, it would be nice if you added vi to the image :) I also posted the user-data.log and elaticsearch.yml files from the bootstrap server and one of the master servers. In the user-data files I saw a few errors that seem like bugs.

Any tips on how to debug the bootstrap process are welcome. Let me know if you want me to post any other files to help debug this issue.

Thanks,

Asher

boot_elasticsearch_yml.txt boot_st7-es-cluster.log boot_user-data.log

master_elasticsearch_yml.txt master_st7-es-cluster.log master_user-data.log

tomer-yoskovich commented 4 years ago

Hi Asher,

I couldn't reproduce it in a similar deployment Can you please attach your terraform.tfvars (+ variables.tf if you've made changes to the default values)

AAber commented 4 years ago

Hi Tomer, Attached my variables.tf. variables-tf.txt I can't find a file named terraform.tfvars in my setup, is this file mandatory? where is it located?

Thx.

Asher

AAber commented 3 years ago

I posted the log of only one of the masters the other masters have similar logs

AAber commented 3 years ago

Update: I pulled the latest code and created a cluster. In my case I needed to test AWS Graviton (ARM64) instances instead of the older (X86_64) instance types. I also configured a Network load balancer instead of the expensive ALB. The script rendering in broken so I pushed an updated autoattach-disk.sh script to the elastic-search AMI I launches a new cluster and the cluster works. I would be happy to work together to apply the changes to the upstream.

Thanks.

Bamieh commented 3 years ago

@AAber can you share the updated autoattach-disk.sh? A comment in this issue should be enough for those facing the same issue.

Some EC2 instances have different disks configuration so the script only works for an array of instance types (might be cool if we have a matrix of supported instance types in the future).

In my case I reverted to using c5 instances since they work with the current autoattach-disk.sh script

AAber commented 3 years ago

The main issue with the autoattach-disk.sh script was that when the root disk and storage disk are both of Nvme type the line:

export DEVICE_NAME=$(lsblk -ip | tail -n +2 | awk '{print $1 " " ($7? "MOUNTEDPART" : "") }' | sed ':a;N;$!ba;s/\n`/ /g' | grep -v MOUNTEDPART)

Would catch two device names instead of the one storage device name. The script expects a single device name of the storage device. In my script I added "tail -1" to the line to single out the storage disk device:

export DEVICE_NAME=$(lsblk -ip | tail -n +2 | awk '{print $1 " " ($7? "MOUNTEDPART" : "") }' | sed ':a;N;$!ba;s/\n`/ /g' | grep -v MOUNTEDPART|sort|tail -1)

This is a copy of my script: autoattach-disk-sh.txt

BigDataBoutique / elasticsearch-cloud-deploy

Deploy of ES 7.8.1 cluster in AWS cluster bootstrap fails and other issues #94