F5Networks / f5-aws-cloudformation

CloudFormation Templates for quickly deploying BIG-IP services in Amazon Web Services EC2

Public EIP Does not get moved to "Active" after fail-over #73

Closed azavrin-myvest closed 5 years ago

azavrin-myvest commented 5 years ago

Description

We deployed the latest 4.2.0 "Clustered BIG-IP VE - 3 NICs Across Availability Zones" CloudFormation template in AWS. Things work well until we try to fail over to the standby unit: the EIPs don't seem to be reassigned to the standby box. LTM messages from the "standby" appliance are attached.

After the failover occurs, I can go in and manually reassign the EIPs to the standby unit, and traffic then flows.

May 29 20:39:15 ip-10-14-54-71 notice tmm1[9551]: 01340011:5: HA unit 1 state change: from 0 to 1.
May 29 20:39:15 ip-10-14-54-71 notice -c [9551]: 01340011:5: HA unit 1 state change: from 0 to 1.
May 29 20:39:16 ip-10-14-54-71.example.com info aws_advanced_failover[17856]: EIP takeover started.
May 29 20:39:17 ip-10-14-54-71.example.com info aws_advanced_failover[17856]: Setting Environmental Variables.
May 29 20:39:17 ip-10-14-54-71.example.com info aws_advanced_failover[17856]: Environmental Variables Set.
May 29 20:39:17 ip-10-14-54-71.example.com notice logger[17869]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Started.
May 29 20:39:18 ip-10-14-54-71.example.com warning httpd[16542]: 0118000a:4: The Service Check Date check was skipped.
May 29 20:39:19 ip-10-14-54-71.example.com info aws_advanced_failover[17856]: AWS secret and access key found, setting environment variables.
May 29 20:39:25 ip-10-14-54-71.example.com info logger[17954]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): endpoint URL: https://ec2.us-west-2.amazonaws.com.
May 29 20:39:25 ip-10-14-54-71.example.com warning httpd[13361]: 0118000a:4: The Service Check Date check was skipped.
May 29 20:39:27 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: sync-only device groups are not expected in the advanced failover. Check device group datasync-device-ip-10-14-4-25.example.com-dg
May 29 20:39:27 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: sync-only device groups are not expected in the advanced failover. Check device group datasync-device-ip-10-14-54-71.example.com-dg
May 29 20:39:27 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: sync-only device groups are not expected in the advanced failover. Check device group datasync-global-dg
May 29 20:39:27 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: sync-only device groups are not expected in the advanced failover. Check device group dos-global-dg
May 29 20:39:28 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.0.100
May 29 20:39:28 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.0.101
May 29 20:39:28 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.0.102
May 29 20:39:28 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.50.100
May 29 20:39:28 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.50.101
May 29 20:39:28 ip-10-14-54-71.example.com err aws_advanced_failover[17856]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.50.102
May 29 20:39:30 ip-10-14-54-71.example.com err logger[18036]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Encountered an error while parsing the network description: Unable to locate ENI for instance i-053e6cae6e5d7920f on subnet subnet-070df35514491d9ca.
May 29 20:39:30 ip-10-14-54-71.example.com err logger[18038]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Check the content and timestamp of the network description file /tmp/aws_failover_traffic-group-1
May 29 20:39:30 ip-10-14-54-71.example.com info logger[18040]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): IP address specified during HA takeover is not assigned to any interface. instance-id: i-053e6cae6e5d7920f. IP address: 10.14.0.100
May 29 20:39:31 ip-10-14-54-71.example.com err logger[18042]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Encountered an error while parsing the network description: Unable to locate ENI for instance i-053e6cae6e5d7920f on subnet subnet-070df35514491d9ca.
May 29 20:39:31 ip-10-14-54-71.example.com err logger[18044]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Check the content and timestamp of the network description file /tmp/aws_failover_traffic-group-1
May 29 20:39:31 ip-10-14-54-71.example.com info logger[18045]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): IP address specified during HA takeover is not assigned to any interface. instance-id: i-053e6cae6e5d7920f. IP address: 10.14.0.101
May 29 20:39:31 ip-10-14-54-71.example.com err logger[18047]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Encountered an error while parsing the network description: Unable to locate ENI for instance i-053e6cae6e5d7920f on subnet subnet-070df35514491d9ca.
May 29 20:39:31 ip-10-14-54-71.example.com err logger[18049]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Check the content and timestamp of the network description file /tmp/aws_failover_traffic-group-1
May 29 20:39:31 ip-10-14-54-71.example.com info logger[18050]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): IP address specified during HA takeover is not assigned to any interface. instance-id: i-053e6cae6e5d7920f. IP address: 10.14.0.102
May 29 20:39:31 ip-10-14-54-71.example.com warning httpd[13361]: 0118000a:4: The Service Check Date check was skipped.
May 29 20:39:32 ip-10-14-54-71.example.com info aws_advanced_failover[17856]: Reassigned EIP 54.214.209.120 to VIP 10.14.50.102 on interface eni-0a5143ea9dc84c1b9
May 29 20:39:32 ip-10-14-54-71.example.com info aws_advanced_failover[17856]: No reconfiguration of AWS routes was requested.
May 29 20:39:32 ip-10-14-54-71.example.com info aws_advanced_failover[17856]: EIP takeover completed.
May 29 20:39:32 ip-10-14-54-71.example.com notice logger[18096]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Completed.
May 29 20:39:38 ip-10-14-54-71.example.com warning httpd[13361]: 0118000a:4: The Service Check Date check was skipped.

Template

We deployed both v4.2.0 and v4.1.4 and saw similar behaviour with each.

Severity Level

For bugs, enter the bug severity level. Do not set any labels.

Severity: <3>

mikeshimkus commented 5 years ago

Hi, you meant to say that the EIP doesn't get assigned to the active unit, correct? They should follow the active device.

The log indicates that one EIP did get reassigned:

May 29 20:39:32 ip-10-14-54-71.example.com info aws_advanced_failover[17856]: Reassigned EIP 54.214.209.120 to VIP 10.14.50.102 on interface eni-0a5143ea9dc84c1b9

Do you see this EIP on the correct interface?
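For longer logs, a small script can pull out every reassignment the failover script reported. This is only a minimal sketch: the regular expression is modelled on the single "Reassigned EIP" message quoted above, and the function name `reassigned_eips` is made up for illustration.

```python
import re

# Pattern modelled on the "Reassigned EIP ..." message format seen in the log above.
REASSIGNED = re.compile(
    r"Reassigned EIP (?P<eip>\d+\.\d+\.\d+\.\d+) "
    r"to VIP (?P<vip>\d+\.\d+\.\d+\.\d+) "
    r"on interface (?P<eni>eni-[0-9a-f]+)"
)


def reassigned_eips(log_text):
    """Return (eip, vip, eni) tuples for every reassignment reported in the log."""
    return [(m["eip"], m["vip"], m["eni"]) for m in REASSIGNED.finditer(log_text)]


sample = (
    "May 29 20:39:32 ip-10-14-54-71.example.com info aws_advanced_failover[17856]: "
    "Reassigned EIP 54.214.209.120 to VIP 10.14.50.102 on interface eni-0a5143ea9dc84c1b9"
)
print(reassigned_eips(sample))
```

Any EIP that is expected to move but is missing from this list is a candidate for closer inspection in the AWS console.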

azavrin-myvest commented 5 years ago

Hello,

Yes, the EIP doesn't get assigned to the "active" unit. This is how I test: I go to the active unit and force it to standby, and I expect to see the EIPs move over. However, when I look at the AWS console after the failover completes, the EIPs remain on LTM-01.

As for your question: yes, the message indicates proper assignment of the IP to the proper interface on LTM-02.

Here are the logs from doing such failover just now:

May 30 13:50:14 ip-10-14-54-71.example.com notice sod[4223]: 010c006d:5: Leaving Standby for Active: Next Active, peers agree on config.
May 30 13:50:14 ip-10-14-54-71.example.com notice sod[4223]: 010c0053:5: Active for traffic group traffic-group-1.
May 30 13:50:14 ip-10-14-54-71.example.com notice sod[4223]: 010c0019:5: Active
May 30 13:50:14 ip-10-14-54-71 notice -c [9551]: 01340011:5: HA unit 1 state change: from 0 to 1.
May 30 13:50:14 ip-10-14-54-71 notice tmm1[9551]: 01340011:5: HA unit 1 state change: from 0 to 1.
May 30 13:50:14 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: EIP takeover started.
May 30 13:50:15 ip-10-14-54-71.example.com notice logger[7010]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Started.
May 30 13:50:15 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: columns: EIP 2, AZ1 3, AZ2 4
May 30 13:50:15 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: IPs from iApp: 34.223.8.188 10.14.0.179 10.14.50.112
May 30 13:50:15 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: instanceId is i-053e6cae6e5d7920f
May 30 13:50:15 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: region is us-west-2
May 30 13:50:15 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: Setting Environmental Variables.
May 30 13:50:15 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: Environmental Variables Set.
May 30 13:50:16 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: AWS secret and access key found, setting environment variables.
May 30 13:50:17 ip-10-14-54-71.example.com warning httpd[19448]: 0118000a:4: The Service Check Date check was skipped.
May 30 13:50:19 ip-10-14-54-71.example.com info logger[7090]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): endpoint URL: https://ec2.us-west-2.amazonaws.com.
May 30 13:50:19 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: networkDescriptionCache is at /tmp/tmp3r8ouw
May 30 13:50:22 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: sync-only device groups are not expected in the advanced failover. Check device group datasync-device-ip-10-14-4-25.example.com-dg
May 30 13:50:22 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: sync-only device groups are not expected in the advanced failover. Check device group datasync-device-ip-10-14-54-71.example.com-dg
May 30 13:50:22 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: sync-only device groups are not expected in the advanced failover. Check device group datasync-global-dg
May 30 13:50:22 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: sync-only device groups are not expected in the advanced failover. Check device group dos-global-dg
May 30 13:50:23 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.0.100
May 30 13:50:23 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.0.101
May 30 13:50:23 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.0.102
May 30 13:50:23 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.50.100
May 30 13:50:23 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.50.101
May 30 13:50:23 ip-10-14-54-71.example.com err aws_advanced_failover[6999]: No virtual addresses with mask 255.255.255.255 are expected out of traffic group "none", when Elastic IP Mappings section is configured. Check virtual-address 10.14.50.102
May 30 13:50:23 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: size of EIP mapping array is 1
May 30 13:50:23 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: considering row 0, EIP 34.223.8.188, VIP1 10.14.0.179, VIP2 10.14.50.112
May 30 13:50:24 ip-10-14-54-71.example.com warning httpd[10806]: 0118000a:4: The Service Check Date check was skipped.
May 30 13:50:24 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: EIP 34.223.8.188, chosen AZ 2
May 30 13:50:24 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: reassigning EIP 34.223.8.188, row 0
May 30 13:50:24 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: VIP is 10.14.50.112
May 30 13:50:24 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: ENI is eni-0a5143ea9dc84c1b9
May 30 13:50:26 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: addressDescription is ADDRESSES eipalloc-09ae5493fe5adbbb3 eipassoc-06c282dbebe327e73 vpc i-0042047da3928d9bf eni-0842e9c049fbfc499 643798970730 10.14.0.179 34.223.8.188
May 30 13:50:26 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: EIPAllocationId for EIP 34.223.8.188 is eipalloc-09ae5493fe5adbbb3
May 30 13:50:27 ip-10-14-54-71.example.com err logger[7195]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Encountered an error while parsing the network description: Unable to locate ENI for instance i-053e6cae6e5d7920f on subnet subnet-070df35514491d9ca.
May 30 13:50:27 ip-10-14-54-71.example.com err logger[7197]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Check the content and timestamp of the network description file /tmp/aws_failover_traffic-group-1
May 30 13:50:27 ip-10-14-54-71.example.com info logger[7198]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): IP address specified during HA takeover is not assigned to any interface. instance-id: i-053e6cae6e5d7920f. IP address: 10.14.0.100
May 30 13:50:27 ip-10-14-54-71.example.com err logger[7200]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Encountered an error while parsing the network description: Unable to locate ENI for instance i-053e6cae6e5d7920f on subnet subnet-070df35514491d9ca.
May 30 13:50:27 ip-10-14-54-71.example.com err logger[7202]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Check the content and timestamp of the network description file /tmp/aws_failover_traffic-group-1
May 30 13:50:27 ip-10-14-54-71.example.com info logger[7203]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): IP address specified during HA takeover is not assigned to any interface. instance-id: i-053e6cae6e5d7920f. IP address: 10.14.0.101
May 30 13:50:27 ip-10-14-54-71.example.com err logger[7206]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Encountered an error while parsing the network description: Unable to locate ENI for instance i-053e6cae6e5d7920f on subnet subnet-070df35514491d9ca.
May 30 13:50:27 ip-10-14-54-71.example.com err logger[7208]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Check the content and timestamp of the network description file /tmp/aws_failover_traffic-group-1
May 30 13:50:27 ip-10-14-54-71.example.com info logger[7209]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): IP address specified during HA takeover is not assigned to any interface. instance-id: i-053e6cae6e5d7920f. IP address: 10.14.0.102
May 30 13:50:28 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: ec2AssociateAddress {
May 30 13:50:28 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: Reassigned EIP 34.223.8.188 to VIP 10.14.50.112 on interface eni-0a5143ea9dc84c1b9
May 30 13:50:28 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: No reconfiguration of AWS routes was requested.
May 30 13:50:28 ip-10-14-54-71.example.com info aws_advanced_failover[6999]: EIP takeover completed.
May 30 13:50:29 ip-10-14-54-71.example.com notice logger[7248]: /usr/libexec/aws/aws-failover-tgactive.sh (traffic-group-1): Completed.

mikeshimkus commented 5 years ago

Per the solution deployment guide: https://www.f5.com/pdf/deployment-guides/f5-aws-ha-dg.pdf, you can ignore the errors from /usr/libexec/aws/aws-failover-tgactive.sh.

I'm guessing the virtual addresses in the errors from aws_advanced_failover (10.14.0.100, etc) are the ones that aren't moving. Are those virtual addresses configured in traffic-group-1 or traffic group "none"?

azavrin-myvest commented 5 years ago

Yes, I saw the article. traffic-group-1 contains the following failover objects; there is no traffic group "none":

`
Name                  Address       Type                  Partition / Path
10.14.0.100           10.14.0.100   Virtual Address       Common
10.14.0.101           10.14.0.101   Virtual Address       Common
10.14.0.102           10.14.0.102   Virtual Address       Common
10.14.50.100          10.14.50.100  Virtual Address       Common
10.14.50.101          10.14.50.101  Virtual Address       Common
10.14.50.102          10.14.50.102  Virtual Address       Common
HA_Across_AZs                       Application Instance  Common/HA_Across_AZs.app
test.example.com-ha                 Application Instance  Common/test.example.com-ha.app
Passports-AZ                        Application Instance  Common/Passports-AZ.app
1test.example.com-ha                Application Instance  Common/1test.example.com-ha.app
`

These are internal/private IPs: 10.14.0.x belong to AZ1 and are assigned to LTM-01, and 10.14.50.x belong to AZ2 and are assigned to LTM-02.

Private IPs don't move around during failover; that's why the pools contain members from both zones.

The only IPs that get moved are the public EIPs. So if I have the following config on the active LTM-01: 10.14.0.100 with EIP 3.54.32.90 (made-up IP)

After failover I expect to see, on LTM-02 in AZ2: 10.14.50.100 with EIP 3.54.32.90

However, 3.54.32.90 doesn't move over and remains assigned to LTM-01 (which has now become standby).

mikeshimkus commented 5 years ago

Right, the failover script uses the mappings of EIP to a pair of internal virtual servers for AZ1/AZ2 to determine who is the active device for that EIP. So the internal virtual addresses should be in traffic group "none". From the logs it looks like it is not expecting to see them in traffic-group-1.

I assume you have a mapping for 3.54.32.90 configured in the iApp?
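To make the mapping idea concrete, here is a rough sketch of how one row of the iApp's Elastic IP mapping (one EIP plus one virtual address per AZ) could be resolved to the address owned by the device that just became active. This is not the actual `aws_advanced_failover` code; how the real script chooses the AZ is not shown in this thread. The type `EipMapping`, the function `vip_for_local_device`, and the `local_ips` set are hypothetical, and the sample values come from the "considering row 0" log line above.

```python
from typing import NamedTuple, Optional, Set


class EipMapping(NamedTuple):
    """One mapping row: a public EIP and its private virtual address in each AZ."""
    eip: str
    vip_az1: str
    vip_az2: str


def vip_for_local_device(mapping: EipMapping, local_private_ips: Set[str]) -> Optional[str]:
    """Return the mapped virtual address that lives on this device, or None.

    Assumption: a device "owns" a virtual address when it appears among the
    private IPs assigned to the device's own network interfaces.
    """
    for vip in (mapping.vip_az1, mapping.vip_az2):
        if vip in local_private_ips:
            return vip
    return None


row = EipMapping(eip="34.223.8.188", vip_az1="10.14.0.179", vip_az2="10.14.50.112")
local_ips = {"10.14.50.112"}  # hypothetical: private IPs on the AZ2 device
print(vip_for_local_device(row, local_ips))
```

In this model an EIP only follows the active device if it has a mapping row; the script reported "size of EIP mapping array is 1" in the log above, so only one EIP was being considered.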

azavrin-myvest commented 5 years ago

Hmm, yes, the iApp is all configured and happy in that regard.

If the script does not expect any failover objects to be in traffic-group-1, should I go to each VS and assign it to traffic group "none" instead of traffic-group-1 (floating)?

mikeshimkus commented 5 years ago

Yes, assign those virtual addresses to "None".
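If there are many virtual addresses, the change can be scripted instead of clicked through. The sketch below only generates the commands as text so they can be reviewed first. It assumes the `tmsh modify ltm virtual-address <name> traffic-group none` syntax and that the objects are named after their addresses, as in the list earlier in this thread; please verify both against the tmsh reference for your BIG-IP version before running anything. The helper name `tmsh_commands` is made up.

```python
# Virtual addresses listed under traffic-group-1 earlier in this thread.
VIRTUAL_ADDRESSES = [
    "10.14.0.100", "10.14.0.101", "10.14.0.102",
    "10.14.50.100", "10.14.50.101", "10.14.50.102",
]


def tmsh_commands(addresses, traffic_group="none"):
    """Build one tmsh command per virtual address (assumed syntax, review before use)."""
    return [
        f"tmsh modify ltm virtual-address {address} traffic-group {traffic_group}"
        for address in addresses
    ]


for command in tmsh_commands(VIRTUAL_ADDRESSES):
    print(command)
```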

azavrin-myvest commented 5 years ago

Setting the virtual addresses to traffic group "none" did the trick. Thanks much for your guidance!