Open DahlPatric opened 8 months ago
We have the same issue. It appeared on version 16.1.4.2 with CFE 2.0.2. We ran many tests with F5 support* and reverted CFE to version 1.15 (which still works). We're waiting for an update. Our F5 maintainer handled the case with F5; I don't have the ticket number.
*For instance, we tried modifying these DB variables (default value 60), as they thought it was related to https://my.f5.com/manage/s/article/K000136003
(F5-AZURE-02)(cfg-sync In Sync)(Standby)(/Common)(tmos)# modify sys db icrd.timeout value 300
(F5-AZURE-02)(cfg-sync In Sync)(Standby)(/Common)(tmos)# modify sys db restjavad.timeout value 300
(F5-AZURE-02)(cfg-sync In Sync)(Standby)(/Common)(tmos)# modify sys db restnoded.timeout value 300
EDIT: note that this issue breaks CFE completely: declare no longer works, and neither do failover tests (dry-run) or real failovers.
Confirming on BIG-IP 16.1.4.2 Build 0.0.3 Point Release 2: neither CFE 1.5 nor 2.0 seems to work. When requesting any data from the CFE API endpoint we see [f5-cloud-failover] Status: Error getting instance metadata connect ECONNREFUSED 169.254.169.254:80 in the logs; we are not sure whether it is related.
@DahlPatric You need to verify you can query the Azure metadata service from both devices: https://learn.microsoft.com/en-us/azure/virtual-machines/instance-metadata-service?tabs=linux#access-azure-instance-metadata-service
This is a prerequisite of CFE.
@CTV-2023 Sounds like you have a different issue since v1.15.0 works. Can you open a new GitHub issue and provide the case info?
@mikeshimkus it might not be needed, it could be related to the storage account. F5 support provided me this information
The engineering team have reviewed the data and requested if you can login to azure portal and check under storage account 'XXXXX', do they have a container named 'f5cloudfailover' and also make sure this is the first one on the list of containers. If you can share the screenshot of this. The reason it should be first on the list is due to the fact that the code of CFE 2.0.2, it looks for the first container and if the f5cloudfailover is not first then we will get an error. If this is not the first container then if you can make sure it is moved up the list and test again.
And in our case we have "boot diagnostics" containers, which might have been created by Azure when we deployed the VMs 2 years ago. I don't know how I can move the container, Azure doesn't seem to provide this option, but I might be able to delete the 2 other containers and try again (waiting for an update)
Maybe @DahlPatric can check his Azure config too
Aha, yes in fact you should have a dedicated storage account for CFE. We don't call that out in the documentation but I will add a task to do just that.
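To illustrate the behavior F5 support described, here is a minimal Python sketch. It is not CFE's actual code; the function names and the boot diagnostics container name are hypothetical, and only `f5cloudfailover` comes from this thread. It contrasts picking the first container in a listing with looking the container up by name.

```python
from typing import List, Optional

CFE_CONTAINER = "f5cloudfailover"  # container name quoted by F5 support in this thread


def pick_first(containers: List[str]) -> Optional[str]:
    """Fragile: assumes the CFE container is the first one in the listing."""
    return containers[0] if containers else None


def pick_by_name(containers: List[str], name: str = CFE_CONTAINER) -> Optional[str]:
    """Robust: finds the container regardless of its position."""
    return name if name in containers else None


# Hypothetical listing for a shared storage account; the boot diagnostics
# container name is made up for illustration.
shared_account = ["bootdiagnostics-vm1-0000", CFE_CONTAINER]

print(pick_first(shared_account))    # bootdiagnostics-vm1-0000
print(pick_by_name(shared_account))  # f5cloudfailover
```

With a dedicated storage account that holds only `f5cloudfailover`, both functions return the same container, which is consistent with the advice to give CFE its own storage account.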
Yes, the storage account (stcfe) that the ARM template created has a container called f5cloudfailover.
Changed the value to the following:
"scopingName": "f5cloudfailover",
Pushed the configuration again, but got the same "500 AsyncContext timeout" error on CFE 1.5. It doesn't work on CFE 2.0 either, but with a different error:
{
"message": "Error getting instance metadata undefined -> Also see cloud docs link for more help: https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/troubleshooting.html"
}
Executed /mgmt/shared/cloud-failover/inspect
{
"message": "Failover initialization failed: Error getting instance metadata undefined"
}
Could it be the combination with version 16.1.4.2 that causes these issues, and not actually CFE?
@DahlPatric The scoping name needs to match the name of the storage account, not the container.
We have tested CFE with 16.1.4.2 in the US regions, so it is not a problem with the VE version. Can you query the Azure instance metadata service from the VE (outside of CFE)? curl -s -H Metadata:true --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2021-02-01" | jq
I suspect you cannot, so you need to resolve that before CFE can work.
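For anyone who prefers a scripted check, the following is a rough Python equivalent of the curl test, assuming Python 3 is available on the machine you run it from. The function names and the "refused"/"timeout"/"other" labels are illustrative, not part of CFE or Azure. It sends the required `Metadata: true` header, bypasses any proxy, and distinguishes a refused connection (like the ECONNREFUSED seen in the CFE logs earlier in this thread) from a request that times out.

```python
import json
import socket
import urllib.error
import urllib.request

IMDS_URL = "http://169.254.169.254/metadata/instance?api-version=2021-02-01"


def build_request(url: str = IMDS_URL) -> urllib.request.Request:
    """Build the IMDS request; Azure requires the 'Metadata: true' header."""
    return urllib.request.Request(url, headers={"Metadata": "true"})


def classify_failure(exc: BaseException) -> str:
    """Map a failure to a rough diagnosis (the labels are illustrative)."""
    reason = getattr(exc, "reason", exc)  # URLError wraps the socket error
    if isinstance(reason, ConnectionRefusedError):
        return "refused"
    if isinstance(reason, (socket.timeout, TimeoutError)):
        return "timeout"
    return "other"


def check_imds(timeout: float = 5.0) -> str:
    """Return 'ok' if IMDS answers with JSON, otherwise a failure label."""
    # An empty ProxyHandler bypasses any configured proxy, like curl --noproxy "*".
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(build_request(), timeout=timeout) as resp:
            json.load(resp)
            return "ok"
    except (urllib.error.URLError, OSError, ValueError) as exc:
        return classify_failure(exc)
```

Calling `check_imds()` on each device and comparing the results should show whether both instances can reach the metadata service, which is the prerequisite mentioned above.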
@mikeshimkus, when you say from the VE, what do you mean? I assume it's not from the F5 CLI?
Using curl from the shell on the BIG-IP instances.
No response from the IP. I could see the traffic exiting from the Self IP.
01:57:43.919215 IP (tos 0x0, ttl 64, id 52923, offset 0, flags [DF], proto TCP (6), length 60) 10.45.140.11.13507 > 169.254.169.254.http: Flags [S], cksum 0xea63 (incorrect -> 0x4740), seq 3185618879, win 29200, options [mss 1460,sackOK,TS val 408480110 ecr 0,nop,wscale 7], length 0 out slot1/tmm0 lis= port=1.1 trunk=
01:57:44.921287 IP (tos 0x0, ttl 64, id 52924, offset 0, flags [DF], proto TCP (6), length 60) 10.45.140.11.13507 > 169.254.169.254.http: Flags [S], cksum 0xea63 (incorrect -> 0x4356), seq 3185618879, win 29200, options [mss 1460,sackOK,TS val 408481112 ecr 0,nop,wscale 7], length 0 out slot1/tmm0 lis= port=1.1 trunk=
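The capture above shows the same SYN (identical sequence number) sent twice about one second apart, which suggests a retransmission because no SYN-ACK came back. A small Python sketch for spotting this pattern in longer captures follows; the helper names are hypothetical and the regular expression assumes tcpdump output in the format shown here. The sample lines are abridged from the capture in this thread.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Tuple

# Matches "src > dst: Flags [S]," and the following "seq N".
# "[S]" is a pure SYN; a SYN-ACK would be printed as "[S.]" and is not matched.
SYN_RE = re.compile(r"(\S+) > (\S+): Flags \[S\],.*?\bseq (\d+)")

Key = Tuple[str, str, str]  # (source, destination, sequence number)


def count_syn_attempts(lines: Iterable[str]) -> Dict[Key, int]:
    """Count pure SYN packets per (source, destination, sequence number)."""
    counts: Counter = Counter()
    for line in lines:
        for src, dst, seq in SYN_RE.findall(line):
            counts[(src, dst, seq)] += 1
    return dict(counts)


def retransmitted_syns(lines: Iterable[str]) -> Dict[Key, int]:
    """Keep only SYNs seen more than once, i.e. likely retransmissions."""
    return {key: n for key, n in count_syn_attempts(lines).items() if n > 1}


# Abridged from the capture above ("..." replaces the IP header details).
capture = [
    "01:57:43.919215 IP ... 10.45.140.11.13507 > 169.254.169.254.http: "
    "Flags [S], cksum 0xea63 (incorrect -> 0x4740), seq 3185618879, win 29200",
    "01:57:44.921287 IP ... 10.45.140.11.13507 > 169.254.169.254.http: "
    "Flags [S], cksum 0xea63 (incorrect -> 0x4356), seq 3185618879, win 29200",
]

print(retransmitted_syns(capture))
# {('10.45.140.11.13507', '169.254.169.254.http', '3185618879'): 2}
```

A repeated SYN with no reply points at something dropping the traffic between the instance and the metadata service rather than at CFE itself.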
Proxy reset to default, as I had one configured:
tmsh modify sys db proxy.host reset-to-default
Found this KB: https://my.f5.com/manage/s/article/K000137268 Could that be the solution?
That KB article is for AS3, but your issue appears to be between the instance and Azure metadata service. I would verify that you have nothing on the BIG-IP blocking access to both 169.254.169.254 and 168.63.129.16. If you don't, then you need to contact Azure to troubleshoot why you can't connect.
We have verified from another VE that the IP is accessible.
This rules out an issue inside Azure itself. What else can we do to find the problem?
So it works on one VE, but not the other? This could still be an issue with Azure depending on security group or route configuration for that particular instance.
Description
Issues configuring CFE version 2.0
Environment information
Severity Level
For bugs, enter the bug severity level. Do not set any labels.
Severity: 3
Error message configuring
Code
Configured the previously working version 1.5 and it started working.