f5devcentral / f5-cloud-failover-extension

F5 Cloud Failover Extension (Archived)
Apache License 2.0

400 error on any POST to CFE #19

Closed rtorosian closed 4 years ago

rtorosian commented 4 years ago

Recently deployed a pair of VEs to Azure and am trying to get CFE working. Running marketplace 14.1.2 code and have tried both the 1.1 and 1.0 versions of CFE. All POSTs fail with a non-descriptive 400 and most GETs do too; the only GET that responds with a 200 is /info. Local logs in silly mode aren't helpful either. Here's an example POST payload and its response and logs:

{ "class": "Cloud_Failover", "environment": "azure", "externalStorage": { "scopingTags": { "f5_cloud_failover_label": "failover" } }, "failoverAddresses": { "scopingTags": { "f5_cloud_failover_label": "failover" } }, "failoverRoutes": { "scopingTags": { "f5_cloud_failover_label": "failover" }, "scopingAddressRanges": [ { "range": "10.107.4.0/22" } ], "defaultNextHopAddresses": { "discoveryType": "static", "items": [ "10.107.4.4", "10.107.4.5" ] } }, "controls": { "class": "Controls", "logLevel": "silly" } }

{ "code": 400, "message": "remoteSender:10.255.18.73, method:POST ", "originalRequestBody": "{\n \"class\": \"Cloud_Failover\",\n \"environment\": \"azure\",\n \"externalStorage\": {\n \"scopingTags\": {\n \"f5_cloud_failover_label\": \"failover\"\n }\n },\n \"failoverAddresses\": {\n \"scopingTags\": {\n \"f5_cloud_failover_label\": \"failover\"\n }\n },\n \"failoverRoutes\": {\n \"scopingTags\": {\n \"f5_cloud_failover_label\": \"failover\"\n },\n \"scopingAddressRanges\": [\n {\n \"range\": \"10.107.4.0/22\"\n }\n ],\n \"defaultNextHopAddresses\": {\n \"discoveryType\": \"static\",\n \"items\": [\n \"10.107.4.4\",\n \"10.107.4.5\"\n ]\n }\n },\n \"controls\": {\n \"class\": \"Controls\",\n \"logLevel\": \"silly\"\n }\n}", "referer": "10.255.18.73", "restOperationId": 7477090, "kind": ":resterrorresponse" }

    Tue, 17 Mar 2020 15:53:15 GMT - info: [f5-cloud-failover] Global logLevel set to 'silly'
    Tue, 17 Mar 2020 15:53:15 GMT - finest: [f5-cloud-failover] Modifying existing data group f5-cloud-failover-state with body {"name":"f5-cloud-failover-state","type":"string","records":[{"name":"state","data":"eyJjb25maWciOnsiY2xhc3MiOiJDbG91ZF9GYWlsb3ZlciIsImVudmlyb25tZW50IjoiYXp1cmUiLCJleHRlcm5hbFN0b3JhZ2UiOnsic2NvcGluZ1RhZ3MiOnsiZjVfY2xvdWRfZmFpbG92ZXJfbGFiZWwiOiJmYWlsb3ZlciJ9fSwiZmFpbG92ZXJBZGRyZXNzZXMiOnsic2NvcGluZ1RhZ3MiOnsiZjVfY2xvdWRfZmFpbG92ZXJfbGFiZWwiOiJmYWlsb3ZlciJ9fSwiZmFpbG92ZXJSb3V0ZXMiOnsic2NvcGluZ1RhZ3MiOnsiZjVfY2xvdWRfZmFpbG92ZXJfbGFiZWwiOiJmYWlsb3ZlciJ9LCJzY29waW5nQWRkcmVzc1JhbmdlcyI6W3sicmFuZ2UiOiIxMC4xMDcuNC4wLzIyIn1dLCJkZWZhdWx0TmV4dEhvcEFkZHJlc3NlcyI6eyJkaXNjb3ZlcnlUeXBlIjoic3RhdGljIiwiaXRlbXMiOlsiMTAuMTA3LjQuNCIsIjEwLjEwNy40LjUiXX19LCJjb250cm9scyI6eyJjbGFzcyI6IkNvbnRyb2xzIiwibG9nTGV2ZWwiOiJzaWxseSJ9LCJzY2hlbWFWZXJzaW9uIjoiMS4xLjAifX0="}]}
    Tue, 17 Mar 2020 15:53:18 GMT - info: [f5-cloud-failover] Successfully wrote Failover trigger scripts to filesystem
    Tue, 17 Mar 2020 15:53:18 GMT - fine: [f5-cloud-failover] Initializing failover class
    Tue, 17 Mar 2020 15:53:18 GMT - fine: [f5-cloud-failover] config: {"class":"Cloud_Failover","environment":"azure","externalStorage":{"scopingTags":{"f5_cloud_failover_label":"failover"}},"failoverAddresses":{"scopingTags":{"f5_cloud_failover_label":"failover"}},"failoverRoutes":{"scopingTags":{"f5_cloud_failover_label":"failover"},"scopingAddressRanges":[{"range":"10.107.4.0/22"}],"defaultNextHopAddresses":{"discoveryType":"static","items":["10.107.4.4","10.107.4.5"]}},"controls":{"class":"Controls","logLevel":"silly"},"schemaVersion":"1.1.0"}
    Tue, 17 Mar 2020 15:53:18 GMT - finest: [f5-cloud-failover] Telemetry submitted successfully
    Tue, 17 Mar 2020 15:54:16 GMT - fine: [f5-cloud-failover] HTTP Request - POST /declare

I've tried a bunch of different POST payloads using all the examples I can find; they all behave the same. One odd thing is the POSTs seem to retry: I'll see them many times in the logs well after I sent the initial POST (and closed out Postman or my shell session).
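
For reference, here's roughly the curl equivalent of what I'm sending (host and credentials are placeholders; the /declare path is from the CFE docs):

    curl -sku admin:<password> \
        -H "Content-Type: application/json" \
        -X POST https://<mgmt-ip>/mgmt/shared/cloud-failover/declare \
        -d @failover.json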

rtorosian commented 4 years ago

Just destroyed and reapplied using marketplace 15.0 code and it's basically the same experience, save that now a GET on /declare actually works.

C0missar commented 4 years ago

I was about to open the exact same issue when I found this, but I'm seeing it in AWS: TMOS 14.1.2.0, CFT 5.3.0 HA across AZs BYOL, and CFE 1.1.0. It doesn't matter whether I post the declaration via Ansible or Postman. I also see the retries after having closed the Postman session, making me think the error involves a POST from one API to the other on-box.

I wonder if it might be related to not having disabled the tgactive.sh script: I have a hacked 5.3.0 CFT that removes tgactive.sh, and that deployment does not exhibit the 400 errors. I have other issues though, and F5 support won't work with me on a modified CFT.

FWIW, the other issue I have on this same stock CFT is that APM cores on first boot, but on the B device only, every time. I have a ticket open on that one, #C3231030. Despite the ticket title, support won't touch CFE, and is only looking at the APM core problem.

shyawnkarim commented 4 years ago

I just deployed this template and everything is working as expected. I'm able to get /info, post, and fail over.

Can I get some additional details on your deployments so that I can try and duplicate the behavior?

rtorosian commented 4 years ago

I'm using the out-of-the-box setup, which calls the /trigger URL in tgactive.sh.
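
For context, the trigger call CFE wires into the failover scripts is just a local REST call; the line it puts in tgactive.sh looks roughly like this (a sketch only -- the exact flags CFE writes may differ):

    # tgactive.sh addition (sketch; exact content written by CFE may differ)
    curl -su admin: -d '{}' -X POST http://localhost:8100/mgmt/shared/cloud-failover/trigger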

rtorosian commented 4 years ago

I based my deployment on https://github.com/JeffGiroux/f5_terraform/tree/master/HA_via_api but updated it for Terraform azurerm 2.0. failover.json had to be updated for defaultNextHopAddresses (I've tried both static and routeTag).

C0missar commented 4 years ago

Rather than hijack rtorosian's thread, which seems to be Azure-centric anyway, I moved my comments/rant over to issue #17.

JeffGiroux commented 4 years ago

My repo, which I forked from a colleague (https://github.com/JeffGiroux/f5_terraform/tree/master/HA_via_api), is currently using the pre-1.0 CFE RPM (v0.9.1), as seen here - https://github.com/JeffGiroux/f5_terraform/blob/ab4ceb4c754d507d4db96b30cbb75ac53e30d61f/HA_via_api/variables.tf#L68. That 0.9.1 CFE release did not automatically comment out or modify tgactive.sh. If you are using this repo, I would suggest you make the appropriate updates to the declaration as you noted (ex: failover.json). Some of those files were last updated a few months ago, so they need some attention; I'm doing that now in my own repo.

And to note...the 0.9 release of CFE did NOT comment anything out or modify the tgactive.sh file. The 1.0 and 1.1 releases of CFE do. But...not if you are updating from 0.9. So...make sure to manually remove 0.9 (or whatever version you updated to, if you didn't remove 0.9 first), then re-install. Then /reset the state file (the CFE docs cover this).

If you want to deploy an ARM template created by the F5 PM/PD team, we have plenty of working examples that use the current/latest versions of the cloud extensions. For example, the latest release includes CFE for failover in Azure along with AWS and Google: https://github.com/F5Networks/f5-azure-arm-templates/releases/tag/v7.4.0.0

Permissions: it's also important to make sure your permissions (managed identity/roles) are correct and that your firewall rules allow access to make API calls to the Azure REST API endpoint. Those prerequisites are listed in the cloud failover docs as well as the GitHub ARM template docs (and the other cloud providers' F5 template docs).
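
For example, a minimal check/grant for an Azure VM's managed identity might look like this with az cli (a sketch; names and scope are placeholders, and a more tightly scoped role than Contributor is preferable):

    # look up the system-assigned identity of the BIG-IP VM
    az vm identity show -g <resource-group> -n <bigip-vm-name> --query principalId -o tsv
    # grant it rights over the resource group (placeholder role/scope)
    az role assignment create --assignee <principalId> --role Contributor \
        --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>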

rtorosian commented 4 years ago

@JeffGiroux I started from scratch with your repo with all the latest RPMs and made a bunch of tweaks here and there to get things to build cleanly with Terraform. tgactive (or any of the scripts in /config/failover) won't come into play yet for me since, yeah, they make the same API calls I'm attempting manually. /reset is yet another API call that results in a 400. As for permissions, yes, I have those all correct to allow a call to Azure, but right now I'm running into issues just calling CFE itself.

JeffGiroux commented 4 years ago

If /reset isn't working, then something else is happening with the endpoint (restnoded stuck, maybe?). I ran into some weird issues with my terraform repo which kept causing other unrelated issues. I created a dev branch to work through them, updated providers, updated code and docs, but still no luck. The cluster forms, RPM packages and declarations get loaded, but I get weird, inconsistent results with the failover extension. So...I'm blaming my onboarding scripts or something in the terraform resources.

On a side note, my terraform repo is community supported (self-created) whereas the GitHub cloud templates are created and supported by F5 product development staff as well as the community. F5 currently does not have a productized Terraform repo/examples yet. This ticket is not meant to troubleshoot my terraform repo, so instead please focus on testing the failover extension only, using a known-good working BIG-IP setup.

Options to test Failover Extension: One... The latest 7.4 template versions will deploy a BIG-IP pair with HA via API using the cloud failover extension. The tags are set already and the declaration POSTs successfully.

Two... Spin up 2 standalone BIG-IP devices using ARM templates. This way, you can manually go through the HA pairing and also the cloud failover extension to see how it works. Plus, you'll be starting on a known-good BIG-IP install.

Three... You can manually deploy 2 instances using the Azure portal (the web GUI). A pain, but another option in case you want to try it this way.

Use one of the above options to get a known-good working BIG-IP deployment, and then test the failover extension. Report back here with the results to share whether you're still getting 400 errors consistently.

JeffGiroux commented 4 years ago

I just deployed failover via the HA with 3nic template - https://github.com/F5Networks/f5-azure-arm-templates/tree/master/supported/failover/same-net/via-api/n-nic/new-stack/payg. The cloud failover extension works great. I do not get any of the slowness I see when I run the deployment through my terraform repo, nor do I receive any weird 400 issues after onboarding.

Some items to note for difference:

  1. The terraform repo uses the declarative onboarding (DO) extension for onboarding the L1-L3 stuff and HA pairing. The ARM templates do not use DO; they use f5-cloud-libs for onboarding.

  2. When posting my DO declaration (using the dev branch of my repo), I think I'm running into this issue with DO regarding DHCP on the mgmt NIC - https://github.com/F5Networks/f5-declarative-onboarding/issues/129. My guess is this is also causing problems with the cloud failover extension, since HA never properly forms. I have to manually re-run DO and then fix CFE with /reset and re-POSTing, but then the endpoints hang, I get 4xx...and so on. (A workaround sketch follows.)
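
For anyone who wants to try it, one workaround for that DHCP issue is to disable DHCP on the management interface before DO runs; something like:

    # run on each BIG-IP before pushing the DO declaration
    tmsh modify sys global-settings mgmt-dhcp disabled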

rtorosian commented 4 years ago

@JeffGiroux Thanks for all the extra input. I'll try another ARM template later today to see how it's doing DO. Ultimately I have to succeed with Terraform, though.

So today I've tried disabling mgmt-dhcp right before the DO stuff runs, and it hasn't appeared to help. However, I am getting an actual descriptive 400 now if I manually try to re-apply the config with a /declare:

{ "code": 400, "declaration": { "async": false, "class": "Cloud_Failover", "controls": { "class": "Controls", "logLevel": "silly" }, "environment": "azure", "externalStorage": { "scopingTags": { "f5_cloud_failover_label": "failover" } }, "failoverAddresses": { "scopingTags": { "f5_cloud_failover_label": "failover" } }, "failoverRoutes": { "defaultNextHopAddresses": { "discoveryType": "routeTag" }, "scopingAddressRanges": [ { "range": "10.107.4.0/23" } ], "scopingTags": { "f5_cloud_failover_label": "failover" } } }, "errors": [ { "dataPath": ".declaration", "keyword": "additionalProperties", "message": "should NOT have additional properties", "params": { "additionalProperty": "environment" }, "schemaPath": "#/additionalProperties" } ], "id": "b12d0487-bcbe-442b-b98f-886207c3d9f8", "message": "bad declaration", "result": { "class": "Result", "code": 400, "errors": [ { "dataPath": ".declaration", "keyword": "additionalProperties", "message": "should NOT have additional properties", "params": { "additionalProperty": "environment" }, "schemaPath": "#/additionalProperties" } ], "message": "bad declaration", "status": "ERROR" }, "selfLink": "https://localhost/mgmt/shared/declarative-onboarding/task/b12d0487-bcbe-442b-b98f-886207c3d9f8", "status": "ERROR" }

Looks like it's complaining about 'environment'?! (And oddly, the selfLink in that response points at the declarative-onboarding task endpoint, not cloud-failover.)

rtorosian commented 4 years ago

Something else weird I noticed; here's the response from a GET on /info:

{ "version": "1.1.0", "release": "0", "schemaCurrent": "0.9.1", "schemaMinimum": "1.1.0" }

The API reference shows the Current & Minimum being reversed...

JeffGiroux commented 4 years ago

Regarding /info showing different schemaCurrent and schemaMinimum values: I ran into this yesterday and informed the team. If you run GET /info repeatedly, the current and min values swap each time you run it. Weird.
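
If you want to watch the swap yourself, hit /info a few times in a row (host and credentials are placeholders):

    for i in 1 2 3; do
      curl -sku admin:<password> https://<mgmt-ip>/mgmt/shared/cloud-failover/info; echo
    done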

To possibly fix your current 400 error, try doing a /reset to reset the state file, then re-POST the declaration to both BIG-IP units. You can find the various API calls in the Postman collection here - https://clouddocs.f5.com/products/extensions/f5-cloud-failover/latest/userguide/postman-collection.html.

Name = "Reset failover state file" method = POST endpoint = "{{baseUrl}}/reset" payload = { "resetStateFile": true }

You'll get a 200 OK saying that the state file has been reset. If you do a GET {{baseUrl}}/trigger, it will also show the state file has been reset. If you see that, proceed to re-POST the CFE declaration, then attempt failover or other GET calls. For example, after a successful failover (and an assumed successful CFE and BIG-IP setup), a GET to the /trigger endpoint will respond with all the associate and disassociate cloud object tasks it will perform (move IPs, routes, etc.).

So...try to reset state file. Then re-POST declaration. Then re-test.
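
As a curl sketch of that whole sequence (host and credentials are placeholders):

    BASE="https://<mgmt-ip>/mgmt/shared/cloud-failover"
    # 1. reset the state file
    curl -sku admin:<password> -H "Content-Type: application/json" \
        -X POST "$BASE/reset" -d '{"resetStateFile": true}'
    # 2. re-POST the CFE declaration
    curl -sku admin:<password> -H "Content-Type: application/json" \
        -X POST "$BASE/declare" -d @failover.json
    # 3. verify
    curl -sku admin:<password> "$BASE/trigger"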

rtorosian commented 4 years ago

I've been using Postman for a lot of the testing. /reset always gives me a 400 no matter what :(

JeffGiroux commented 4 years ago

If /reset gives you a 400 all the time, then I would do some manual fixes to your current deployment to continue troubleshooting (if you'd rather script it than use the GUI, see the REST sketch after this list).

  1. Leave the BIG-IPs deployed...no need for a new deployment job
  2. Manually log into the GUI and go to iApps packages to find the RPM package for the cloud failover extension
  3. Delete the RPM package on both BIG-IP units
  4. Manually upload the latest 1.1.0 CFE RPM package via the GUI to both BIG-IP units
  5. Do another POST to reset the state file on both BIG-IP units
  6. POST the declaration to both BIG-IP units
  7. Test API endpoints and/or failover

If still receiving 400 errors, then I would imagine the PM/PD team will need a dump of your logs.
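
If you'd rather script steps 3-6 than click through the GUI, the iControl REST package-management endpoint can handle the uninstall/install; a sketch (package and file names are placeholders, and the RPM must already be on the box, e.g. under /var/config/rest/downloads):

    TASKS="https://<mgmt-ip>/mgmt/shared/iapp/package-management-tasks"
    # uninstall the existing CFE package
    curl -sku admin:<password> -H "Content-Type: application/json" -X POST "$TASKS" \
        -d '{"operation": "UNINSTALL", "packageName": "<old-cfe-package-name>"}'
    # install the freshly uploaded RPM
    curl -sku admin:<password> -H "Content-Type: application/json" -X POST "$TASKS" \
        -d '{"operation": "INSTALL", "packageFilePath": "/var/config/rest/downloads/<cfe-rpm-file>.rpm"}'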

shyawnkarim commented 4 years ago

Internal bug ID AUTOSDK-234 has been created to look into this.

rtorosian commented 4 years ago

So after lots of remove-RPM/reboot/add-RPM/reboot combos, it's even worse. Both devices show tons of iApp subsystem errors and such, half the time they don't show CFE installed, and now they return a 404 for every API call :-/ I noticed this behavior earlier in the week too, on a previous iteration of F5 apply/destroy builds, so it's not just this exact VE pair.

JeffGiroux commented 4 years ago

My quick troubleshooting update: I deployed a brand new Azure environment using the Azure CLI only, trying to rule out any onboarding weirdness and any terraform weirdness. My BIG-IPs were created with az vm create, along with all the other related Azure objects via the corresponding az cli commands. That is my base environment.
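
Heavily abbreviated, the gist of the az cli build looked something like this (a sketch only; image URN, size, and NIC names are placeholders, and the F5 marketplace image also requires plan acceptance):

    # one of the two BIG-IPs; the NICs, NSGs, and IPs were created by earlier az commands
    az vm create -g <resource-group> -n bigip-a \
        --image <f5-bigip-marketplace-urn> --size Standard_DS3_v2 \
        --nics bigip-a-mgmt bigip-a-ext bigip-a-int \
        --admin-username azureuser --admin-password '<password>'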

Testing:

  1. The above deploys two of the latest marketplace images in standalone mode
  2. Walk through the GUI wizard initial setup on each F5 manually
  3. Set up name, VLAN, self IP, routes, etc.
  4. Create one VIP listening on the secondary IP (in my example, 10.0.2.10). This should match the secondary private IP on the VM NIC in Azure.
  5. Manually paired HA using trust peer add, device group create (sync-fail), auto sync (see the tmsh sketch after this section)
  6. Validated that HA is now formed
  7. Validated that my VIP is reachable on the public IP associated with 10.0.2.10

At this point, I did not use the Declarative Onboarding RPM. Everything above regarding HA setup and initial BIG-IP setup was me...manual.
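
The manual pairing in step 5 was basically the standard tmsh sequence; roughly (names and IPs are placeholders):

    # on bigip-a: set the config-sync IP, add the peer to the trust, build the device group
    tmsh modify cm device bigip-a.example.com configsync-ip 10.0.1.4
    tmsh modify cm trust-domain Root ca-devices add { 10.0.1.5 } \
        name bigip-b.example.com username admin password '<password>'
    tmsh create cm device-group failover-group \
        devices add { bigip-a.example.com bigip-b.example.com } \
        type sync-failover auto-sync enabled
    tmsh run cm config-sync to-group failover-group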

Next...failover testing

  1. Installed latest CFE 1.1.0 to both BIG-IPs
  2. Installed declaration to both BIG-IPs
  3. Ran Postman calls to validate no weirdness (ex: no 4xx errors). Checked various endpoints like GET /declare, info, inspect. All ran successfully with no issues.
  4. Forced manual failover
  5. CFE GET /trigger showed RUNNING status, showed which objects would move (good)
  6. Failover validated, objects moved
  7. Ran various Postman calls (a lot) to test weirdness (ex slowness, 4xx errors) but found nothing
  8. Tested failback
  9. Tested lots of postman calls again. No issues

At this point, deploying an ARM template works great. Also, manually (or via az cli) deploying 2 BIG-IPs, then manually forming HA, then deploying CFE as detailed in the CFE Quick Start works great.

My next test is to deploy a new cluster using my fancy (or not so fancy) az cli script: 2 new standalone BIG-IPs, then use Declarative Onboarding for the L1-L3 setup plus HA forming. I'm testing whether or not DO is conflicting with CFE; that's the point of this latter test. Stay tuned...

JeffGiroux commented 4 years ago

OK, final update maybe... I can definitely point to the combination of running DO plus CFE as where things start behaving strangely and inconsistently. Read my previous replies...CFE behaves perfectly well with an ARM template deployment (f5-cloud-libs onboarding and HA pairing), and CFE works great when I spin up 2 BIG-IPs manually from the marketplace and walk through the GUI.

My last test was to spin up a new cluster again, this time doing onboarding with DO. So...

  1. Launch 2 BIG-IP instances from the Azure marketplace
  2. Log in via SSH and set the azureuser password. Also disable gui-setup
  3. Upload the DO RPM to both units
  4. Push the DO declaration to both units (curl sketch below)
  5. HA forms successfully the first time, no errors
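
For step 4, the DO push is just another REST call; a sketch (host and credentials are placeholders; declaration contents per the DO quick start):

    curl -sku admin:<password> -H "Content-Type: application/json" \
        -X POST https://<mgmt-ip>/mgmt/shared/declarative-onboarding \
        -d @do.json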

Also, I should note that my previous testing of DO was possibly using an ill-formatted JSON file. I grabbed an example from the clouddocs quick start in the DO readme and all works well with a DO push now...no weird DHCP error.

At this point, DO works great. No weird errors. Using the latest DO, 1.11.0. Time to test cloud failover now. Remember, I now have a cluster onboarded with DO; the prior two tests used an ARM template and manual HA.

  1. Installed the CFE 1.1.0 RPM on both boxes in the DO cluster
  2. POSTed the JSON CFE declaration to both boxes
  3. Checked a few API endpoints like /info, then tried GET /trigger

First...got this...

Sat, 21 Mar 2020 21:28:47 GMT - severe: [f5-cloud-failover] failover.execute() error: Cannot read property 'disassociate' of undefined TypeError: Cannot read property 'disassociate' of undefined

  1. Reset state file on each BIG-IP
  2. GET /trigger now shows that state file reset
  3. Force failover via the GUI on active F5
  4. Check various endpoints like GET /trigger again

I immediately started seeing weirdness: really slow responses from the API endpoint, eventually timing out with responses like the one below. A GET to /trigger resulted in this after waiting forever:

    {
      "code": 400,
      "message": "remoteSender:x.x.x.x, method:GET ",
      "referer": "x.x.x.x",
      "restOperationId": 6569608,
      "kind": ":resterrorresponse"
    }

In all of my testing, the combination of DO 1.11.0 plus CFE 1.1.0 is strange and not working as expected. On the plus side, the cloud objects do indeed fail over: routes fail over, IPs fail over, things appear to work. But the API, and hence restnoded, might have conflicts with DO+CFE. Just a guess. I have qkviews if needed.

rtorosian commented 4 years ago

Have had time to play with this again. Thanks for all the testing, @JeffGiroux! I've done a bunch more F5 spin-ups and tried uninstalling the DO packages after initial deployment, leaving just CFE installed...and yeah, no change in behavior. What hooks could DO still have in the system?

rtorosian commented 4 years ago

Just tried the latest DO release, 1.11, and yeah same overall behavior :(

rtorosian commented 4 years ago

Looks like a similar issue is being tracked over in DO: https://github.com/F5Networks/f5-declarative-onboarding/issues/100

rtorosian commented 4 years ago

So as an update, I've been working with @JeffGiroux offline and we were able to get CFE to play somewhat nicely with DO: no more 400s! See his post four up from here. What enabled the failover/resets to finally work is a mystery, however, as I didn't change anything drastic in my code. My guess is that having the IPs initially attached to the NICs of the B-side F5 (which always came up initially as Active) might be it. I've attached my git commit if anyone is interested in the changes that got it working:

7f77de2d12918d48ee4a6891ee17520523b66f76.diff.txt

shyawnkarim commented 4 years ago

This issue has been addressed and will be included in our next release, expected around April 16, 2020.

alaari-f5 commented 4 years ago

Fixed in CFE 1.2. Please see https://github.com/f5networks/f5-cloud-failover-extension