sochoa closed this issue 8 years ago
cc @mhite
As I looked at the instrumented run above in my original post, the thought occurred to me that maybe the SOAP client became invalid at some point after the get_monitor_association API call. So I had it skip everything except get_monitor_instance:
def generate_dict(api_obj, fields):
    result_dict = {}
    lists = []
    supported_fields = []
    if api_obj.get_list():
        for field in fields:
            if field != 'monitor_instance':
                continue
            try:
                print "Generating dict for %s.%s" % (str(api_obj), "get_" + field)
                api_response = getattr(api_obj, "get_" + field)()
            except Exception, e:
                print "Generating dict for %s.%s failed: %s" % (str(api_obj), "get_" + field, str(e))
                pass
            else:
                lists.append(api_response)
                supported_fields.append(field)
        for i, j in enumerate(api_obj.get_list()):
            temp = {}
            temp.update([(item[0], item[1][i]) for item in zip(supported_fields, lists)])
            result_dict[j] = temp
    return result_dict
...and it still raised a URLError after timing out.
I think the root of the problem is that a user with the manager role doesn't have sufficient permissions to complete all of the fact gathering.
F5 describes the manager role as:
Grants users permission to create, modify, and delete virtual servers, pools, pool members, nodes, custom profiles, custom monitors, and iRules.
Would you mind trying with an administrator user account?
I think short-term we should document the privilege requirement for the account used during fact gathering. Long-term, we should take a look at how to gracefully handle this situation so a lower privileged user can at least gather some smaller set of facts.
The documentation for the module says that it was tested with the manager account. :(
It is very likely that additional functionality was added which invalidated that statement, unfortunately. I'll take a closer look but in the meantime if you have the ability to test with an administrator account, we can at least see if this theory is on the right track.
The admin is on vacation, so I won't be able to test in the short-term. Might be 1-2 weeks out.
Ok, I'll see what I can turn up here in a bit.
I've tested this module again with a manager-level account and am unable to reproduce the problem. I'm testing with BIG-IP 11.5.1 Build 8.0.175 Hotfix HF8 running on a BIG-IP 2200. (I actually tested against a lab cluster and a staging cluster.) I unfortunately don't have 11.4.1 at my disposal currently, but the module was indeed tested with 11.4 when it was first developed.
Here is my testing playbook:
# file bigip-test.yml
- hosts: localhost
  gather_facts: no
  tasks:
    - name: Collect BIG-IP facts
      local_action: >
        bigip_facts
        server=mylb.full.hostname.com
        user=username
        password=password
        include=address_class,certificate,client_ssl_profile,device,device_group,interface,key,node,pool,rule,self_ip,software,system_info,traffic_group,trunk,virtual_address,virtual_server,vlan
I am launching it via the following command:
ansible-playbook -i ./inv -vvv ../ansible-playbooks/f5/facts.yml
My inv file contains a single entry:
localhost
From what you put in the ticket, it seems as though every API call after get_monitor_instance fails and it never recovers. This could indeed be some weird resource issue on the F5 -- or even flaky network conditions.
Do you have other scripts or users that use the same credentials? This can cause problems, especially if people start logging in and changing partitions and such. Maybe try with session=true to see if that helps? (The help docs are wrong, session is off by default in the code.)
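For what it's worth, here is a rough sketch of what session=true amounts to under the hood, as I understand it: the module asks bigsuds for a connection bound to its own iControl session, so other users switching partitions won't affect it. The hostname and credentials below are placeholders.

import bigsuds

# Placeholder credentials -- substitute your own.
api = bigsuds.BIGIP(hostname='mylb.full.hostname.com',
                    username='username',
                    password='password')

# with_session_id() returns a connection pinned to a dedicated session,
# isolating it from partition changes made by other logins.
api = api.with_session_id()
print api.System.SystemInfo.get_version()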
Anything potentially suspicious about your network path from your Ansible control host to your load balancer? E.g., excessive packet loss? VPN tunnels? Middle hops with smaller MTUs? Proxies?
Anything interesting in /var/log/ltm?
This is really the point where I bust out packet capturing tools on both ends.
@mhite - Thanks for the test. I appreciate your effort there. I am operating from behind a bastion/jump host, but the bigsuds call is being performed from the bastion, which has direct access to the F5 API endpoint. I'll follow up on the LTM logs, but when I asked a couple days ago no one on my team could tell me where to find them. :) Thankfully the one who does have admin rights is back from vacation, so I might be able to get answers from him by week's end.
Sorry for the lag in response time here. I'm up against a deadline on a different project.
It would be interesting if you recorded timestamps in your debug output. The underlying suds library uses a 90 second default timeout for urllib connections. Unfortunately the bigsuds module does not expose the timeout, but 90 seconds should hopefully be plenty of time for the API to return a response. Increasing the timeout (and making it a configurable option in the Ansible module) would mean monkey patching bigsuds. I don't want to do that if we can avoid it.
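As a concrete (hypothetical) example of what I mean, a small helper like the one below could wrap each get_* call in the instrumented generate_dict above and print how long it took, so a call that hangs for the full 90 seconds is easy to spot:

from datetime import datetime

def timed_call(api_obj, field):
    # Hypothetical debug helper: time a single iControl getter so slow
    # responses stand out against suds' 90-second default timeout.
    start = datetime.now()
    print "[%s] calling %s.get_%s" % (start.isoformat(), str(api_obj), field)
    response = getattr(api_obj, "get_" + field)()
    print "[%s] %s.get_%s returned after %s" % (
        datetime.now().isoformat(), str(api_obj), field, datetime.now() - start)
    return response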
If you have other load balancers to try this against, that would also be useful data. (Even the backup member in an HA cluster is fine.)
All evidence seems to point to an underlying transport issue or a non-responsive iControl API.
You could hack your bigsuds module to increase the timeout if you are feeling adventurous.
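If you go that route, a rough sketch of one way to do it without editing bigsuds itself is below. It assumes suds' Client accepts a timeout keyword option (the default is 90 seconds); the 300-second value is arbitrary, so treat this as experimental rather than a supported approach.

import suds.client

_orig_init = suds.client.Client.__init__

def _patched_init(self, url, **kwargs):
    # Force a larger transport timeout unless the caller already set one.
    kwargs.setdefault('timeout', 300)  # seconds; arbitrary, tune as needed
    _orig_init(self, url, **kwargs)

suds.client.Client.__init__ = _patched_init

# Any bigsuds.BIGIP created after the patch builds its suds clients with
# the larger timeout. Placeholder credentials below.
import bigsuds
api = bigsuds.BIGIP(hostname='mylb.full.hostname.com',
                    username='username',
                    password='password')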
@mhite, @caphrim007, ping. This issue is still waiting on your response.
Unfortunately we weren't able to reproduce and it doesn't look like @sochoa was able to follow up. Unless @sochoa has new information, we'll probably need to close this as 'unable to reproduce'.
@mhite, @caphrim007, ping. This issue is still waiting on your response.
notabug
I'm running with a manager-level user + 11.4 on the F5 load balancer, but it's still giving me several URLErrors when I try to get bigip_facts (for some API methods, not for all).
Here's how I know the BIG-IP version:
which reports
BIG-IP_v11.4.1
Here's a debug log (instrumented the Python file dropped by Ansible in-flight):