bigip_facts reporting URLErrors on some endpoints and not others

sochoa commented 9 years ago

I'm running with a manager-level user and version 11.4 on the F5 load balancer, but it's still giving me several URLErrors when I try to get bigip_facts (for some API methods, not for all).

Here's how I know the Big-IP version:

import bigsuds
b = bigsuds.BIGIP(hostname = 'f5', username = 'sochoa', password = '<redacted>')
print b.System.SystemInfo.get_version()

which reports BIG-IP_v11.4.1.

Here's a debug log (I instrumented the Python file that Ansible drops, in flight):

$ python bigip_facts
Main being called
Connecting to Big-IP
Connected!
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_action_on_service_down
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_active_member_count
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_aggregate_dynamic_ratio
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_allow_nat_state
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_allow_snat_state
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_client_ip_tos
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_client_link_qos
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_description
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_gateway_failsafe_device
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_ignore_persisted_weight_state
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_lb_method
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_member
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_minimum_active_member
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_minimum_up_member
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_minimum_up_member_action
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_minimum_up_member_enabled_state
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_monitor_association
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_monitor_instance
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_monitor_instance failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_object_status
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_object_status failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_profile
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_profile failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_queue_depth_limit
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_queue_depth_limit failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_queue_on_connection_limit_state
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_queue_on_connection_limit_state failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_queue_time_limit
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_queue_time_limit failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_reselect_tries
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_reselect_tries failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_server_ip_tos
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_server_ip_tos failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_server_link_qos
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_server_link_qos failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_simple_timeout
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_simple_timeout failed:  URLError: <urlopen error The read operation timed out>
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_slow_ramp_time
Generating dict for <__main__.Pools object at 0x9bde9cc>.get_slow_ramp_time failed:  URLError: <urlopen error The read operation timed out>

bcoca commented 9 years ago

cc @mhite

sochoa commented 9 years ago

As I looked at the instrumented run in my original post, it occurred to me that the SOAP client might have become invalid at some point after the get_monitor_association API call. So I had it skip everything except get_monitor_instance:

def generate_dict(api_obj, fields):
    result_dict = {}
    lists = []
    supported_fields = []
    if api_obj.get_list():
        for field in fields:
            if field != 'monitor_instance':
                continue
            try:
                print "Generating dict for %s.%s" % (str(api_obj), "get_" + field)
                api_response = getattr(api_obj, "get_" + field)()
            except Exception, e:
                print "Generating dict for %s.%s failed:  %s" % (str(api_obj), "get_" + field, str(e))
                pass
            else:
                lists.append(api_response)
                supported_fields.append(field)
        for i, j in enumerate(api_obj.get_list()):
            temp = {}
            temp.update([(item[0], item[1][i]) for item in zip(supported_fields, lists)])
            result_dict[j] = temp
    return result_dict

... and it still raised a URLError after timing out.

mhite commented 9 years ago

I think the root of the problem is that a user with the manager role doesn't have sufficient permissions to complete all of the fact gathering.

F5 describes the manager role as:

Grants users permission to create, modify, and delete virtual servers, pools, pool members, nodes, custom profiles, custom monitors, and iRules.

Would you mind trying with an administrator user account?

I think short-term we should document the privilege requirement for the account used during fact gathering. Long-term, we should take a look at how to gracefully handle this situation so a lower privileged user can at least gather some smaller set of facts.
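A minimal sketch of what "gather some smaller set of facts" could look like: collect every field the account can read and report the rest separately, rather than silently dropping them. This is written for Python 3, unlike the Python 2 snippets in this thread. `RestrictedPools` and `gather_partial_facts` are hypothetical names for illustration and are not part of the module, and the `PermissionError` only stands in for whatever error a real restricted account would produce.

```python
def gather_partial_facts(api_obj, fields):
    """Collect every readable field; record unreadable ones as errors."""
    facts = {}
    errors = {}
    for field in fields:
        try:
            facts[field] = getattr(api_obj, "get_" + field)()
        except Exception as exc:
            # Keep going so one failing call does not hide the others.
            errors[field] = "%s: %s" % (type(exc).__name__, exc)
    return facts, errors


class RestrictedPools:
    """Hypothetical stand-in for an API wrapper used by a limited account."""

    def get_description(self):
        return ["web pool"]

    def get_profile(self):
        raise PermissionError("access denied")


facts, errors = gather_partial_facts(RestrictedPools(), ["description", "profile"])
print(facts)
print(errors)
```

Returning the errors alongside the facts would let the module surface which calls failed and why, instead of leaving the user to instrument the code by hand.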

sochoa commented 9 years ago

The documentation for the module says that it was tested with the manager account. :(

mhite commented 9 years ago

It is very likely that additional functionality was added which invalidated that statement, unfortunately. I'll take a closer look but in the meantime if you have the ability to test with an administrator account, we can at least see if this theory is on the right track.

sochoa commented 9 years ago

The admin is on vacation, so I won't be able to test in the short-term. Might be 1-2 weeks out.

mhite commented 9 years ago

Ok, I'll see what I can turn up here in a bit.

mhite commented 9 years ago

I've tested this module again with a manager level account and am unable to reproduce the problem. I'm testing with BIG-IP 11.5.1 Build 8.0.175 Hotfix HF8 running on a BIG-IP 2200. (I actually tested against a lab cluster and staging cluster.) I unfortunately don't have 11.4.1 at my disposal currently, but the module was indeed tested with 11.4 when it was first developed.

Here is my testing playbook:

# file bigip-test.yml
- hosts: localhost
  gather_facts: no
  tasks:
  - name: Collect BIG-IP facts
    local_action: >
      bigip_facts
      server=mylb.full.hostname.com
      user=username
      password=password
      include=address_class,certificate,client_ssl_profile,device,device_group,interface,key,node,pool,rule,self_ip,software,system_info,traffic_group,trunk,virtual_address,virtual_server,vlan

I am launching it via the following command:

ansible-playbook -i ./inv -vvv ../ansible-playbooks/f5/facts.yml

My inv file contains a single entry:

localhost

From what you put in the ticket, it looks as though every API call from get_monitor_instance onward fails and it never recovers. This could indeed be some weird resource issue on the F5 -- or even flaky network conditions.
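The "never recovers" pattern can be checked mechanically against the debug output. The following is a rough Python 3 sketch that assumes the exact log line format shown in the original post; `summarize` is a hypothetical helper, not something the module provides. It returns the first method that failed and any methods that succeeded after it.

```python
import re

# Matches both the "attempt" line and the "... failed:" line from the log.
LINE_RE = re.compile(r"Generating dict for <[^>]*>\.(get_\w+)( failed:)?")


def summarize(log_lines):
    """Return (first failed method, methods that succeeded after it)."""
    attempted = []
    failed = set()
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        method = match.group(1)
        if match.group(2):
            failed.add(method)
        elif method not in attempted:
            attempted.append(method)
    first_failure = next((m for m in attempted if m in failed), None)
    recovered = []
    if first_failure is not None:
        index = attempted.index(first_failure)
        recovered = [m for m in attempted[index + 1:] if m not in failed]
    return first_failure, recovered


prefix = "Generating dict for <__main__.Pools object at 0x9bde9cc>."
error = " failed:  URLError: <urlopen error The read operation timed out>"
sample = [
    prefix + "get_monitor_association",
    prefix + "get_monitor_instance",
    prefix + "get_monitor_instance" + error,
    prefix + "get_object_status",
    prefix + "get_object_status" + error,
]
print(summarize(sample))
```

An empty second element means no call succeeded after the first failure, which is what the posted log shows.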

Do you have other scripts or users that use the same credentials? This can cause problems, especially if people start logging in and changing partitions and such. Maybe try with session=true to see if that helps? (The help docs are wrong; session is off by default in the code.)

Anything potentially suspicious about your network path from your Ansible control host to your load balancer? For example, excessive packet loss? VPN tunnels? Middle hops with smaller MTUs? Proxies?

Anything interesting in /var/log/ltm?

This is really the point where I bust out packet capturing tools on both ends.

sochoa commented 9 years ago

@mhite - Thanks for the test. I appreciate your effort there. I am operating from behind a bastion/jump host, but the bigsuds call is being performed from the bastion, which has direct access to the F5 API endpoint. I'll follow up on the LTM logs, but when I asked a couple of days ago no one on my team could tell me where to find them. :) Thankfully the one who does have admin rights is back from vacation, so I might be able to get answers from him by week's end.

Sorry for the lag in response time here. I'm up against a deadline on a different project.

mhite commented 9 years ago

It would be interesting if you recorded timestamps in your debug output. The underlying suds library uses a 90 second default timeout for urllib connections. Unfortunately the bigsuds module does not expose the timeout, but 90 seconds should hopefully be plenty of time for the API to return a response. Increasing the timeout (and making it a configurable option in the Ansible module) would mean monkey patching bigsuds. I don't want to do that if we can avoid it.
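A small Python 3 sketch of the kind of timestamped instrumentation meant here: wrap each call, print when it started and how long it took, and compare the elapsed time against the 90 second timeout. `timed_call` and `FakePools` are hypothetical names for illustration; in the real module the object would be the Pools wrapper and the failure would surface as a URLError.

```python
import time
from datetime import datetime


def timed_call(api_obj, field, clock=time.monotonic):
    """Call api_obj.get_<field>() and report start time and duration."""
    method_name = "get_" + field
    started_at = datetime.now().isoformat(timespec="seconds")
    start = clock()
    try:
        result = getattr(api_obj, method_name)()
        error = None
    except Exception as exc:
        result = None
        error = str(exc)
    elapsed = clock() - start
    status = "ok" if error is None else "failed: " + error
    print("%s %s took %.1fs %s" % (started_at, method_name, elapsed, status))
    return result, error, elapsed


class FakePools:
    """Hypothetical stand-in for the module's Pools wrapper."""

    def get_lb_method(self):
        return ["LB_METHOD_ROUND_ROBIN"]

    def get_monitor_instance(self):
        raise IOError("The read operation timed out")


pools = FakePools()
for name in ("lb_method", "monitor_instance"):
    timed_call(pools, name)
```

If the failing calls each take roughly 90 seconds, the API is not answering at all; if they fail much faster, something other than the read timeout is likely involved.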

If you have other load balancers to try this against, that would also be useful data. (Even the backup member in an HA cluster is fine.)

All evidence seems to point to an underlying transport issue or a non-responsive iControl API.

You could hack your bigsuds module to increase the timeout if you are feeling adventurous.

https://devcentral.f5.com/questions/bigsuds-timeout

ansibot commented 8 years ago

@mhite, @caphrim007, ping. This issue is still waiting on your response.

mhite commented 8 years ago

Unfortunately we weren't able to reproduce and it doesn't look like @sochoa was able to follow up. Unless @sochoa has new information, we'll probably need to close this as 'unable to reproduce'.

ansibot commented 8 years ago

@mhite, @caphrim007, ping. This issue is still waiting on your response.

mhite commented 8 years ago

notabug

ansible / ansible-modules-extras

bigip_facts reporting URLErrors on some endpoints and not others #799