alchemy-charmers / layer-haproxy


Haproxy 0.1.5 (r3) backtraces on fresh install #8

Open digitalrane opened 5 years ago

digitalrane commented 5 years ago

The haproxy charm, in a fresh deployment relying solely on the default configuration apart from setting 'version' to '1.9' (tested previously, and that worked), backtraced on install:

The unit currently has no relations, so I'm wondering whether an initial deployment with no relations is a problem after the work we've done to support multiple ports, or something similar. Needs further debugging.

Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-haproxy-0/.venv/lib/python3.5/site-packages/charms/reactive/__init__.py", line 73, in main                                                       
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-haproxy-0/.venv/lib/python3.5/site-packages/charms/reactive/bus.py", line 390, in dispatch                                                       
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-haproxy-0/.venv/lib/python3.5/site-packages/charms/reactive/bus.py", line 359, in _invoke                                                        
    handler.invoke()
  File "/var/lib/juju/agents/unit-haproxy-0/.venv/lib/python3.5/site-packages/charms/reactive/bus.py", line 181, in invoke                                                         
    self._action(*args)
  File "/var/lib/juju/agents/unit-haproxy-0/charm/reactive/haproxy.py", line 50, in configure_haproxy                                                                              
    ph.add_timeout_tunnel()
  File "lib/libhaproxy.py", line 36, in add_timeout_tunnel
    defaults = self.proxy_config.defaults[0]
IndexError: list index out of range
digitalrane commented 5 years ago

This also happens without the '1.9' version set. The problem appears to be that 'defaults' is empty, because the config at that point has no defaults section, so additional logic may be needed to detect that and handle it (see the sketch at the end of this comment). At this point the haproxy configuration only has the stats frontend, the default port 80 frontend, and the redirect backend:

frontend stats 
    bind 0.0.0.0:9000 
    stats enable 
    stats auth admin:admin 
    stats uri /ha-stats 
    acl local src 10.0.0.0/8 192.168.0.0/16 127.0.0.0/8
    http-request deny if !local 

frontend relation-80 
    bind 0.0.0.0:80 
    default_backend redirect  

backend redirect
    redirect scheme https 
    server redirect 127.0.0.1:0 

I've also just noticed that we're missing the 172.16.0.0/12 private range from our local ACL.
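A minimal sketch of the 'detect and skip' guard being discussed, using the method and attribute names from the traceback above (add_timeout_tunnel, proxy_config.defaults) and assuming the config is parsed with pyhaproxy; the stand-in class name and the skip-instead-of-raise behaviour are illustrative only, not the charm's actual code:

from pyhaproxy.parse import Parser


class ProxyHelper(object):
    # Stand-in for the helper class in lib/libhaproxy.py.

    def __init__(self, config_file='/etc/haproxy/haproxy.cfg'):
        self.proxy_config = Parser(config_file).build_configuration()

    def add_timeout_tunnel(self):
        if not self.proxy_config.defaults:
            # No defaults section was parsed; skip rather than raise
            # IndexError and fail the hook.
            return
        defaults = self.proxy_config.defaults[0]
        # ... existing logic that ensures "timeout tunnel 1h" is present ...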

digitalrane commented 5 years ago

This is an example where parsing application config in a charm has caused problems that templating the entire config would have avoided, by the way :stuck_out_tongue:

I'm digging into this, and with no defaults section defined there is no clear way to coax pyhaproxy into creating one. I'm going to keep poking at it.

chris-sanders commented 5 years ago

I'm installing it now to check. It's been a while, but my recollection is that the defaults section is created when the HAProxy package is installed, and the charm only appends one setting to it, otherwise accepting the package defaults.

So I'm curious: was the file modified after install, or was 1.9 set during install and then switched back to cause the error (that's not supported)? Understanding what's causing this will help inform how to fix it. While we can obviously write a defaults section, I'd like to know why it's missing in the first place.

chris-sanders commented 5 years ago

I've just run the test for 0.1.5 and it passes on xenial/bionic. Do you have a bundle that reproduces this?

digitalrane commented 5 years ago

The following bundle reproduces this issue for me, with either 1.9 or 1.7 set. I'm not able to reproduce it on a local LXD provider, but when I deployed on a server using bridged networking (which I mention just for completeness' sake) I got the backtrace above.

series: bionic
applications:
  haproxy:
    charm: cs:~pirate-charmers/haproxy
    num_units: 1
    series: xenial
    expose: true
    options:
      version: '1.9'

My process was as follows -

The defaults section looks like this after deploy:

defaults           
    log global                        
    mode http                    
    option httplog                          
    option dontlognull                
    timeout connect 5000            
    timeout client 50000       
    timeout server 50000                  
    errorfile 400 /etc/haproxy/errors/400.http
    errorfile 403 /etc/haproxy/errors/403.http
    errorfile 408 /etc/haproxy/errors/408.http
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
    errorfile 504 /etc/haproxy/errors/504.http
    timeout tunnel 1h

I think I wrote the title of the PR thinking I fully grokked the problem, but you're right: many of those options come from the default config file installed with the package rather than from the charm's configure hook.

In retrospect, I think what is happening, and what this addresses (and what I initially intended it to address), is that we were accessing the first index of the defaults configuration without testing whether the defaults section actually had anything in it; the change adds the needed safety there. Looking at it more closely, I think there is more likely a race condition: either we are parsing the config before there is a defaults section, or we are trying to access the defaults section before the config has been parsed.
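If the race really is that we parse /etc/haproxy/haproxy.cfg before the package has written it, one way to make that case explicit is to check for the file (and for a parsed defaults section) before acting on it. A sketch only, assuming pyhaproxy's Parser/build_configuration interface; the safe_parse helper is hypothetical:

import os

from pyhaproxy.parse import Parser

HAPROXY_CONF = '/etc/haproxy/haproxy.cfg'


def safe_parse(path=HAPROXY_CONF):
    # Return a parsed configuration, or None if the file isn't usable yet.
    if not os.path.exists(path):
        # The package hasn't written its config yet; defer to a later hook.
        return None
    config = Parser(path).build_configuration()
    if not config.defaults:
        # The file exists but has no defaults section; also treat as not ready.
        return None
    return config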

I agree that clobbering the config is the wrong approach, but I'm not 100% sure that's what is actually happening here. We definitely don't want to clobber the config, but I also don't want to throw errors and fail the hook if we can handle the situation gracefully, which this code appears to do in my testing.

EDIT: removed my local charm hack because my charmstore access is still busted :-1:

chris-sanders commented 5 years ago

If you're talking about the fix in #9, I'm curious: when you use that, do you still get the full defaults section? If this is a race, then when you hit that error you've read in a file with no defaults and will write back over the top of it. Whatever was supposed to create that section is then likely to come along and either overwrite the one we've configured or bail because it didn't expect to find a file there.

This must be a race, as you describe, since the behaviour changes based on where it's deployed.

digitalrane commented 5 years ago

The defaults section I provided above was from a charm built with #9 as the only delta from current master, so it fixed the issue and didn't seem to create any problems for me. I think it's potentially unsafe to access the [0]th index of the defaults section when the config file might not yet exist, or when the section might not be there, and I believe that as soon as the config file does appear, things work as expected.

chris-sanders commented 5 years ago

I agree that accessing it before the file exists is an issue. Accessing the '0' default should be safe; if it's not, we're in an unexpected state. I think we should find the race condition, because I can't explain how you would both hit this bug and still get all the defaults. I'm concerned that when re-testing you're simply not hitting the race condition; can you put in a log so we can confirm it's exercising the check?
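For reference, a single charmhelpers log call inside the guarded branch would show up in juju debug-log whenever the check actually fires. A sketch; the helper name and message text are hypothetical:

from charmhelpers.core import hookenv


def log_missing_defaults():
    # Hypothetical helper: record that the defaults guard was exercised,
    # so the skip is visible in juju debug-log.
    hookenv.log('add_timeout_tunnel: no defaults section in haproxy.cfg, '
                'skipping for this hook run', level=hookenv.WARNING)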

digitalrane commented 5 years ago

I actually think the first access is probably a no-op, because all it's checking for is a tunnel timeout and then configuring based on that. Even if we just skipped the tunnel-timeout part of configuration when defaults isn't defined, that would avoid this behaviour, which is basically what's happening now from what I can tell. I'm not familiar enough with the flag logic in use here to say why this might be running before the package is installed (and therefore before the config is in place). I'll try to reproduce it and see if I can also use one of the apt flags to guard this function from running before the installs have finished.

chris-sanders commented 5 years ago

Actually, tunnel_timeout is not part of the default config; it's the one option that the charm hard-codes on. At least from looking through the code, this is already protected by the haproxy.installed flag. Is it possible fetch.install is non-blocking, such that the installed flag is set but the install is not yet complete?

Maybe using the apt layer would solve this (sketched at the end of this comment). The more I look at this, the more convinced I am that the fix would stop the error but clobber the package defaults. I just can't figure out how fetch could install the package and yet the config file not be written by the time the next hook runs and tunnel_timeout executes.

I've got an assumption here that's wrong somewhere, because I can't see how this is triggering.
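As a point of reference for the apt-layer idea above: layer:apt raises an apt.installed.<package> flag only once the package has actually been installed, so the configure handler could be gated on that. A rough sketch, not the charm's current reactive code; the haproxy.queued and haproxy.configured flags are hypothetical, while configure_haproxy and ph.add_timeout_tunnel come from the traceback earlier in the thread:

import charms.apt
from charms.reactive import set_flag, when, when_not


@when_not('haproxy.queued')
def queue_haproxy():
    # layer:apt installs the package and raises apt.installed.haproxy once
    # dpkg has finished, so later handlers can rely on the files being on disk.
    charms.apt.queue_install(['haproxy'])
    set_flag('haproxy.queued')


@when('apt.installed.haproxy')
@when_not('haproxy.configured')
def configure_haproxy():
    # By this point /etc/haproxy/haproxy.cfg (including its defaults section)
    # exists, so parsing it and calling ph.add_timeout_tunnel() should be safe.
    # ... existing configuration logic ...
    set_flag('haproxy.configured')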