pg_basebackup: could not connect to server: could not translate host name

ramonskie commented 8 years ago

I just tried your release, but I get this strange error:

[repl:slave] DATADIR (/var/vcap/store/postgres/db) not found; seeding from preferred master ([172.21.42.167])
pg_basebackup: could not connect to server: could not translate host name "[172.21.42.167]" to address: Name or service not known
pg_ctl: directory "/var/vcap/store/postgres/db" does not exist

The template I use: http://pastebin.com/vv0n9Zxw

It fails on the first node (master):

  Started updating job postgres
  Started updating job postgres > postgres/0 (canary). Failed: `postgres/0' is not running after update (00:01:29)

Error 400007: `postgres/0' is not running after update

jhunt commented 8 years ago

The postgres.replication.master property needs to be a scalar, not a list. Try changing this:

    properties:
      postgres:
        replication:
          master:
            - 172.21.42.167

to this:

    properties:
      postgres:
        replication:
          master: 172.21.42.167

The startup script is stringifying your array, which is where the weird square brackets in could not translate host name "[172.21.42.167]" come from.
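
A minimal sketch of a guard against this mistake, written in Python purely for illustration (the release's actual startup script is not shown in this thread). The helper name `normalize_master` is hypothetical; it accepts either a scalar or a one-element list and always returns a bare address, so no brackets can leak into the host name.

```python
def normalize_master(value):
    """Return the master address as a plain string.

    Hypothetical helper, not part of the release: accepts a scalar or a
    one-element list and rejects anything else.
    """
    if isinstance(value, str):
        return value.strip()
    if isinstance(value, (list, tuple)):
        if len(value) != 1:
            raise ValueError("expected exactly one master, got %d" % len(value))
        return normalize_master(value[0])
    raise TypeError("master must be a string or a one-element list")


# A YAML list such as "- 172.21.42.167" is parsed into a list,
# while "master: 172.21.42.167" is parsed into a plain string.
print(normalize_master(["172.21.42.167"]))
print(normalize_master("172.21.42.167"))
```

Rejecting lists with more than one entry, rather than silently picking the first, keeps a misconfigured manifest from seeding against the wrong node.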

ramonskie commented 8 years ago

Okay, a different error now:

[repl:slave] DATADIR (/var/vcap/store/postgres/db) not found; seeding from preferred master (172.21.42.167)
pg_basebackup: could not connect to server: could not connect to server: Connection refused
        Is the server running on host "172.21.42.167" and accepting
        TCP/IP connections on port 6432?

This is a bit of inception, because postgres needs to be up and running before it can connect to itself.

I also see the following in the monit/postgres log:

ls: cannot access /var/vcap/packages/*/*/*.jar: No such file or directory
$PATH /var/vcap/packages/postgres/bin:/var/vcap/packages/pgpool2/bin:/bin:/usr/bin:/sbin:/usr/sbin

jhunt commented 8 years ago

Can you pastebin /var/vcap/jobs/postgres/bin/ctl?

ramonskie commented 8 years ago

http://pastebin.com/SJ6Ns47p

jhunt commented 8 years ago

Is your deployment multi-homed?

ramonskie commented 8 years ago

Multi-homed? I don't get what you mean.

jhunt commented 8 years ago

You've got two networks on your postgres nodes:

    networks:
      - name: default
        default: [dns, gateway]
      - name: floating
        static_ips:
          - 172.21.42.167
          - 172.21.42.170
          - 172.21.42.171

On postgres/0 (master), the interface attached to the default network is getting the IP 172.21.28.224, which doesn't match the configured master IP. That makes the bin/ctl script think the node is actually a slave of 172.21.42.167, which causes the chicken-and-egg problem with seeding.

(Full disclosure, I haven't tested this on AWS yet, just vSphere and Warden CPIs)

Can you flip the order of the network definitions to this:

    jobs:
      - name: postgres
        # ...
        networks:
          - name: floating
            static_ips:
              - 172.21.42.167
              - 172.21.42.170
              - 172.21.42.171
          - name: default
            default: [dns, gateway]

ramonskie commented 8 years ago

I have just one network. The static IPs are floating IPs in OpenStack and are not known to the VM itself; they are just routed to the IPs the VMs get from the dynamically assigned default network.

For example, if a VM gets the IP 10.0.0.1, then 172.21.42.167 is routed to 10.0.0.1, so the floating IP is not configured in the VM itself.

jhunt commented 8 years ago

Odd. What is 172.21.28.224?

ramonskie commented 8 years ago

That's the dynamically assigned IP address.

jhunt commented 8 years ago

Currently, this release relies on address introspection to determine who the master node is. If the floating statics are managed/routed external to the VM itself, the release (as written) will not work.

Can you pastebin ip addr show from the postgres/0 node?

ramonskie commented 8 years ago

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether fa:16:3e:54:aa:5c brd ff:ff:ff:ff:ff:ff
    inet 172.21.28.224/24 brd 172.21.28.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe54:aa5c/64 scope link 
       valid_lft forever preferred_lft forever
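
The output above shows why address introspection fails here: the floating IP never appears on any interface. A minimal sketch of that kind of check, in Python for illustration only (the release's bin/ctl is not reproduced in this thread, and the function names are hypothetical). The sample text is abbreviated from the `ip addr show` output above.

```python
import re

# Abbreviated from the `ip addr show` output posted above.
IP_ADDR_OUTPUT = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    inet 172.21.28.224/24 brd 172.21.28.255 scope global eth0
    inet6 fe80::f816:3eff:fe54:aa5c/64 scope link
"""


def local_ipv4_addresses(ip_addr_output):
    """Extract the IPv4 addresses from `ip addr show` output (inet6 lines are skipped)."""
    pattern = r"^[ \t]*inet (\d+\.\d+\.\d+\.\d+)/"
    return re.findall(pattern, ip_addr_output, flags=re.MULTILINE)


def looks_like_master(ip_addr_output, master_ip):
    """Hypothetical introspection check: is the configured master IP bound locally?"""
    return master_ip in local_ipv4_addresses(ip_addr_output)


print(local_ipv4_addresses(IP_ADDR_OUTPUT))
print(looks_like_master(IP_ADDR_OUTPUT, "172.21.42.167"))
```

With the floating IP 172.21.42.167 as the configured master, the check returns False on postgres/0, so the node would treat itself as a slave.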

jhunt commented 8 years ago

Drat.

I will work up a patch to force the master to be the 0th node, since we only support one master. That will take care of this issue.
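
As a rough illustration of that idea (not the actual patch), role selection by instance index could look like the sketch below. It is written in Python for illustration only, and the function name `role_for` is hypothetical.

```python
def role_for(index):
    """Pick the replication role from the job instance index.

    Hypothetical sketch: instance 0 is always the master, every other
    instance is a slave, regardless of which IPs the VM can see.
    """
    if index < 0:
        raise ValueError("instance index must be non-negative")
    return "master" if index == 0 else "slave"


for i in range(3):
    print("postgres/%d -> %s" % (i, role_for(i)))
```

Unlike address introspection, this does not depend on the floating IP being configured inside the VM.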

ramonskie commented 8 years ago

I think this is a pretty specific issue for a small set of users, but if you can apply a patch it would be awesome :+1:

jhunt commented 8 years ago

Can you try a dev-release off of origin/master? ce8057c should fix this for you. If so, I'll go ahead and cut a new final version.

ramonskie commented 8 years ago

postgres/0 now works, but postgres/1 fails:

[repl:slave] DATADIR (/var/vcap/store/postgres/db) not found; seeding from preferred master (172.21.42.167)
pg_basebackup: could not connect to server: FATAL:  number of requested standby connections exceeds max_wal_senders (currently 0)
pg_ctl: directory "/var/vcap/store/postgres/db" does not exist

The only thing in my postgres.conf is:

    # postgres main configuration
    port = 6432
    listen_addresses = '*'
    hot_standby = 'on'

jhunt commented 8 years ago

Oops. I missed a "does my IP == master" check in the postgresql.conf. Try now? (Commit db6a5cb has the fix for max_wal_senders.)
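
For context, the error means the master was started without any WAL sender slots, so pg_basebackup had nothing to connect to. Below is a minimal sketch of role-dependent settings, in Python for illustration only. It is not the content of the commit; the function name, the value 5, and the choice of `wal_level = hot_standby` (named `replica` from PostgreSQL 9.6 onward) are assumptions.

```python
def replication_settings(is_master, max_wal_senders=5):
    """Render a postgresql.conf fragment for the given role.

    Hypothetical sketch: the common lines mirror the config posted above;
    the master-only lines are assumptions, not taken from the release.
    """
    settings = {
        "port": "6432",
        "listen_addresses": "'*'",
        "hot_standby": "'on'",
    }
    if is_master:
        # The master must allow WAL sender connections for pg_basebackup
        # and streaming replication to work.
        settings["wal_level"] = "hot_standby"
        settings["max_wal_senders"] = str(max_wal_senders)
    return "".join("%s = %s\n" % (key, value) for key, value in settings.items())


print(replication_settings(is_master=True))
```

If the master-only branch is gated on an IP comparison that never matches, the rendered file looks like the one posted above and max_wal_senders keeps its old default of 0.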

ramonskie commented 8 years ago

Deployment is now successful :+1:

jhunt commented 8 years ago

Awesome. I'm going to hold off on cutting a new version, since I'm still working through some smoke-test issues and we still aren't at v1 :smiley:

cloudfoundry-community / postgres-boshrelease, issue #2