apple / foundationdb

FoundationDB - the open source, distributed, transactional key-value store
https://apple.github.io/foundationdb/
Apache License 2.0

Internal RPC error with basic fdbcli connection #222

Open atombender opened 6 years ago

atombender commented 6 years ago
$ /usr/local/bin/fdbcli -C ./client.cluster
Using cluster file `./client.cluster'.
Internal Error @ fdbrpc/FlowTransport.actor.cpp 657:
  atos -o fdbcli.debug -arch x86_64 -l 0x101fe7000 0x10233a5cd 0x1022a5e91 0x1022a514d 0x101feca78 0x102361916 0x102349631 0x1020bae5b 0x10200a6de 0x7fff59e48115
[lots more errors]

Server: Custom Docker container based on ubuntu:16.04 that has installed foundationdb-clients_5.1.5-1_amd64.deb and foundationdb-server_5.1.5-1_amd64.deb. Using default config.

Client: Installed with macOS installer, FoundationDB-5.1.5.pkg. macOS 10.13.3.

Full fdbcli output and logs from server here.

dkoston commented 6 years ago

@atombender I'm getting the same. Did you sort this out?

To add to this report: the error isn't specific to fdbcli. It also happens with the Go client:

package main

import (
    "fmt"
    "log"

    "github.com/apple/foundationdb/bindings/go/src/fdb"
)

func main() {
    // Different API versions may expose different runtime behaviors.
    fdb.MustAPIVersion(510)

    // Open the default database from the system cluster
    db := fdb.MustOpenDefault()

    // Database reads and writes happen inside transactions
    ret, e := db.Transact(func(tr fdb.Transaction) (interface{}, error) {
        tr.Set(fdb.Key("hello"), []byte("world"))
        return tr.Get(fdb.Key("foo")).MustGet(), nil
        // db.Transact automatically commits (and if necessary,
        // retries) the transaction
    })
    if e != nil {
        log.Fatalf("Unable to perform FDB transaction (%v)", e)
    }

    fmt.Printf("hello is now world, foo was: %s\n", string(ret.([]byte)))
}

Connecting on the docker container to fdbcli works fine:

[~/testing/foundationdb-kubernetes](master [!])$ de docker_fdb_1
root@22f18ed2d173:/var/lib/foundationdb# fdbcli
Using cluster file `fdb.cluster'.

The database is available.

Welcome to the fdbcli. For help, type `help'.
fdb> writemode on
fdb> set "foo" "bar"
Committed (1052744750)
fdb> get "foo"
`foo' is `bar'

dkoston commented 6 years ago

This may be based on port mapping in docker. I:

Perhaps this line is checking to see if the port numbers match and failing out when the mapped port on the host doesn't match the source port on the container:

https://github.com/apple/foundationdb/blob/master/fdbrpc/FlowTransport.actor.cpp#L655

There's also another issue which is that fdbcli seems to try and use the config from the foundationdb server to connect to the cluster controller:

$ fdbcli -C fdb.cluster
Using cluster file `fdb.cluster'.

The database is unavailable; type `status' for more information.

Welcome to the fdbcli. For help, type `help'.
fdb> status

Using cluster file `fdb.cluster'.

Unable to communicate with the cluster controller at 172.18.0.2:4500 to get
status.

To fix this issue, I tried binding to 127.0.0.1:4500 but that is still having issues:


Could not communicate with a quorum of coordination servers:
  127.0.0.1:4500  (unreachable)

dkoston commented 6 years ago

Ah, fixed it on a single node by binding to 0.0.0.0:4500 and public address 127.0.0.1:4500

Still trying to sort out multiple nodes as that setup doesn't work

dkoston commented 6 years ago

Here's a docker-compose setup that handles the heavy lifting of running a cluster for you:

https://github.com/dkoston/foundationdb-kubernetes/tree/master/docker

Not sure what to advertise on the public_address to get a cluster working and to be able to connect outside docker.

Changing run.sh to use --public_address auto:4500 allows the cluster to spin up successfully but the host machine cannot connect. Using --public_address 127.0.0.1:4500 stops the cluster from spinning up as the machines can't talk to each other over the docker network.

I was hoping to proxy_pass with nginx on port 4500 but that results in "Unable to locate a cluster controller within 2 seconds. Check that there are server processes running."

atombender commented 6 years ago

Thanks, @dkoston! I was never able to work around the error I got, since I'm deploying this on Kubernetes. I will look at your stuff and see if I can use it.

alexmiller-apple commented 5 years ago

Sorry to finally leave a comment here:

FDB is probably a bit overly aggressive in asserting things about its network configuration. That is meant to be helpful, but in a container/Kubernetes world it might not be. In particular, the NAT that Docker performs when re-exposing containerized services to the host has two very unexpected effects: the IP the process thinks it has is not reachable externally, and the port the process thinks it is using doesn't match the port a peer sees.

The short easy fix here is to just run your client in the same docker-compose created network, because then all the NAT behavior disappears. The longer fix involves things like allowing hostnames to be used for inter-cluster process lookup, and figuring out what to do about the port mismatch issues.
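
To make the two effects concrete, here is a minimal Go sketch that compares the address a process advertises with the address a peer observes after NAT. It only illustrates the mismatch described above and is not FoundationDB's actual check in FlowTransport. The helper name addressMismatch and the host-side address 127.0.0.1:32768 are hypothetical; 172.18.0.2:4500 is the container address reported earlier in this thread.

```go
package main

import (
	"fmt"
	"net"
)

// addressMismatch reports which parts of the advertised address differ from
// the address a peer observed. Hypothetical helper, for illustration only.
func addressMismatch(advertised, observed string) (ipDiffers, portDiffers bool, err error) {
	aHost, aPort, err := net.SplitHostPort(advertised)
	if err != nil {
		return false, false, fmt.Errorf("advertised address: %w", err)
	}
	oHost, oPort, err := net.SplitHostPort(observed)
	if err != nil {
		return false, false, fmt.Errorf("observed address: %w", err)
	}
	return aHost != oHost, aPort != oPort, nil
}

func main() {
	// Inside the container the server believes it is 172.18.0.2:4500.
	// Through a Docker port mapping, a host-side peer might instead see
	// something like 127.0.0.1:32768 (assumed values).
	ipDiffers, portDiffers, err := addressMismatch("172.18.0.2:4500", "127.0.0.1:32768")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("IP differs: %v, port differs: %v\n", ipDiffers, portDiffers)
}
```

When the client runs inside the same docker-compose network, both sides see the same address and neither flag would be set.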

dkoston commented 5 years ago

@alexmiller-apple thanks for the reply. For those following the thread, I ended up doing the following:

This allows the app to re-resolve the hostname each time it restarts. If the connection drops, you can call the same function during reconnect to make sure the IP is up to date.

Having hostname resolution in fdbcli and the bindings themselves would be nice but it's not much code to write to work around the issue for now.
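
For readers who want a starting point, here is a rough Go sketch of that kind of workaround: resolve a hostname and rewrite the cluster file with the resulting IPs before connecting. It is not the code from this thread or from production. The function names, the "docker:docker" description and ID, the use of "localhost" as a stand-in hostname, and the choice to keep only IPv4 results are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// resolver has the same signature as net.LookupHost, so a fake can be swapped in.
type resolver func(host string) ([]string, error)

// clusterFileContents resolves host and renders one cluster file line in the
// form description:id@ip:port,ip:port. Only IPv4 results are kept, which is
// an assumption made for this sketch.
func clusterFileContents(lookup resolver, description, id, host string, port int) (string, error) {
	addrs, err := lookup(host)
	if err != nil {
		return "", fmt.Errorf("resolving %q: %w", host, err)
	}
	var coords []string
	for _, a := range addrs {
		if ip := net.ParseIP(a); ip != nil && ip.To4() != nil {
			coords = append(coords, net.JoinHostPort(a, strconv.Itoa(port)))
		}
	}
	if len(coords) == 0 {
		return "", fmt.Errorf("no IPv4 address found for %q", host)
	}
	return fmt.Sprintf("%s:%s@%s\n", description, id, strings.Join(coords, ",")), nil
}

// refreshClusterFile re-resolves host and rewrites the cluster file at path.
// Call it at startup and again before reconnecting.
func refreshClusterFile(path, description, id, host string, port int) error {
	contents, err := clusterFileContents(net.LookupHost, description, id, host, port)
	if err != nil {
		return err
	}
	return os.WriteFile(path, []byte(contents), 0o644)
}

func main() {
	// "localhost" stands in for the FDB service hostname; "docker:docker" is
	// a placeholder description and ID.
	path := filepath.Join(os.TempDir(), "client.cluster")
	if err := refreshClusterFile(path, "docker", "docker", "localhost", 4500); err != nil {
		fmt.Println("refresh failed:", err)
		return
	}
	fmt.Println("wrote", path)
}
```

The refreshed file can then be handed to the client, for example with fdbcli -C as shown at the top of this issue.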

seancarroll commented 4 years ago

A little late to the party. @dkoston, do you have a sample of what you ended up doing?

dkoston commented 4 years ago

@seancarroll here’s a gist: https://gist.github.com/dkoston/9b41cfe44c82a345d3a2a664ae5b41cc

The libraries are from production code but some caveats: