apache / cassandra-gocql-driver

GoCQL Driver for Apache Cassandra®
https://cassandra.apache.org/
Apache License 2.0

"no connections were made when creating the session" error #946

Closed andrewdeandrade closed 7 years ago

andrewdeandrade commented 7 years ago

Pull Request #888 is causing CreateSession to exit with a "no connections were made when creating the session" error. I am at a loss on how to provide more details on this error since it is pretty vague.

cc/ @rkuris

Zariel commented 7 years ago

Build with the gocql_debug tag and share the logs.

megaherz commented 7 years ago

I'm new to gocql, but have the same issue:

package main

import (
    "fmt"
    "github.com/gocql/gocql"
)

func main() {
    cluster := gocql.NewCluster("127.0.0.1")
    _, err := cluster.CreateSession()
    if err != nil {
        panic(err)
    }
    fmt.Println("cassandra init done")
}

Resulting in

go build -tags="gocql_debug"
 ./playground 

2017/08/09 13:38:16 gocql: Session.handleNodeUp: 127.0.0.1:9042
2017/08/09 13:38:18 unable to dial "172.20.0.6": dial tcp 172.20.0.6:9042: i/o timeout
2017/08/09 13:38:18 gocql: Session.handleNodeDown: 172.20.0.6:9042
2017/08/09 13:38:20 unable to dial "172.20.0.3": dial tcp 172.20.0.3:9042: i/o timeout
2017/08/09 13:38:20 gocql: Session.handleNodeDown: 172.20.0.3:9042
2017/08/09 13:38:20 gocql: Session.handleNodeUp: 172.20.0.6:9042
2017/08/09 13:38:22 unable to dial "172.20.0.6": dial tcp 172.20.0.6:9042: i/o timeout
2017/08/09 13:38:22 gocql: Session.handleNodeUp: 172.20.0.3:9042
2017/08/09 13:38:22 gocql: Session.handleNodeDown: 172.20.0.6:9042
2017/08/09 13:38:24 unable to dial "172.20.0.3": dial tcp 172.20.0.3:9042: i/o timeout

panic: no connections were made when creating the session

Cassandra is running as two nodes in Docker containers.

Connecting with cqlsh results in

cqlsh --cqlversion=3.4.4
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.0 | CQL spec 3.4.4 | Native protocol v4]
cqlsh> describe keyspaces

system_schema  system_auth  system  system_distributed  system_traces

Zariel commented 7 years ago

Can you post the following:

  • output of nodetool status
  • output of SELECT peer, preferred_ip, rpc_address FROM system.peers;
  • cassandra version

megaherz commented 7 years ago

@Zariel cassandra version: Cassandra 3.11.0

cqlsh> SELECT peer, preferred_ip, rpc_address FROM system.peers ;

 peer       | preferred_ip | rpc_address
------------+--------------+-------------
 172.20.0.6 |         null |  172.20.0.6

Zariel commented 7 years ago

Is 172.20.0.6 routable from your client? Cassandra is telling gocql that its rpc_address is 172.20.0.6, so that is what it is trying to dial. Is it running in Docker? Regardless, set broadcast_rpc_address to the address you expect to be able to dial.
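
(For illustration, a quick standalone way to check from the client host whether the advertised address is dialable; this is a sketch, and 172.20.0.6:9042 is taken from the debug logs above.)

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // Address Cassandra advertised in system.peers / the gocql_debug logs above.
    addr := "172.20.0.6:9042"
    conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
    if err != nil {
        fmt.Printf("%s is NOT reachable from this host: %v\n", addr, err)
        return
    }
    conn.Close()
    fmt.Printf("%s is reachable from this host\n", addr)
}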

megaherz commented 7 years ago

@Zariel I'll check broadcast_rpc_address a bit later, but yes, Cassandra is running as a Docker container (macOS). Here is the compose.yml:

  cassandra-1:
    hostname: cassandra-1
    image: cassandra:latest
    command: /bin/bash -c "sleep 1 && echo ' -- Pausing to let system catch up ... -->' && /docker-entrypoint.sh cassandra -f"
    ports:
      - "9042:9042"
    expose:
      - 7000
      - 7001
      - 7199
      - 9042
      - 9160

  cassandra-2:
    hostname: cassandra-2
    image: cassandra:latest
    command: /bin/bash -c "sleep 30 && echo ' -- Pausing to let system catch up ... -->' && /docker-entrypoint.sh cassandra -f"
    environment:
      - CASSANDRA_SEEDS=cassandra-1
    links:
      - cassandra-1
    expose:
      - 7000
      - 7001
      - 7199
      - 9042
      - 9160

I should also note that with the https://github.com/gocql/gocql/tree/7e9748ccda7fd5135a7db13ba03f09cad0c86bed revision there is no such issue.

tucke commented 7 years ago

Excuse me, has this problem been solved? I have the same problem.

johnweldon commented 7 years ago

I believe this is caused by Docker on macOS specifically, because the Docker container IP address isn't routable (by default) from the macOS host. I was able to work around this by setting CASSANDRA_BROADCAST_ADDRESS in the docker-compose.yml or in the command that starts the container.

In my case I did:

docker run ... -e CASSANDRA_BROADCAST_ADDRESS=127.0.0.1 -p 9042:9042 ... cassandra

APTy commented 7 years ago

I'm also having this issue running C* 2.2.9 in a Docker container on Linux. Pinning to 7e9748ccda7fd5135a7db13ba03f09cad0c86bed has fixed the issue, though I'd prefer a more permanent solution.

I tried setting all of CASSANDRA_BROADCAST_ADDRESS, CASSANDRA_RPC_ADDRESS, and CASSANDRA_LISTEN_ADDRESS to 127.0.0.1 with no success.

Oddly, I don't see the issue if I explicitly expose the port with docker run ... -p 9042:9042. I only see it when I ask it to choose a random port with docker run ... -P. I'm perplexed by that part.

Zariel commented 7 years ago

@APTy @johnweldon @tuyz @megaherz can anyone provide reproducing steps?

johnweldon commented 7 years ago

@Zariel In my case, the setup was this:

Apparently the initial connect worked, but then the node advertised an IP address specific to Docker, which was not visible outside of Docker; gocql tried to connect to that IP address and (understandably) failed, because the Docker-specific IP was not routable from the macOS host.

Zariel commented 7 years ago

What cassandra image? With what config? Which version?

johnweldon commented 7 years ago

I'm sorry @Zariel - I don't have that setup any more; based on what I can reconstruct, this is what I think was running:

$ docker images | grep cassandra
cassandra               3                      535a7b98d04d        4 weeks ago         386MB
docker run -d \
    --restart=always \
    -e CASSANDRA_BROADCAST_ADDRESS=192.168.199.199 \
    -p 7000-7001:7000-7001 \
    -p 7199:7199 \
    -p 9042:9042 \
    -p 9160:9160 \
  cassandra:3

Without CASSANDRA_BROADCAST_ADDRESS I got the error, where it was trying to dial the Docker-internal IP address 172.?.?.? instead of the assigned external address 192.168.199.199.

Once I added the environment variable, it worked.

My client application is configured to connect to Cassandra on 192.168.199.199:9042.

BasPH commented 7 years ago

Same issue here. Going back in commits, I receive the error at commit 77431609f517cb41ee9afdcdd373561c4d935316. With code before that commit, I can connect without issues.

Zariel commented 7 years ago

I can only get it to work if I correctly configure Cassandra running inside Docker to advertise its broadcast address as the docker-machine ip value, via docker run -e CASSANDRA_BROADCAST_ADDRESS=$(docker-machine ip) -p 9042:9042 library/cassandra, which is what I would expect.

The reason that 7743160 made this no longer work is that Cassandra is telling the driver that it is available at (for me) 172.17.0.2. The ring up to this point looks like:

control: 192.168.99.100 192.168.99.100: UP

Then the control connection triggers a refresh of the ring; system.local looks like:

 listen_address | broadcast_address
----------------+-------------------
     172.17.0.2 |        172.17.0.2

The driver then removes 192.168.99.100 from its local ring because it is not in system.peers or system.local, and adds 172.17.0.2. At this point the driver checks whether it has any connections in the connection pool, which it does not, so it returns the (admittedly poor) ErrNoConnections error.

In a correctly configured environment this all happens as expected and the driver connects and will work fine.

Relatedly, there is an issue that the driver uses the remote address of the control connection and adds that to the pool instead of doing a lookup in system.local, which is why the host ends up being removed; that is also why this worked pre-7743160 and did not show up until then.

I'm hesitant to change this behaviour, as the real issue is that Cassandra is not configured correctly; if this were a production cluster I would expect the driver to error out because it is not configured properly, instead of having hacks in different places in the driver to work around invalid Cassandra configurations. We should improve that error message though, as it is thoroughly unhelpful, and improve the documentation about using the driver and what assumptions it makes about the cluster it is connecting to.


Note that cqlsh works without this setting; I'm not entirely sure what setup it is using when doing host discovery.

calvn commented 7 years ago

In a correctly configured environment this all happens as expected and the driver connects and will work fine.

How can we configure docker-cassandra so the driver does not remove the broadcast address from its local ring?

Zariel commented 7 years ago

It won't, as long as the value of broadcast_address is reachable from the driver. Try docker run -e CASSANDRA_BROADCAST_ADDRESS=$(docker-machine ip) -p 9042:9042 library/cassandra

ror6ax commented 7 years ago

We're hit by this through Vault, as marked in the issue above.

Broadcast address is set properly in our case, so this is not the root cause of the problem.

Can I suggest reverting to 7e9748c until the investigation is done?

ror6ax commented 7 years ago

@Zariel - can you please comment on this? Thanks a ton.

Zariel commented 7 years ago

@ror6ax can you please run

SELECT listen_address, rpc_address, broadcast_address FROM system.local; and SELECT peer, rpc_address, preferred_ip FROM system.peers;

and, if possible, rebuild Vault with the gocql_debug tag and post the output.

ror6ax commented 7 years ago

Here you go: SELECT listen_address, rpc_address, broadcast_address FROM system.local; gives

listen_address='10.255.11.243', rpc_address='10.255.11.243', broadcast_address='10.255.11.243'

and SELECT peer, rpc_address, preferred_ip FROM system.peers gives

peer='10.255.8.12', rpc_address='10.255.8.12', preferred_ip=None
peer='10.255.7.69', rpc_address='10.255.7.69', preferred_ip=None

ror6ax commented 7 years ago

@Zariel - does this tell you anything new? We're still not able to make Vault work...

ror6ax commented 7 years ago

We are unable to use gocql unless https://github.com/gocql/gocql/pull/888/commits/43497d0755ed17a779855435df40474fe21171a7 is reverted. Please advise.

Zariel commented 7 years ago

Can you please open another ticket and try to figure out WHY that should be reverted? I want to understand WHY this is causing an issue so that a test can be added and the issue fixed, instead of a knee-jerk revert.

maps90 commented 7 years ago

I'm getting the same issue if I'm using a proxy to connect to Cassandra. I need to revert to 7e9748c to get it working.

2017/10/06 19:03:11 gocql: Session.handleNodeUp: 127.0.0.1:9042
2017/10/06 19:03:13 unable to dial "192.168.1.151": dial tcp 192.168.1.151:9042: i/o timeout
2017/10/06 19:03:13 gocql: Session.handleNodeDown: 192.168.1.151:9042
2017/10/06 19:03:15 unable to dial "192.168.1.150": dial tcp 192.168.1.150:9042: i/o timeout
2017/10/06 19:03:15 gocql: Session.handleNodeDown: 192.168.1.150:9042
2017/10/06 19:03:15 gocql: Session.handleNodeUp: 192.168.1.151:9042
2017/10/06 19:03:17 unable to dial "192.168.1.151": dial tcp 192.168.1.151:9042: i/o timeout
2017/10/06 19:03:17 gocql: Session.handleNodeDown: 192.168.1.151:9042
2017/10/06 19:03:17 gocql: Session.handleNodeUp: 192.168.1.150:9042
2017/10/06 19:03:19 unable to dial "192.168.1.150": dial tcp 192.168.1.150:9042: i/o timeout
2017/10/06 19:03:19 gocql: Session.handleNodeDown: 192.168.1.150:9042
2017/10/06 19:03:19 cassandra DB Connection Error:  no connections were made when creating the session

ror6ax commented 7 years ago

@Zariel - would you accept a PR with a flag to disable the related functionality in gocql? I suspect that, just as in my case, people can't simply change a production Cassandra setup.

Zariel commented 7 years ago

You can already disable the initial host lookup and all host events if you like: https://github.com/gocql/gocql/blob/2416cf340d32ee20794e739fa794968858295098/cluster.go#L98
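
(A sketch of what that looks like in client code; the field names are as in the cluster.go linked above, but behaviour may differ across driver versions.)

package main

import (
    "fmt"
    "github.com/gocql/gocql"
)

func main() {
    cluster := gocql.NewCluster("192.168.99.100")
    // Use only the hosts given here; skip the system.local/system.peers lookup
    // that swaps in the (possibly unroutable) broadcast addresses.
    cluster.DisableInitialHostLookup = true
    // Optionally also ignore node status and topology events from the cluster.
    cluster.Events.DisableNodeStatusEvents = true
    cluster.Events.DisableTopologyEvents = true

    session, err := cluster.CreateSession()
    if err != nil {
        panic(err)
    }
    defer session.Close()
    fmt.Println("connected without initial host lookup")
}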

Zariel commented 7 years ago

@ror6ax also, I would like to fix the root cause of the issue and have a test that proves it's fixed; otherwise there is no way to know if it gets reintroduced. No one has yet debugged and figured out the root cause, and just saying "revert X" is not helpful, as it only bandages over the issue.

pag-r commented 6 years ago

It looks like the port is ignored even when added manually into the code:

package main

import (
    "fmt"
    "github.com/gocql/gocql"
)

func main() {
    cluster := gocql.NewCluster("127.0.0.1")
    cluster.Port = 9043
    _, err := cluster.CreateSession()
    if err != nil {
        panic(err)
    }
    fmt.Println("cassandra init done")
}

The output of the command is: panic: gocql: unable to create session: unable to discover protocol version: dial tcp 127.0.0.1:9043: getsockopt: connection refused

Zariel commented 6 years ago

I don't know what you mean; in the error it dialled 127.0.0.1:9043, so what did you expect to happen? tcp 127.0.0.1:9043: getsockopt: connection refused

pag-r commented 6 years ago

Yes, you're right, this doesn't explain anything, and it's not a gocql issue. The issue itself remains connected to Vault always using the default port, even when a different one is set.

jefferai commented 6 years ago

@Zariel I believe that this used to work. It may be a purposeful breaking change, but if not, perhaps both styles could be accepted.

calvn commented 6 years ago

I think the issue here is twofold. There was a change in behavior that is now less forgiving if the Cassandra configuration is not set properly (i.e. the broadcast address is incorrectly specified), as pointed out in https://github.com/gocql/gocql/issues/946#issuecomment-326807642, which is a completely valid reason for a fix.

However, I believe the underlying root cause of continued no connections were made when creating the session errors, even after correctly setting the broadcast address, is that gocql does not respect the port passed as part of the host. The error message just happened to be the same, which made things a bit confusing. I've opened a separate GitHub issue with greater detail and repro cases on this.

The example provided in https://github.com/gocql/gocql/issues/946#issuecomment-339659163 is the successful case where the port is provided explicitly (i.e. cluster.Port = 9043), not the case where the bug occurs, i.e. when the port is passed as part of the host ("127.0.0.1:9043").
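
(For reference, a small repro sketch contrasting the two ways of specifying the port; it assumes a node listening on 9043, as in the example above.)

package main

import (
    "fmt"
    "github.com/gocql/gocql"
)

func main() {
    // Case 1: explicit port via cluster.Port -- the successful case referenced above.
    withPortField := gocql.NewCluster("127.0.0.1")
    withPortField.Port = 9043

    // Case 2: port embedded in the host string -- the case reported here as not
    // being respected by the driver at the time of this thread.
    withHostPort := gocql.NewCluster("127.0.0.1:9043")

    for name, cluster := range map[string]*gocql.ClusterConfig{
        "cluster.Port field": withPortField,
        "host:port string":   withHostPort,
    } {
        session, err := cluster.CreateSession()
        if err != nil {
            fmt.Printf("%s: %v\n", name, err)
            continue
        }
        session.Close()
        fmt.Printf("%s: connected\n", name)
    }
}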

Emixam23 commented 4 years ago

I am having the same issue. I am trying it for the first time but running into no connections were made when creating the session.

I saw some people using Docker, but where does this Docker come from?

alourie commented 4 years ago

@Emixam23 I believe people are using Docker to run the Cassandra cluster; this has nothing to do with gocql. What exactly is the issue you're having?

Emixam23 commented 4 years ago

Actually, I did find out: it reports no connections were made when creating the session when an error happens; it doesn't matter what error it is (as far as I could see).

My issue was that the keyspace didn't exist in my local database.
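
(One way to avoid tripping over a missing keyspace is to create the session without a keyspace and create the keyspace explicitly. This is a rough sketch; the keyspace name and replication settings are illustrative.)

package main

import (
    "fmt"
    "github.com/gocql/gocql"
)

func main() {
    // No cluster.Keyspace is set, so a missing keyspace cannot fail session creation.
    cluster := gocql.NewCluster("127.0.0.1")
    session, err := cluster.CreateSession()
    if err != nil {
        panic(err)
    }
    defer session.Close()

    // Create the keyspace up front (illustrative name and replication).
    err = session.Query(`CREATE KEYSPACE IF NOT EXISTS example
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}`).Exec()
    if err != nil {
        panic(err)
    }
    fmt.Println("keyspace ready")
}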

deepakyadav commented 2 years ago

In my case, I had set NumConns to 0 by mistake.

StephanieSunshine commented 1 year ago

Hello, I just ran into this as a problem that would sporadically happen.

In my case, no connections were made when creating the session wouldn't always happen. I had checked listen_address, rpc_address, and broadcast_address; at the time they had been set to the local network IP, localhost, and the network IP (their IPs specifically, not their hostnames; this comes up later). I changed all of these addresses to 127.0.0.1 in the scylla.yaml server config and restarted. cqlsh verified the changes, but my application still had a problem where the simple query I had made wouldn't always run, with the same error message. I found that when I changed my code to connect to the IP address of localhost instead of using localhost as a host for DNS lookup, my problem went away.

/* from */
c := gocql.NewCluster("localhost")
/* to */
c := gocql.NewCluster("127.0.0.1")

My hosts file has an entry for localhost, and my local system utilities have no problem finding it. However, I just noticed that my hosts file has both an IPv4 and an IPv6 entry for localhost. This makes me wonder whether gocql, when doing its DNS lookup, sometimes tries to connect to the IPv6 address, where Scylla isn't running, causing the error.
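
(If that theory is right, checking what localhost resolves to on the client machine should show it; a minimal sketch.)

package main

import (
    "fmt"
    "net"
)

func main() {
    // Print every address "localhost" resolves to; if ::1 is listed and Scylla
    // only listens on 127.0.0.1, a dial to the IPv6 address would fail.
    addrs, err := net.LookupHost("localhost")
    if err != nil {
        panic(err)
    }
    fmt.Println(addrs) // e.g. [127.0.0.1 ::1]
}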

My operating system is Debian Bullseye

realcp1018 commented 1 year ago

As mentioned by calvn in https://github.com/gocql/gocql/issues/946#issuecomment-340070506, I believe it would be more beneficial to provide a clearer error stack for this error. In my case I changed the config name for *ClusterConfig.NumConns, so it used NumConns=0 by default, which obviously results in no connections were made. A clearer root cause in ClusterConfig would be very helpful; otherwise we need to inspect every related thing: host? port? NumConns?
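
(For anyone mapping configuration into ClusterConfig by hand, a minimal sketch that pins NumConns explicitly; the exact default may vary by driver version.)

package main

import (
    "fmt"
    "github.com/gocql/gocql"
)

func main() {
    cluster := gocql.NewCluster("127.0.0.1")
    // Connections opened per host; an accidental 0 here is one way to end up
    // with an empty pool and the error discussed in this thread.
    cluster.NumConns = 2

    session, err := cluster.CreateSession()
    if err != nil {
        panic(err)
    }
    defer session.Close()
    fmt.Println("session created")
}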

adi-kmtGD commented 3 months ago

Can you post the following:

  • output of nodetool status
  • output of SELECT peer, preferred_ip, rpc_address FROM system.peers;
  • cassandra version

@Zariel I'm getting

SELECT peer, preferred_ip, rpc_address FROM system.peers;

 peer | preferred_ip | rpc_address
------+--------------+-------------