apache / cassandra-gocql-driver

GoCQL Driver for Apache Cassandra®
https://cassandra.apache.org/
Apache License 2.0

HostPoolHostPolicy causes infinite loop/spin when used as fallback for TokenAwareHostPolicy #1259

Open NWilson opened 5 years ago

NWilson commented 5 years ago

What version of Cassandra are you using?

2.2.8 (but not relevant)

What version of Gocql are you using?

v0.0.0-20181124151448-70385f88b28b (in go.mod)

What did you do?

Used a HostPoolHostPolicy as the fallback for a TokenAwareHostPolicy:

cluster.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(
    gocql.HostPoolHostPolicy(
        hostpool.NewEpsilonGreedy(nil, 0, &hostpool.LinearEpsilonValueCalculator{}),
    ),
)

What did you expect to see?

The token-aware policy falls back correctly to the host-pool policy.

What did you see instead?

The CPU spins at 100%. The gocql code has entered a busy loop.


Cause

In the (*tokenAwareHostPolicy).Pick function, the fallbackIter is queried repeatedly until it returns nil. However, HostPoolHostPolicy's iterator never returns nil: each call simply re-samples the pool and returns its current best host, so the loop spins forever.
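The interaction can be sketched in isolation. This is not the gocql code itself; `SelectedHost`, `NextHost`, `statelessPick`, and `drain` are simplified stand-ins for the real types and for the fallback loop in (*tokenAwareHostPolicy).Pick, and the loop is capped so the demo terminates:

```go
package main

import "fmt"

// Simplified stand-ins for the gocql types, for illustration only.
type SelectedHost = string
type NextHost func() *SelectedHost

// statelessPick mimics hostPoolHostPolicy.Pick: every call re-samples the
// pool and returns a host; it never returns nil.
func statelessPick(host SelectedHost) NextHost {
	return func() *SelectedHost {
		h := host
		return &h
	}
}

// drain mimics the fallback loop in tokenAwareHostPolicy.Pick, which keeps
// calling the iterator until it returns nil. maxIters bounds the loop so
// this demo terminates; the real driver has no such bound.
func drain(next NextHost, maxIters int) int {
	n := 0
	for next() != nil {
		n++
		if n >= maxIters {
			break // in the real driver this point is never reached: it spins forever
		}
	}
	return n
}

func main() {
	iters := drain(statelessPick("10.0.0.1"), 1000)
	fmt.Println("iterations before hitting the cap:", iters)
}
```

Because `statelessPick`'s closure carries no state, `drain` always hits the cap, which is exactly the 100% CPU spin observed.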

Suggested fix:

 func (r *hostPoolHostPolicy) Pick(qry ExecutableQuery) NextHost {
+   used := false
    return func() SelectedHost {
+       if used {
+           return nil
+       }
+       used = true
+
        r.mu.RLock()
        defer r.mu.RUnlock()

        if len(r.hostMap) == 0 {
            return nil
        }

        hostR := r.hp.Get()
        host, ok := r.hostMap[hostR.Host()]
        if !ok {
            return nil
        }

        return selectedHostPoolHost{
            policy: r,
            info:   host,
            hostR:  hostR,
        }
    }
 }
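The effect of the proposed `used` flag can be sketched outside the driver. Again, `SelectedHost`, `NextHost`, and `onceOnlyPick` are hypothetical simplifications, not the gocql types:

```go
package main

import "fmt"

// Simplified stand-ins for the gocql types, for illustration only.
type SelectedHost = string
type NextHost func() *SelectedHost

// onceOnlyPick sketches the proposed fix: the closure remembers whether it
// has already produced a host and returns nil on every subsequent call, so
// callers that iterate until nil terminate.
func onceOnlyPick(host SelectedHost) NextHost {
	used := false
	return func() *SelectedHost {
		if used {
			return nil
		}
		used = true
		h := host
		return &h
	}
}

func main() {
	next := onceOnlyPick("10.0.0.1")
	for h := next(); h != nil; h = next() {
		fmt.Println("picked:", *h)
	}
	fmt.Println("iterator exhausted") // reached after exactly one host
}
```

One consequence of this sketch worth noting: the fallback then offers only a single candidate per Pick, rather than cycling through the pool, which may or may not be the desired retry behaviour.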
beltran commented 5 years ago

Looks like a valid problem and solution to me, I'd open a PR.

dankinder commented 5 years ago

I think what I ran into here has the same root cause. The problem is reliably reproducible by setting a RetryPolicy. This seems to do it:

package main

import (
    "github.com/gocql/gocql"
    hostpool "github.com/hailocab/go-hostpool"
)

/*
Requires this local keyspace:

create keyspace test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
use test;
create table mytable (id text primary key, data text);
*/

func main() {
    conf := gocql.NewCluster("127.0.0.1")
    conf.Keyspace = "test"
    conf.RetryPolicy = &gocql.SimpleRetryPolicy{NumRetries: 3}
    conf.PoolConfig.HostSelectionPolicy = gocql.TokenAwareHostPolicy(gocql.HostPoolHostPolicy(
        hostpool.NewEpsilonGreedy(nil, 0, &hostpool.LinearEpsilonValueCalculator{}),
    ))

    db, err := conf.CreateSession()
    if err != nil {
        panic(err.Error())
    }

    // This query completes fine:
    itr1 := db.Query(`SELECT * FROM mytable WHERE id = ?`, "something").Iter()
    var id string
    for itr1.Scan(&id) {
    }
    if err := itr1.Close(); err != nil {
        panic(err.Error())
    }

    // This one hangs forever
    itr2 := db.Query(`SELECT * FROM mytable WHERE id = ?`, "").Iter()
    for itr2.Scan(&id) {
    }
    if err := itr2.Close(); err != nil {
        panic(err.Error())
    }
}
alourie commented 5 years ago

@dankinder I just ran this and it doesn't hang on my system. Should I do anything else for this to work (I mean fail ;-) )?

Also, I don't see why this would even be a problem. The Pick call is used by the query_executor to find the host to run the query against: it basically goes over all the valid (active) hosts in the pool and uses them to connect to the DB. After the specified number of retries, the query will stop (unless unlimited retries are set via the retry policy). So why would this hang at all?

dankinder commented 5 years ago

Should have noted a few more details here: I'm using Go 1.12.1 on CentOS 7, with Cassandra 2.2.8.

    github.com/bitly/go-hostpool v0.0.0-20171023180738-a3a6125de932 // indirect
    github.com/bmizerany/assert v0.0.0-20160611221934-b7ed37b82869 // indirect
    github.com/gocql/gocql v0.0.0-20190402132108-0e1d5de854df
    github.com/hailocab/go-hostpool v0.0.0-20160125115350-e80d13ce29ed
    github.com/kr/pretty v0.1.0 // indirect
    github.com/stretchr/testify v1.3.0 // indirect

Honestly I haven't dug into why this hangs, but I just tried the code I pasted in a little module with this dependency list, and it hangs. Can you reproduce it if you use those same package versions?

alourie commented 5 years ago

@dahankzter tried that, using Cassandra 2.2.8, and it still doesn't hang. I still don't follow the logic as to why it would hang at all, unless unlimited retries have been set via a retry policy.

dankinder commented 5 years ago

Interestingly, I ran this on a Mac and it worked fine, but it seems to lock up on CentOS 7. Did you try a Linux system?

Running my test program in delve I saw this backtrace:

 0  0x000000000045149e in time.now
    at /usr/local/go/src/runtime/timestub.go:17
 1  0x00000000004cd7cd in time.Now
    at /usr/local/go/src/time/time.go:1088
 2  0x00000000006bcfdc in github.com/hailocab/go-hostpool.(*epsilonGreedyHostPool).getEpsilonGreedy
    at /home/dkinder/go/pkg/mod/github.com/hailocab/go-hostpool@v0.0.0-20160125115350-e80d13ce29ed/epsilon_greedy.go:143
 3  0x00000000006bccf2 in github.com/hailocab/go-hostpool.(*epsilonGreedyHostPool).Get
    at /home/dkinder/go/pkg/mod/github.com/hailocab/go-hostpool@v0.0.0-20160125115350-e80d13ce29ed/epsilon_greedy.go:117
 4  0x0000000000747176 in github.com/gocql/gocql.(*hostPoolHostPolicy).Pick.func1
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/policies.go:699
 5  0x0000000000746fae in github.com/gocql/gocql.(*tokenAwareHostPolicy).Pick.func1
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/policies.go:588
 6  0x0000000000734c15 in github.com/gocql/gocql.(*queryExecutor).do
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/query_executor.go:141
 7  0x000000000073422d in github.com/gocql/gocql.(*queryExecutor).executeQuery
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/query_executor.go:60
 8  0x0000000000739183 in github.com/gocql/gocql.(*Session).executeQuery
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/session.go:407
 9  0x000000000073cf31 in github.com/gocql/gocql.(*Query).Iter
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/session.go:1103
10  0x0000000000756e6f in main.main
    at ./main.go:44
11  0x00000000004338ff in runtime.main
    at /usr/local/go/src/runtime/proc.go:200
12  0x00000000004604b1 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1337

I stepped through it and, strangely, it returned successfully.

I did it again and stopped it here:

received SIGINT, stopping process (will not forward signal)
> github.com/hailocab/go-hostpool.(*hostEntry).getWeightedAverageResponseTime() /home/dkinder/go/pkg/mod/github.com/hailocab/go-hostpool@v0.0.0-20160125115350-e80d13ce29ed/host_entry.go:50 (PC: 0x6bde7f)
    45:     var lastValue float64
    46: 
    47:     // start at 1 so we start with the oldest entry
    48:     for i := 1; i <= epsilonBuckets; i += 1 {
    49:         pos := (h.epsilonIndex + i) % epsilonBuckets
=>  50:         bucketCount := h.epsilonCounts[pos]
    51:         // Changing the line below to what I think it should be to get the weights right
    52:         weight := float64(i) / float64(epsilonBuckets)
    53:         if bucketCount > 0 {
    54:             currentValue := float64(h.epsilonValues[pos]) / float64(bucketCount)
    55:             value += currentValue * weight
(dlv) bt
 0  0x00000000006bde7f in github.com/hailocab/go-hostpool.(*hostEntry).getWeightedAverageResponseTime
    at /home/dkinder/go/pkg/mod/github.com/hailocab/go-hostpool@v0.0.0-20160125115350-e80d13ce29ed/host_entry.go:50
 1  0x00000000006bd0e8 in github.com/hailocab/go-hostpool.(*epsilonGreedyHostPool).getEpsilonGreedy
    at /home/dkinder/go/pkg/mod/github.com/hailocab/go-hostpool@v0.0.0-20160125115350-e80d13ce29ed/epsilon_greedy.go:147
 2  0x00000000006bccf2 in github.com/hailocab/go-hostpool.(*epsilonGreedyHostPool).Get
    at /home/dkinder/go/pkg/mod/github.com/hailocab/go-hostpool@v0.0.0-20160125115350-e80d13ce29ed/epsilon_greedy.go:117
 3  0x0000000000747176 in github.com/gocql/gocql.(*hostPoolHostPolicy).Pick.func1
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/policies.go:699
 4  0x0000000000746fae in github.com/gocql/gocql.(*tokenAwareHostPolicy).Pick.func1
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/policies.go:588
 5  0x0000000000734c15 in github.com/gocql/gocql.(*queryExecutor).do
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/query_executor.go:141
 6  0x000000000073422d in github.com/gocql/gocql.(*queryExecutor).executeQuery
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/query_executor.go:60
 7  0x0000000000739183 in github.com/gocql/gocql.(*Session).executeQuery
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/session.go:407
 8  0x000000000073cf31 in github.com/gocql/gocql.(*Query).Iter
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/session.go:1103
 9  0x0000000000756e6f in main.main
    at ./main.go:44
10  0x00000000004338ff in runtime.main
    at /usr/local/go/src/runtime/proc.go:200
11  0x00000000004604b1 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1337

This time I let it continue after getting the backtrace and it just sits there hanging. After about 30 seconds I stopped it again and saw this very similar backtrace:

0  0x00000000006bdf0f in github.com/hailocab/go-hostpool.(*hostEntry).getWeightedAverageResponseTime
    at /home/dkinder/go/pkg/mod/github.com/hailocab/go-hostpool@v0.0.0-20160125115350-e80d13ce29ed/host_entry.go:48
 1  0x00000000006bd0e8 in github.com/hailocab/go-hostpool.(*epsilonGreedyHostPool).getEpsilonGreedy
    at /home/dkinder/go/pkg/mod/github.com/hailocab/go-hostpool@v0.0.0-20160125115350-e80d13ce29ed/epsilon_greedy.go:147
 2  0x00000000006bccf2 in github.com/hailocab/go-hostpool.(*epsilonGreedyHostPool).Get
    at /home/dkinder/go/pkg/mod/github.com/hailocab/go-hostpool@v0.0.0-20160125115350-e80d13ce29ed/epsilon_greedy.go:117
 3  0x0000000000747176 in github.com/gocql/gocql.(*hostPoolHostPolicy).Pick.func1
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/policies.go:699
 4  0x0000000000746fae in github.com/gocql/gocql.(*tokenAwareHostPolicy).Pick.func1
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/policies.go:588
 5  0x0000000000734c15 in github.com/gocql/gocql.(*queryExecutor).do
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/query_executor.go:141
 6  0x000000000073422d in github.com/gocql/gocql.(*queryExecutor).executeQuery
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/query_executor.go:60
 7  0x0000000000739183 in github.com/gocql/gocql.(*Session).executeQuery
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/session.go:407
 8  0x000000000073cf31 in github.com/gocql/gocql.(*Query).Iter
    at /home/dkinder/go/pkg/mod/github.com/gocql/gocql@v0.0.0-20190423091413-b99afaf3b163/session.go:1103
 9  0x0000000000756e6f in main.main
    at ./main.go:44
10  0x00000000004338ff in runtime.main
    at /usr/local/go/src/runtime/proc.go:200
11  0x00000000004604b1 in runtime.goexit
    at /usr/local/go/src/runtime/asm_amd64.s:1337

So I suppose the issue may be in go-hostpool, but that's not certain.

Sorry, I don't have time to go further with this right now; if others can't reproduce it, I may pick it back up later and dive a little deeper.

NWilson commented 5 years ago

@alourie

The Pick call is used by the query_executor to find the host to run the query with, it just basically goes over all the valid (active) hosts in the pool, and uses them to connect to the DB.

The issue is that it doesn't, for the host-pool policy. It returns the host pool's best host again and again. The NextHost it returns carries no state, which violates the expectation of callers, who expect to be able to call it until it returns nil. The host-pool NextHost function never returns nil; it just samples the pool again and again.
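The contract being violated can be sketched with a stateful iterator of the kind the other policies produce. These names (`SelectedHost`, `NextHost`, `roundRobinPick`) are illustrative simplifications, not the actual gocql types:

```go
package main

import "fmt"

// Simplified stand-ins for the gocql types, for illustration only.
type SelectedHost = string
type NextHost func() *SelectedHost

// roundRobinPick sketches the contract callers of NextHost rely on: the
// closure carries an index, yields each host once, then returns nil.
func roundRobinPick(hosts []SelectedHost) NextHost {
	i := 0
	return func() *SelectedHost {
		if i >= len(hosts) {
			return nil // exhausted: callers rely on this to stop iterating
		}
		h := hosts[i]
		i++
		return &h
	}
}

func main() {
	next := roundRobinPick([]SelectedHost{"a", "b", "c"})
	for h := next(); h != nil; h = next() {
		fmt.Println(*h)
	}
}
```

A caller draining this iterator visits each host once and then stops; the host-pool policy's stateless closure never reaches the nil case, which is why the token-aware wrapper loops forever.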