Could you clarify timeouts for the Query?

un000 commented 10 months ago

I don't understand how to manage timeouts for Query. I have 20 retries configured with a sleep of 2 seconds between them, and I am still getting timeouts.

    cp := aerospike.NewClientPolicy()
    cp.Timeout = 5*time.Second
    cp.IdleTimeout = 10*time.Second

    qp := aerospike.NewQueryPolicy()
    qp.IncludeBinData = true
    qp.RecordQueueSize = 16 * 1024
    qp.FilterExpression = where
    qp.MaxRetries = 20
    qp.SleepBetweenRetries = 2*time.Second

    rs, err := c.client.Query(qp, statement)
    if err != nil {
        return fmt.Errorf("error executing Query: %w", err)
    }

    for result := range rs.Results() {
        if result.Err != nil {
            c.logger.Error("error scanning results", field.Error(result.Err))     // < Timeout error here
            continue
        }

        if err := processFunc(result); err != nil {
            return fmt.Errorf("process func returned error: %w", err)
        }
    }

Per-record processing time is 150-250 ms with 300 goroutines. What should I change to increase the timeout on the Aerospike side? After 20 retries, roughly 60-70 seconds into the run, the code fails.

AS: Aerospike Community Edition build 5.6.0.5 Client: v6.13.0

khaf commented 10 months ago

Do you know what causes the timeouts? Do you have an unstable cluster/network? I have a bit of trouble reproducing this issue. We have found a case where, in some default configurations, adding a new node to the cluster could exhaust the max retries, but I presume that's not what you are observing here.

un000 commented 10 months ago

@khaf the cluster is stable. It is connected over a 10GB local network and there are no issues except long partition scans.

khaf commented 10 months ago

Can you also include your Statement code? And the ExpressionFilter?

un000 commented 10 months ago

@khaf sure

    statement := aerospike.NewStatement(r.namespace, r.set)
    statement.Filter = aerospike.NewEqualFilter("intbin", 5555)

khaf commented 10 months ago

And the ExpressionFilter? How many records are there in the set? Do you have an estimate of how many records are going to be returned? And is it an in-memory or flash namespace?

un000 commented 10 months ago

set:

disable-eviction: "false"
ns: "namespace"
index_populating: "false"
objects: "36306534"
stop-writes-count: "0"
set: "setname"
enable-index: "false"
sindexes: "2"
memory_data_bytes: "33855994789"
device_data_bytes: "31576275424"
truncate_lut: "0"
tombstones: "0"

Indexes:

*************************** 1. row ***************************
ns: "namespace"
bin: "stringbin"
indextype: "NONE"
set: "setname"
state: "RW"
indexname: "stringbin_idx"
path: "stringbin"
type: "STRING"
*************************** 2. row ***************************
ns: "namespace"
bin: "stringbin"
indextype: "NONE"
set: "setname"
state: "RW"
indexname: "intbin_idx"
path: "stringbin"
type: "NUMERIC"

3-node setup with multicast

service {
    cluster-name cluster
    user aerospike
    group aerospike
    paxos-single-replica-limit 1
    proto-fd-max 15000
    migrate-threads 6
}

namespace namespace {
    memory-size 110G
    replication-factor 2
    default-ttl 0
    nsup-period 120
    storage-engine device {
        cold-start-empty true

        file /var/aerospike/a.p1.db
        file /var/aerospike/a.p2.db
        file /var/aerospike/a.p3.db
        file /var/aerospike/a.p4.db
        filesize 64G
        data-in-memory true
        write-block-size 128K
    }
    migrate-sleep 0
    defrag-sleep 0
}

Estimated: ~6 million records out of ~60 million in the set

ExpressionFilter isn't set

khaf commented 10 months ago

Thanks for the detailed info. I'm on it, though it may take a couple of days.

artursh commented 10 months ago

Hi. Looks like I have a similar problem. 3-node cluster (aerospike-server:5.7.0.24) in k8s. Local network. 1+ billion records in the set. In-memory storage.

    clientPolicy := aerospike.NewClientPolicy()
    clientPolicy.Timeout = 10 * time.Second
    clientPolicy.IdleTimeout = 20 * time.Second

    sp := aerospike.NewScanPolicy()
    sp.RecordQueueSize = 5000
    sp.IncludeBinData = false
    sp.MaxRetries = 10
    sp.SleepBetweenRetries = time.Second

    recordset, err := aeroClient.ScanAll()

Reading results in 10 threads. After processing 560 million records, I got this error:

ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: A0 10.206.195.132:3000: network error.

/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/connection.go:96 github.com/aerospike/aerospike-client-go/v6.errToAerospikeErr()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/connection.go:262 github.com/aerospike/aerospike-client-go/v6.(*Connection).Read()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/buffered_connection.go:92 github.com/aerospike/aerospike-client-go/v6.(*bufferedConn).readConn()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/buffered_connection.go:106 github.com/aerospike/aerospike-client-go/v6.(*bufferedConn).read()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/multi_command.go:250 github.com/aerospike/aerospike-client-go/v6.(*baseMultiCommand).readBytes()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/multi_command.go:202 github.com/aerospike/aerospike-client-go/v6.(*baseMultiCommand).parseKey()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/multi_command.go:292 github.com/aerospike/aerospike-client-go/v6.(*baseMultiCommand).parseRecordResults()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/multi_command.go:174 github.com/aerospike/aerospike-client-go/v6.(*baseMultiCommand).parseResult()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/command.go:2745 github.com/aerospike/aerospike-client-go/v6.(*baseCommand).executeAt()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/command.go:2570 github.com/aerospike/aerospike-client-go/v6.(*baseCommand).execute()
/go/pkg/mod/github.com/aerospike/aerospike-client-go/v6@v6.14.0/multi_command.go:415 github.com/aerospike/aerospike-client-go/v6.(*baseMultiCommand).execute()

I ran the app many times; it scans normally up to 560 million records and then always breaks at the same point, so the full scan never finishes. The cluster is stable and all nodes are alive. I tried running the scan at different times, when the cluster is not under high load.

odinsy commented 7 months ago

+1 Same problem for us

merlindeep commented 7 months ago

We have encountered the same issue

aerospike / aerospike-client-go

Could you clarify timeouts for the Query? #413