Aerospike scan returns error "command execution timed out on client: See `Policy.Timeout`"

chengc-sa commented 1 year ago

Which parameter should I configure to make the scan less likely to time out? I don't see a Timeout field in ScanPolicy. I tried increasing ScanPolicy.SocketTimeout to 5 minutes, but the error still occurs.

PS: I set MaxRetries to 10000.

khaf commented 1 year ago

Can you elaborate a bit about your use case? How big is the namespace/set you are scanning? Do you have a particularly unstable network connection to the cluster? Can't you resume your scan by passing the same PartitionFilter? (PartitionFilters are basically a cursor. If you pass them again to the Scan command, they will resume if the scan was not completed)

chengc-sa commented 1 year ago

@khaf The namespace is about 110 GB. The sets that often time out are about 900 MB / 7 million records, 200 MB / 2 million records, 500 MB / 3.5 million records, and 500 MB / 2 million records respectively. The network connection should be pretty stable, since both clients and servers are hosted on AWS and connected within the same VPC. The errors occur in the middle of the scan: they arrive on the <-chan *aerospike.Result, in the Err field of the *aerospike.Result. I am also seeing EOF errors coming from the same channel. Are they related?

un000 commented 1 year ago

@khaf I see a similar thing after updating from 6.4 to 6.10.

aerospike version: Aerospike Community Edition build 5.6.0.5

    error running query: error iterating over records: ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB94D643559A1A8 10.10.2.231:3000: network error. Checked the wrapped error for detail
    ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9F1653559A1A8 10.10.2.232:3000: network error. Checked the wrapped error for detail
    ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB98E623559A1A8 10.10.2.233:3000: network error. Checked the wrapped error for detail
    ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB98E623559A1A8 10.10.2.233:3000: network error. Checked the wrapped error for detail
    ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout

I scan using a Query with a FilterExpression over millions of records. Processing one record takes 200 ms to 3000 ms, but I process records in multiple goroutines.

    cp := aerospike.NewClientPolicy()
    cp.Timeout = 5 * time.Second
    cp.IdleTimeout = 30 * time.Second
    cp.ConnectionQueueSize = 1024
    cp.MinConnectionsPerNode = 512

    qp := aerospike.NewQueryPolicy()
    qp.IncludeBinData = true
    qp.RecordQueueSize = 16 * 1024
    // Keep only the records whose digest falls into this shard.
    qp.FilterExpression = aerospike.ExpEq(aerospike.ExpDigestModulo(shardCount), aerospike.ExpIntVal(shardID))

    statement := aerospike.NewStatement(r.namespace, r.set)

    rs, err := c.client.Query(qp, statement)
    if err != nil {
        return fmt.Errorf("error executing Query: %w", err)
    }

    var closeErr error
    closeOnce := sync.Once{}
    errGr, ctx := errgroup.WithContext(ctx)
    for i := 0; i < 768; i++ {
        errGr.Go(func() error {
            defer closeOnce.Do(func() { closeErr = rs.Close() })
            for result := range rs.Results() {
                if result.Err != nil {
                    return result.Err
                }

                if ctx.Err() != nil {
                    return nil
                }

                if err := processFunc(result); err != nil {
                    return fmt.Errorf("process func returned an error: %w", err)
                }
            }

            return nil
        })
    }

    if err := errGr.Wait(); err != nil {
        return fmt.Errorf("error iterating over records: %w", err)
    }

    if closeErr != nil {
        return fmt.Errorf("error closing records chan: %w", closeErr)
    }

It looks like the error is returned by this check:

    if result.Err != nil {
        return result.Err
    }

un000 commented 1 year ago

Also I see the following change https://github.com/aerospike/aerospike-client-go/commit/f0d28189f2f4b76d74b1c99f3ab038827fbe1e80

So what will the behaviour be when we run out of retries? How can we check whether the whole set was read?

aerospike/aerospike-client-go, issue #396