Open chengc-sa opened 1 year ago
Can you elaborate a bit about your use case? How big is the namespace/set you are scanning? Do you have a particularly unstable network connection to the cluster? Can't you resume your scan by passing the same PartitionFilter? (PartitionFilters are basically a cursor. If you pass them again to the Scan command, they will resume the scan if it was not completed.)
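The cursor idea can be sketched without the client library. This is a simulation, not the Aerospike API: `cursor`, `scan`, `scanWithResume`, and `errNetwork` are hypothetical names that only model "remember what was finished, pass the same object back in, skip the finished part".

```go
package main

import (
	"errors"
	"fmt"
)

// cursor is a hypothetical stand-in for a partition filter: it remembers
// which partitions have been fully read so a later call can skip them.
type cursor struct {
	done map[int]bool
}

// errNetwork simulates a transient failure in the middle of a scan.
var errNetwork = errors.New("simulated network error")

// scan reads every partition not yet marked done in the cursor.
// failAt makes the call fail when it reaches that partition (-1 disables it).
func scan(c *cursor, partitions [][]string, failAt int, out *[]string) error {
	for id, records := range partitions {
		if c.done[id] {
			continue // already read by an earlier attempt
		}
		if id == failAt {
			return errNetwork
		}
		*out = append(*out, records...)
		c.done[id] = true
	}
	return nil
}

// scanWithResume retries with the same cursor until the scan completes or
// maxAttempts is exhausted. It returns the records and the attempts used.
func scanWithResume(partitions [][]string, failAt, maxAttempts int) ([]string, int, error) {
	c := &cursor{done: map[int]bool{}}
	var out []string
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = scan(c, partitions, failAt, &out); err == nil {
			return out, attempt, nil
		}
		failAt = -1 // the simulated failure happens only once
	}
	return out, maxAttempts, err
}

func main() {
	parts := [][]string{{"a", "b"}, {"c"}, {"d", "e"}}
	recs, attempts, err := scanWithResume(parts, 1, 3)
	fmt.Println(recs, attempts, err)
}
```

Here the first attempt fails at the second partition, and the second attempt continues from there, so every record is returned exactly once. With the real client, the equivalent would be reusing the same PartitionFilter value across calls, as described above.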
@khaf The namespace is about 110 GB. The sets that often time out are about 900 MB / 7 million records, 200 MB / 2 million records, 500 MB / 3.5 million records, and 500 MB / 2 million records respectively. The network connection should be pretty stable, since both clients and servers are hosted on AWS and connected within the same VPC. The errors occurred in the middle of the scan, coming from the <-chan *aerospike.Result inside *aerospike.Result.Err. In the meantime, I am also seeing EOF errors from the same channel. Are they related?
@khaf I am seeing a similar thing after updating from 6.4 to 6.10.
aerospike version: Aerospike Community Edition build 5.6.0.5
error running query: error iterating over records: ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB94D643559A1A8 10.10.2.231:3000: network error. Checked the wrapped error for detail
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB9F1653559A1A8 10.10.2.232:3000: network error. Checked the wrapped error for detail
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB98E623559A1A8 10.10.2.233:3000: network error. Checked the wrapped error for detail
ResultCode: NETWORK_ERROR, Iteration: 0, InDoubt: false, Node: BB98E623559A1A8 10.10.2.233:3000: network error. Checked the wrapped error for detail
ResultCode: TIMEOUT, Iteration: 0, InDoubt: false, Node: <nil>: Timeout
I scan using a Query with a FilterExpression over millions of records. Processing one record takes 200 ms to 3000 ms, but I do it with multiple goroutines.
	cp := aerospike.NewClientPolicy()
	cp.Timeout = 5 * time.Second
	cp.IdleTimeout = 30 * time.Second
	cp.ConnectionQueueSize = 1024
	cp.MinConnectionsPerNode = 512

	qp := aerospike.NewQueryPolicy()
	qp.IncludeBinData = true
	qp.RecordQueueSize = 16 * 1024
	qp.FilterExpression = aero.ExpEq(aero.ExpDigestModulo(shardCount), aero.ExpIntVal(shardID))

	statement := aero.NewStatement(r.namespace, r.set)
	rs, err := c.client.Query(qp, statement)
	if err != nil {
		return fmt.Errorf("error executing Query: %w", err)
	}

	var closeErr error
	closeOnce := sync.Once{}
	errGr, ctx := errgroup.WithContext(ctx)
	for i := 0; i < 768; i++ {
		errGr.Go(func() error {
			// The first worker to exit closes the recordset, exactly once.
			defer closeOnce.Do(func() { closeErr = rs.Close() })
			for result := range rs.Results() {
				if result.Err != nil {
					return result.Err
				}
				if ctx.Err() != nil {
					return nil
				}
				if err := processFunc(result); err != nil {
					return fmt.Errorf("process func returned an error: %w", err)
				}
			}
			return nil
		})
	}
	if err := errGr.Wait(); err != nil {
		return fmt.Errorf("error iterating over records: %w", err)
	}
	if closeErr != nil {
		// Wrap closeErr here; err from Query is nil at this point.
		return fmt.Errorf("error closing records chan: %w", closeErr)
	}
It looks like the error is returned from here:
	if result.Err != nil {
		return result.Err
	}
Also I see the following change https://github.com/aerospike/aerospike-client-go/commit/f0d28189f2f4b76d74b1c99f3ab038827fbe1e80
So what will the behaviour be when we run out of retries? How can I check whether the whole set has been read?
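One conservative way to approach the second question on the consumer side is to treat a scan as complete only if the results channel was drained until it closed and no result carried an error. The sketch below uses only the standard library. `result`, `drain`, and `errSimulated` are hypothetical stand-ins, and it assumes the client reports every failure through the result's error field, which this thread does not confirm.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// result is a hypothetical stand-in for a scan result: a record or an error.
type result struct {
	Record string
	Err    error
}

// errSimulated models a failure delivered through the results channel.
var errSimulated = errors.New("simulated timeout")

// drain consumes results with the given number of workers. It returns how
// many records were processed and the first error seen. A nil error means
// the channel was closed without any error result.
func drain(results <-chan result, workers int, process func(string) error) (int64, error) {
	var (
		processed int64
		firstErr  error
		errOnce   sync.Once
		wg        sync.WaitGroup
	)
	stop := make(chan struct{})
	// fail records the first error and tells the other workers to stop.
	fail := func(err error) {
		errOnce.Do(func() {
			firstErr = err
			close(stop)
		})
	}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-stop:
					return
				case r, ok := <-results:
					if !ok {
						return // channel closed: nothing left to read
					}
					if r.Err != nil {
						fail(r.Err)
						return
					}
					if err := process(r.Record); err != nil {
						fail(err)
						return
					}
					atomic.AddInt64(&processed, 1)
				}
			}
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&processed), firstErr
}

func main() {
	ch := make(chan result, 3)
	ch <- result{Record: "a"}
	ch <- result{Err: errSimulated}
	ch <- result{Record: "b"}
	close(ch)
	n, err := drain(ch, 1, func(string) error { return nil })
	fmt.Printf("processed=%d complete=%v err=%v\n", n, err == nil, err)
}
```

A non-nil error here means the processed count is only a lower bound on what was read, so the scan would need to be resumed or rerun before the set can be considered fully read.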
I am wondering which parameter I should configure to make it less likely to time out. I don't see a Timeout field in the ScanPolicy. I tried increasing ScanPolicy.SocketTimeout to 5 minutes, and this error still occurs. PS: I set MaxRetries to 10000.