Robust read while single node lost (reboot)

dronnix commented 2 years ago

Hi!

I have an Aerospike cluster with replication factor 2. When one of the nodes is hard rebooted, the client's batch read fails with a "Partition not available" error, and then a scan over a large subset of data fails with "recordset has already been closed or cancelled". Could you please suggest client settings that allow seamless batch reads, without errors, while one node is rebooting?

Current read settings:

    scanPolicy.ConcurrentNodes = true
    scanPolicy.IncludeBinData = false
    scanPolicy.ReadModeSC = aero.ReadModeSCAllowUnavailable
    scanPolicy.Priority = aero.HIGH
    scanPolicy.SendKey = true
    scanPolicy.MaxRecords = int64(keysNumber + keysToSkip + 42)
    scanPolicy.TotalTimeout = 8 * time.Second
    scanPolicy.FailOnClusterChange = false
    scanPolicy.ReplicaPolicy = aero.SEQUENCE

    batchPolicy.ConcurrentNodes = 1
    batchPolicy.ReplicaPolicy = aero.SEQUENCE
    batchPolicy.Priority = aero.HIGH
    batchPolicy.ReadModeSC = aero.ReadModeSCAllowUnavailable
    batchPolicy.AllowPartialResults = true
    batchPolicy.TotalTimeout = time.Second
    batchPolicy.MaxRetries = 3

khaf commented 2 years ago

Which version of the client and server are you using?

dronnix commented 2 years ago

Sorry. Server: 5.5.0.7 Client: v4.5.2

khaf commented 2 years ago

That client is very old. The new v5 client should have resolved these issues. It lives in the v5 branch, and you can use it via Go modules.

dronnix commented 2 years ago

Thank You!

After the update, I get the following errors on node reboot with the same settings:

ResultCode: MAX_RETRIES_EXCEEDED, Iteration: 3, InDoubt: false, Node: <nil>: command execution timed out on client: Exceeded number of retries. See `Policy.MaxRetries`.
ResultCode: MAX_ERROR_RATE, Iteration: 2, InDoubt: false, Node: A0 10.244.36.116:3000: Max errors limit reached for node
ResultCode: MAX_ERROR_RATE, Iteration: 1, InDoubt: false, Node: A0 10.244.36.116:3000: Max errors limit reached for node
ResultCode: MAX_ERROR_RATE, Iteration: 0, InDoubt: false, Node: A0 10.244.36.116:3000: Max errors limit reached for node
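For errors like these, one application-side option is to wrap the read in a small retry loop with backoff. This is only a minimal sketch, not something the client provides: `retryRead`, `errTransient` and `flakyRead` are hypothetical names, and deciding which client errors count as transient is left to the caller.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical marker for failures worth retrying.
// Mapping real client errors onto it is up to the application.
var errTransient = errors.New("transient cluster error")

// retryRead calls read up to attempts times, doubling the pause after each
// transient failure. Non-transient errors are returned immediately.
func retryRead(read func() error, attempts int, base time.Duration) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = read(); err == nil {
			return nil
		}
		if !errors.Is(err, errTransient) {
			return err
		}
		if i < attempts-1 {
			time.Sleep(base << uint(i)) // base, 2*base, 4*base, ...
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

// flakyRead simulates a read that fails the first `failures` calls.
// It exists only to exercise retryRead in this sketch.
func flakyRead(failures int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= failures {
			return errTransient
		}
		return nil
	}
}

func main() {
	err := retryRead(flakyRead(2), 3, 10*time.Millisecond)
	fmt.Println("read succeeded:", err == nil)
}
```

In a real program the `read` closure would call the batch read and translate the client's error into `errTransient` where appropriate.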

khaf commented 2 years ago

If you don't warm up your client, the first few requests will time out like this while the client is establishing new connections to the server. After that, there should be enough connections available that this should not happen except under heavy load. Is this the case here, or does the issue persist over time?
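A minimal sketch of warming up at startup, before serving traffic. To stay self-contained it does not import the client: `warmFunc`, `warmUpPool` and `fakeWarm` are hypothetical names, and the closure shown in the comment assumes the client's `WarmUp(count)` method, which reports how many connections it opened.

```go
package main

import (
	"errors"
	"fmt"
)

// warmFunc is a hypothetical adapter: open up to count connections and
// report how many were opened. With the real client this could be, e.g.:
//   func(n int) (int, error) { return client.WarmUp(n) }   // assumption
type warmFunc func(count int) (int, error)

// warmUpPool keeps calling warm until `want` connections are open or the
// attempt budget is spent. It returns the number of connections opened.
func warmUpPool(warm warmFunc, want, attempts int) (int, error) {
	opened := 0
	var lastErr error
	for i := 0; i < attempts && opened < want; i++ {
		n, err := warm(want - opened)
		opened += n
		if err != nil {
			lastErr = err
		}
	}
	if opened < want {
		if lastErr == nil {
			lastErr = errors.New("no error reported")
		}
		return opened, fmt.Errorf("warmed %d of %d connections: %w", opened, want, lastErr)
	}
	return opened, nil
}

// fakeWarm stands in for a client in this sketch: it opens at most
// 10 connections per call and reports an error when it falls short.
func fakeWarm(count int) (int, error) {
	if count > 10 {
		return 10, errors.New("partial warm-up")
	}
	return count, nil
}

func main() {
	n, err := warmUpPool(fakeWarm, 25, 5)
	fmt.Println(n, err)
}
```

Running the warm-up once right after creating the client, and only then starting to issue reads, avoids paying the connection setup cost on the first requests.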

dronnix commented 2 years ago

Thanks, going to check it.

dronnix commented 2 years ago

Now it works fine in my case.

aerospike / aerospike-client-go

Robust read while single node lost(reboot) #364