awslabs / aws-lambda-redshift-loader

Amazon Redshift Database Loader implemented in AWS Lambda

files not loaded because they may be in 'locked' state (though they aren't) #235

Closed: tomekzbrozek closed this issue 3 years ago

tomekzbrozek commented 3 years ago

More than a week ago, my loader unexpectedly stopped COPYing files to Redshift, producing the following errors:

warn: Batch 397beb25-b868-43ec-a013-d60a683fb54f still current after configuration reload attempt 95. Recycling in 200 ms.

☝️ this is retried 100 times in a row; the log above shows attempt number 95.

Then, after the 100th attempt, it throws this error:

error: Unable to write my-s3-path/2021/09/28/events_v4-7f1a88a3-0055-48ff-9d16-043b41b28f3b.json in 100 attempts. Failing further processing to Batch 397beb25-b868-43ec-a013-d60a683fb54f which may be stuck in 'locked' state. If so, unlock the back using `node unlockBatch.js <batch ID>`, delete the processed file marker with `node processedFiles.js -d <filename>`, and then re-store the file in S3

error: Unable to send failure notifications

error: Lambda Redshift Loader unable to write to Open Pending Batch

{
    "errorType": "Error",
    "errorMessage": "error",
    "stack": [
        "Error: error",
        "    at _homogeneousError (/var/runtime/CallbackContext.js:12:12)",
        "    at postError (/var/runtime/CallbackContext.js:29:54)",
        "    at done (/var/runtime/CallbackContext.js:58:7)",
        "    at Object.done (/var/runtime/CallbackContext.js:106:16)",
        "    at /var/task/index.js:523:41",
        "    at /var/task/node_modules/async/dist/async.js:325:20",
        "    at check (/var/task/node_modules/async/dist/async.js:4463:32)",
        "    at /var/task/index.js:347:17",
        "    at Timeout.next [as _onTimeout] (/var/task/node_modules/async/dist/async.js:4457:13)",
        "    at listOnTimeout (internal/timers.js:554:17)"
    ]
}

When I tried to unlock the batch, I got this message:

Batch 397beb25-b868-43ec-a013-d60a683fb54f cannot be unlocked as it is not in 'locked' or 'error' status
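
For context, the unlock attempt looked roughly like this (a sketch; the argument order is my assumption from the README, and <region> is a placeholder):

    # assumed invocation -- exact arguments may vary by loader version, see the project README
    node unlockBatch.js <region> 397beb25-b868-43ec-a013-d60a683fb54f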

We have 9 different configurations in DynamoDB (each configuration refers to a different S3 prefix of files), and only files from 2 of these configurations stopped being loaded (but within those two configurations, all files are affected by this error).

I'd appreciate any pointers. This seems similar to https://github.com/awslabs/aws-lambda-redshift-loader/issues/23 from 6 years ago, but it's not clear how that one was fixed. Thanks!

IanMeyers commented 3 years ago

Hello,

This means that the Batch ID linked to the S3 prefix was locked for loading by another thread but was never rotated. Most likely the thread that should have loaded the files crashed in processPendingBatch before it reached the batch rotation. To fix this, you need to do two things (sketched below):

  1. Assign the prefix a new Batch ID using resetCurrentBatch.js from the command line
  2. Unlock and resubmit the crashed Batch ID using reprocessBatch.js
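
A rough sketch of those two commands, assuming the usual <region>/<s3Prefix>/<batch ID> argument style (the exact arguments may differ, so please check the README):

    # 1. assumed invocation: assign the stuck prefix a fresh current Batch ID
    node resetCurrentBatch.js <region> <s3Prefix>

    # 2. assumed invocation: unlock the crashed batch and resubmit its input files
    node reprocessBatch.js <region> 397beb25-b868-43ec-a013-d60a683fb54f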

Sorry that you had this issue, but the above should resolve it.

Thx,

Ian

tomekzbrozek commented 3 years ago

Thanks @IanMeyers, it worked indeed!