aws / aws-sdk-js-v3

Modularized AWS SDK for JavaScript.
Apache License 2.0

ScanCommand with no Limit set only ever reads one page of data #6043

Open mn-prp opened 5 months ago

mn-prp commented 5 months ago

Checkboxes for prior research

Describe the bug

Using ScanCommand without a Limit results in only one page of data being read and no LastEvaluatedKey being returned to resume from. The documentation clearly states that when the 1 MB page limit is hit, the response should include a LastEvaluatedKey.

I am (now) aware that this functionality can be achieved using paginateScan, but I think it is still a bug in ScanCommand as a standalone.

SDK version number

@aws-sdk/lib-dynamodb@npm:3.565.0

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

Node v20.12.0

Reproduction Steps

Here is what I wrote. Note from my comment below that adding or removing the Limit changes the behavior of this code, even though I do not expect it to.

const scanFrom = (lastEvaluatedKey?: Record<string, any>) =>
      client.send(
        new ScanCommand({
          TableName: this.tableName,
          FilterExpression: 'begins_with(sk, :sk)',
          ExpressionAttributeValues: {
            ':sk': sk
          },
          ExclusiveStartKey: lastEvaluatedKey,
          // Limit: 1000 <-- uncommenting this line (with any value other
          // than 0 or undefined) makes pagination work; without it, only
          // one page of data is returned and no LastEvaluatedKey is set
        })
      )

    let results: ScanCommandOutput['Items'] = []

    let page = await scanFrom(undefined)

    // Accumulate every page; stop once no LastEvaluatedKey is returned.
    for (;;) {
      results = results.concat(page.Items ?? [])
      if (page.LastEvaluatedKey === undefined) break
      page = await scanFrom(page.LastEvaluatedKey)
    }

    return results

Observed Behavior

As the comment in the snippet indicates, whether a LastEvaluatedKey is returned depends on whether Limit is set. When Limit is omitted, we always get only one page of data (not the complete scan results after paginating).

Expected Behavior

This is the behavior I was trying to achieve (i.e., get everything from the table matching the filter expression), but using the raw ScanCommand instead:

    const paginator = paginateScan(
      { client },
      {
        TableName: this.tableName,
        FilterExpression: 'begins_with(sk, :sk)',
        ExpressionAttributeValues: {
          ':sk': sk
        }
      }
    )

    let results: ScanCommandOutput['Items'] = []
    for await (const page of paginator) {
      results = results.concat(page.Items ?? [])
    }

    return results
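(For what it's worth, the accumulation above works against any async iterable of pages, so the pattern can be sanity-checked without a live table. The stub below is hypothetical illustration, not SDK code.)

```typescript
// Hypothetical page shape and stub paginator standing in for paginateScan.
type Page = { Items?: Record<string, unknown>[] }

async function* stubPaginator(): AsyncGenerator<Page> {
  yield { Items: [{ pk: 'a' }, { pk: 'b' }] }
  yield { Items: undefined } // a page may legitimately have no Items
  yield { Items: [{ pk: 'c' }] }
}

// Collect every item from every page, tolerating pages without Items.
async function collectAll(pages: AsyncIterable<Page>): Promise<Record<string, unknown>[]> {
  let results: Record<string, unknown>[] = []
  for await (const page of pages) {
    results = results.concat(page.Items ?? [])
  }
  return results
}

collectAll(stubPaginator()).then((r) => console.log(r.length)) // 3
```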

Possible Solution

No response

Additional Information/Context

No response

aBurmeseDev commented 4 months ago

Hi @mn-prp - thanks for reaching out.

I'm not able to reproduce this on my end and wanted to verify a few things with you. The expected behavior is that Limit applies to non-paginated requests, whereas pageSize specifies the number of results per page when paginating.

If you'd like to limit the total returned results, here's what I'd do:

const paginator = paginateScan(
        {
            client: DDBClient,
            pageSize: 1,
        },
        {
            TableName: tableName,
            FilterExpression: 'begins_with(sk, :sk)',
            ExpressionAttributeValues: {
                ':sk': sk
            }
        }
    );

    const LIMIT = 2;
    let count = 0;

    for await (const page of paginator) {
        if (++count >= LIMIT) {
            break;
        }
    }
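(Illustratively, the same page-capping idea can be checked offline against a stubbed paginator; the stub and helper below are hypothetical names for the sketch, not SDK APIs.)

```typescript
type Page = { Items?: Record<string, unknown>[] }

// Stub paginator yielding three single-item pages (a pageSize: 1 analogue).
async function* stubPages(): AsyncGenerator<Page> {
  yield { Items: [{ pk: 'a' }] }
  yield { Items: [{ pk: 'b' }] }
  yield { Items: [{ pk: 'c' }] }
}

// Stop iterating once `limit` items have been collected across pages.
async function takeItems(pages: AsyncIterable<Page>, limit: number) {
  const results: Record<string, unknown>[] = []
  for await (const page of pages) {
    results.push(...(page.Items ?? []))
    if (results.length >= limit) break
  }
  return results.slice(0, limit)
}

takeItems(stubPages(), 2).then((r) => console.log(r.length)) // 2
```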

Hope that makes sense but let me know if you have any further questions. Best, John

mn-prp commented 4 months ago

Thanks for looking into this. My use case is to retrieve all items, without any (final) pagination, since the goal of the query is to sum some values across the whole table for an internal analytics job that is run periodically.

I don't have the time to create a reproduction, since it obviously depends on a populated database, but I can comment on the context a bit. We have two environments, one staging and one production; the staging environment has a table ~1 MB in size, whereas production's is ~5 MB. The "total" values in both tables were known to be increasing, but while the staging total did continue to increase, we noticed that the production query's total was stagnating around some value (let's say 200,000 -- we were seeing totals of 210,000, then 209,000, then 197,000, then 201,000, etc.). This led me to think some page-size cap was being hit.

So I went and looked at the SDK call, which is the first snippet in my issue. After adding Limit: 1000, I got the expected much higher total. Then I tried Limit: 50 and again got the expected much higher value (but with many more network requests, as my logging showed). When I removed the line, however, the value dropped back down to ~200,000, and my logging revealed that only one loop iteration was executed, with no LastEvaluatedKey.

Is it expected that Limit is required to read the full data set from a multi-megabyte table? I would have expected that not setting Limit would either (a) retrieve the maximum page size and return a LastEvaluatedKey, or (b) make the SDK continue fetching the next page automatically (appending it to the result set so far) until there were no more pages.