aws / aws-sdk-js-v3

Modularized AWS SDK for JavaScript.
Apache License 2.0

paginateScan on 16mb table is very slow #4457

Closed paul-uz closed 2 months ago

paul-uz commented 1 year ago

Checkboxes for prior research

Describe the bug

We are using a parallelised paginateScan to fetch all records from a table, which is currently ~16 MB with ~4500 items.

I take the table size in bytes, divide by 1,048,576 to get the size in MB, and round it up to get the total segments to use. I then use multiple calls, using the Segment and TotalSegments config options, to do the parallelisation.
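
For illustration, a minimal sketch of that calculation (the byte count here is made up; TableSizeBytes comes from DescribeTable):

// Illustrative numbers only: a ~16 MB table yields 16 segments
const tableSizeBytes = 16_252_928;                           // DescribeTable -> Table.TableSizeBytes
const totalSegments = Math.ceil(tableSizeBytes / 1_048_576); // bytes -> MB, rounded up => 16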

We are seeing cold boot times of 7 seconds and hot boot times of 4 seconds, which seems extremely slow.

Now, we built in support for sparse field sets in our API using the ProjectionExpression config option. We've noticed that by specifying only certain fields we get a much faster response time, but we also spotted that when we include some fields the response time starts creeping back up. These seemed to be objects with multiple levels of nesting, arrays of objects, etc. I think something fishy is going on here.
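
As a minimal sketch of what such a sparse field set looks like at the Scan params level (the attribute names here are hypothetical, not our real schema):

// Only "myID" and the nested "profile.address" are returned; everything else is omitted.
const params = {
    TableName: "MyTable",
    ProjectionExpression: "#myID, #profile.#address",
    ExpressionAttributeNames: {
        "#myID": "myID",
        "#profile": "profile",
        "#address": "address",
    },
};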

Does anyone have experience with paginateScan (parallelised or not) returning several thousand items in good time?

SDK version number

"@aws-sdk/client-dynamodb": "^3.229.0", "@aws-sdk/lib-dynamodb": "^3.229.0",

Which JavaScript Runtime is this issue in?

Node.js

Details of the browser/Node.js/ReactNative version

Node 18

Reproduction Steps

Write a parallel paginateScan and run it on a table with ~4500 items and a table size of ~16 MB.

this.parallelScan = async (fields) => {
            const results = [];
            const tableDescription = await this.getTableDescription();
            const tableSize = tableDescription.tableSize === 0 ? 1 : tableDescription.tableSize;
            const totalSegments = Math.ceil(tableSize / 1024 / 1024);
            const items = await Promise.all(Array.from(Array(totalSegments)).map((_, i) => {
                return this.getItems(ENV_VARS.MY_TABLE, i, totalSegments, fields);
            }));
            items.forEach((chunk) => {
                results.push(...chunk);
            });
            results.forEach((result, i) => {
                results[i] = result;
            });
            return results;
        };
        this.getTableDescription = async () => {
            const params = {
                TableName: ENV_VARS.MY_TABLE,
            };
            const command = new client_dynamodb_1.DescribeTableCommand(params);
            const response = await this.client.send(command);
            return {
                tableSize: response.Table.TableSizeBytes,
                itemCount: response.Table.ItemCount,
            };
        };
        this.getItems = async (tableName, segment, totalSegments, fields) => {
            const rows = [];
            const paginatorConfig = {
                client: this.ddbDocClient,
                pageSize: 200,
            };
            const params = {
                TableName: tableName,
                FilterExpression: 'attribute_exists(myID)',
                Segment: segment,
                TotalSegments: totalSegments,
            };
            if (fields) {
                params.ProjectionExpression = 'myID';
                fields.split(',').forEach((field) => {
                    if (field.split('.').length === 2) {
                        const nestedField = field.split('.');
                        params.ProjectionExpression += `,#${nestedField[0]}.#${nestedField[1]}`;
                        params.ExpressionAttributeNames = {
                            ...params.ExpressionAttributeNames,
                            [`#${nestedField[0]}`]: nestedField[0],
                            [`#${nestedField[1]}`]: nestedField[1],
                        };
                    }
                    else {
                        params.ProjectionExpression += `,#${field}`;
                        params.ExpressionAttributeNames = {
                            ...params.ExpressionAttributeNames,
                            [`#${field}`]: field,
                        };
                    }
                });
            }
            const paginator = (0, lib_dynamodb_1.paginateScan)(paginatorConfig, params);
            for await (const page of paginator) {
                rows.push(...page.Items);
            }
            return rows;
        };

Observed Behavior

DDB is slow.

Expected Behavior

I expect DDB to be fast, especially with a parallel paginateScan on such a small table.

Possible Solution

No response

Additional Information/Context

No response

RanVaknin commented 1 year ago

Hi @paul-uz ,

Thanks for opening this issue.

A couple of points right off the bat: you have quite a complex setup here, and without knowing your schema it's kind of hard to understand what you are going for, but I'll give it my best shot.

  1. I'm not sure why divvying up the concurrent reads by table size is the approach you went with to choose your number of segments. Can you elaborate?

  2. What is the purpose of this snippet from your code?

         results.forEach((result, i) => {
                results[i] = result;
            });

    It just looks like you're assigning each result to itself.

  3. I'm not sure why you are using this FilterExpression: 'attribute_exists(myID)'. I don't know what your schema looks like, but to me this looks like a primary key, which means it will always exist, making this filter costly and redundant.

  4. You are using Scan instead of Query. Even when dividing the reads into segments, you are reading the entire table with every concurrent request, so you are reading 4500 items 16 times (one entire table read per segment). This is very wasteful in terms of both performance and billing. If my assumption about point 3 is correct, then running a Query instead of a Scan is going to be a lot more performant.

  5. With regards to cold/hot boot times, I assume you are talking about Lambda? This is outside the SDK's realm of responsibility. However, @trivikr recently published a blog post comparing v2 and v3 Lambda boot times and how to improve them.

In conclusion, it's really hard for me to advise you on your setup, both because your use case is not clear to me and because I'm not a DynamoDB expert. I can speak to the SDK side of using Dynamo, but to get a better idea of how to improve your table's performance I'd suggest consulting the Dynamo docs, or perhaps opening a support ticket in the AWS console and asking to be routed to a DynamoDB team member for more in-depth questions.

If there's anything else I can do to help, please let me know. Thanks, Ran

paul-uz commented 1 year ago

I am using a segmented scan (see the Segment and TotalSegments config options); I was under the impression this would only scan a portion of the entire table, meaning I am splitting the table into 16 segments and reading each one once to read the whole table. Is that not how segments work?

I did in fact try using a Query, and the results were roughly the same.

The code in parallelScan you are asking about simply takes the chunked results and turns them into a flat array of objects.

You could remove a lot of the code and use just the AWS paginateScan() method, and you'd probably see the same results with regards to the time to fetch the full table.

RanVaknin commented 1 year ago

Hi @paul-uz ,

Thanks for the follow-up. I checked the docs on segmentation again, and you are totally right. I had only referred to the Scan docs, which mention that a Scan reads the entire table every time.

You could remove a lot of the code and use just the AWS paginateScan() method, and you'd probably see the same results with regards to the time to fetch the full table.

I tried running my own simplified implementation from a local Node.js environment to get an isolated picture of the SDK's performance, just running a paginateScan:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, paginateScan } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({ region:"us-east-1" });
const ddbDocClient = DynamoDBDocumentClient.from(client);

const tableName = "paginationTable";

const getItemsWithoutFilter = async (segment, totalSegments) => {
    const rows = [];
    const params = {
        TableName: tableName,
        Segment: segment,
        TotalSegments: totalSegments
    };

    const paginator = paginateScan({ client: ddbDocClient, pageSize: 200 }, params);
    for await (const page of paginator) {
        rows.push(...page.Items);
    }
    return rows;
};

// driver code
const fetchConcurrently = async () => {
    const totalSegments = 4;

    const allResults = await Promise.all([
        getItemsWithoutFilter(0, totalSegments),
        getItemsWithoutFilter(1, totalSegments),
        getItemsWithoutFilter(2, totalSegments),
        getItemsWithoutFilter(3, totalSegments)
    ]);

    const combinedResults = allResults.flat();

    return combinedResults;
};

const startWithoutFilter = Date.now();
const resultsWithoutFilter = await fetchConcurrently();
const endWithoutFilter = Date.now();
console.log(`Time taken: ${endWithoutFilter - startWithoutFilter}ms`);
console.log(`Number of items: ${resultsWithoutFilter.length}`);

Test results:

$ node sample.mjs
Time taken: 1034ms
Number of items: 3314

Personally, 1034ms sounds low to me, which tells me that the core of your issue might be the cold start, so the solution mentioned in the blog post, bundling your own SDK and uploading it to Lambda, might be a good fix.

Let me know your thoughts. Thanks, Ran~

paul-uz commented 1 year ago

What was your total record count and table size?

RanVaknin commented 1 year ago

Hi @paul-uz

It's item count 3,314, table size 1.1 MB.

The problem with simulating this is that Dynamo doesn't calculate the total size of the table on demand. If you set up a new table and load it with items, it will take a day or two for the table size to be populated internally (probably via a scheduled job).

Can you give me an idea of what a single item in your table might look like? Like how many columns and which data types? That would help me mimic the same size and count.

Thanks, Ran~

paul-uz commented 1 year ago

But you can use DescribeTable to get the live size.

Sadly I cannot share the data. It's a few fields of JSON objects and other single data types.

The main thing is that my table size was 16 MB.

RanVaknin commented 1 year ago

Hi @paul-uz ,

Not immediately. I set this all up on Friday and ran aws dynamodb describe-table ..., and that data wasn't available shortly after the putItem commands.

I'll give it another go on Tuesday with some more complex nested objects and will update you.

Ran~

RanVaknin commented 3 months ago

Hi @paul-uz ,

Sorry for the late response.

I have now reproduced this again with a 16 MB table and still get fairly fast results:

import { DynamoDBClient, DescribeTableCommand, ScanCommand } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, paginateScan } from "@aws-sdk/lib-dynamodb";

const client = new DynamoDBClient({ region: "us-east-1" });
const ddbDocClient = DynamoDBDocumentClient.from(client);

const tableName = "TestTable";

const getItemsWithProjectionAndFilter = async (segment, totalSegments) => {
    const rows = [];
    const params = {
        TableName: tableName,
        ProjectionExpression: "#myID, #column2, #column3",
        ExpressionAttributeNames: {
            "#myID": "ID", 
            "#column2": "column2",
            "#column3": "column3"
        },
        FilterExpression: 'attribute_exists(#myID)',    
        Segment: segment,
        TotalSegments: totalSegments
    };

    const paginator = paginateScan({ client: ddbDocClient, pageSize: 200 }, params);
    for await (const page of paginator) {
        rows.push(...page.Items);
    }
    return rows;
};

const fetchConcurrently = async () => {
    const totalSegments = 4;
    const allResults = await Promise.all([
        getItemsWithProjectionAndFilter(0, totalSegments),
        getItemsWithProjectionAndFilter(1, totalSegments),
        getItemsWithProjectionAndFilter(2, totalSegments),
        getItemsWithProjectionAndFilter(3, totalSegments)
    ]);

    const combinedResults = allResults.flat();
    return combinedResults;
};

(async () => {
    console.log("Testing with FilterExpression...");
    const start = Date.now();
    const results = await fetchConcurrently();
    const end = Date.now();
    console.log(`Time taken with FilterExpression: ${end - start}ms`);
    console.log(`Number of items fetched with FilterExpression: ${results.length}`);
})();

/*
Testing with FilterExpression...
Time taken with FilterExpression: 856ms
Number of items fetched with FilterExpression: 4368
*/

You mentioned cold start and warm start times earlier in the correspondence, and you didn't address my question about whether you're running your code from Lambda. The terminology you used, "cold boot / warm boot", is not something I would use to describe Dynamo; it relates more to Lambda's setup time to provision the container the function runs in. If you are running this paginateScan call through Lambda, then the overhead you are seeing is likely related to Lambda, and one way to mitigate it is to bundle, minify, and provide your own SDK to Lambda.
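
For example, a minimal esbuild-based bundling sketch (the file names and exact option set here are assumptions, not part of this repro):

// build.mjs - bundles the handler and the @aws-sdk/* packages it imports into one minified file
import { build } from "esbuild";

await build({
    entryPoints: ["handler.mjs"],   // hypothetical Lambda handler entry point
    bundle: true,                   // include @aws-sdk/* in the bundle instead of relying on the runtime-provided SDK
    minify: true,
    platform: "node",
    target: "node18",
    format: "esm",
    outfile: "dist/handler.mjs",
});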

Thanks, Ran~

paul-uz commented 3 months ago

I need to retest as it's been a while.

But can we clear up the scan operation and segments?

Is it simply not worth doing a parallel paginated scan, as with each segment block, the entire table is scanned each time?

This seems like it shouldn't be the case, otherwise what is the point of segments?

Sadly, in some cases we can't do a Query, as not all our tables contain queryable data, i.e. we only have a unique primary key and no sort key and need to get all records.

RanVaknin commented 2 months ago

Hey @paul-uz ,

But can we clear up the scan operation and segments?

Is it simply not worth doing a parallel paginated scan, as with each segment block, the entire table is scanned each time?

Yes, as I mentioned earlier, it was my misunderstanding of how segmented scanning works, since I referred to the wrong docs. A segmented scan does not read the entire table; each segment reads only its own portion of the table.

Here is how I would showcase that in the code:

const getItemsWithProjectionAndFilter = async (segment, totalSegments) => {
    const rows = [];
    const params = {
        TableName: tableName,
        ProjectionExpression: "#myID, #column2, #column3",
        ExpressionAttributeNames: {
            "#myID": "ID", 
            "#column2": "column2",
            "#column3": "column3"
        },
        FilterExpression: 'attribute_exists(#myID)',    
        Segment: segment,
        TotalSegments: totalSegments
    };

    const paginator = paginateScan({ client: ddbDocClient, pageSize: 200 }, params);
    for await (const page of paginator) {
        rows.push(...page.Items);
    }
    console.log(`Segment ${segment}: Fetched ${rows.length} items`);
    return rows;
};

/*
Testing with FilterExpression...
Segment 2: Fetched 1123 items
Segment 0: Fetched 1024 items
Segment 1: Fetched 1082 items
Segment 3: Fetched 1139 items
Time taken with FilterExpression: 869ms
Number of items fetched with FilterExpression: 4368
*/

Sadly, in some cases we can't do a Query, as not all our tables contain queryable data, i.e. we only have a unique primary key and no sort key and need to get all records.

The code and reproduction I shared were for Scan, not Query. FWIW, my own table only has a primary key and no sort key.

Thanks, Ran~

github-actions[bot] commented 2 months ago

This issue has not received a response in 1 week. If you still think there is a problem, please leave a comment to avoid the issue from automatically closing.

github-actions[bot] commented 1 month ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.