Closed: BBopanna closed this issue 2 years ago
@ErikCH - Please note, thanks.
@BBopanna for large data sets we recommend you utilize indexes based on your data access patterns to avoid having DynamoDB perform scan operations against your entire table. Using indexes will allow AppSync to function much more efficiently.
Please see these guides https://docs.amplify.aws/cli/graphql/examples-and-solutions/ https://docs.amplify.aws/cli/graphql/data-modeling/#configure-a-primary-key
@iartemiev - we are aware of indexes. Can you comment on the issue: multiple calls will have to be made to retrieve all the data matching a filter condition, since the filter condition is applied to each page of data and not to the entire data set. Please look at the example given above in the description. As we understand it, an index will not solve this issue.
For us to better assist you, could you provide a concrete use case you want to implement with AppSync? What kind of application are you building and what are its high-level requirements? What’s the data model? What are your app’s data access patterns?
Regarding point 1. in your post:
AppSync GraphQL filters are applied to the first 1000 records and any matching records in those 1000 records are returned
Can you point me to where you're getting this information?
if there were 1 million (1,000,000) records in a table, 1,000,000/1000 = 1000 calls in the worst case, pagination after pagination with nextToken
This behavior can vary widely and depends on your data model and resolver implementation. What are you basing these calculations on? Can you share your resolver VTL or point me to the AppSync docs you’re referencing to arrive at the 1000 call figure?
@iartemiev - can we please get on a call so that we can explain the use case and also walk you through the response we received from AWS's internal support team, which led us to the above numbers? If yes, please share your email id.
@iartemiev ?
I would prefer that we communicate via GitHub. That way, other customers who have similar questions can benefit from the discussion.
Alternatively, feel free to post your questions in the #graphql-help
channel on the AWS Amplify Discord Server. It's actively monitored by experienced community members as well as AWS engineers.
Sure @iartemiev. Kindly bear with the long writeup below -
Requirement -
type Food
@model(subscriptions: { level: off })
@auth(
rules: [
{ allow: public, operations: [read], provider: iam }
{ allow: private, operations: [read], provider: iam }
{ allow: owner, ownerField: "owner", operations: [read, create, update, delete], provider: userPools }
]
)
{
id: ID!
@primaryKey
name: String! # name field for storing the food name
@index(
name: "byNameByPostalCode"
sortKeyFields: ["postalCode"]
queryField: "listFoodByName"
)
postalCode : Int # postalCode field for knowing in this locality it is available
@index(
name: "byPostalCodeByName"
sortKeyFields: ["name"]
queryField: "listFoodByPostalCode"
)
}
The number of records in this model-backed DynamoDB table will be 1 million+, and data will be randomly distributed by postalCode. Example: a Burger for 30005 could be the 1st record, the 500,000th record, and say the 1,000,000th record (just an example; the point is that data is stored very randomly in the table, with no ordering).
Options to Search for ALL burgers available at 30005 postal code :
Option A - Use the regular list query and filter by postalCode. The limit is 1000 by default, as per the AWS support ticket response we received (pasted below). We will get the 1st Burger record with a nextToken, then keep using the nextToken and making a whole lot of requests (to be specific, 499 requests) to get to the record at the 500,000th position; until then we get empty sets with a nextToken. Getting to the next matching record takes another 499 requests. So in all, 998 requests to get 3 matching records.
Option B - Use @searchable on the model. This will add Amazon OpenSearch Service (the successor to Elasticsearch), which increases the limit from 1000 to 10,000. Everything else is the same as Option A, so this reduces the problem by an order of 10: we now have to make 98 queries instead of 998.
98 queries is also too many from a performance/latency standpoint - no one will use a service with this kind of latency. Along with that, @searchable adds additional cost, so this too is not an ideal solution. What is the recommended solution for searching through large data sets like this?
Excerpts from our conversations with the AWS premium support team are below, after sanitising some content -
From the case notes, I understand that you have a large amount of data and would like to know the best way to get records with a filter expression on a list query while avoiding multiple queries with nextToken. Additionally, you would like to know alternatives for the same. Please correct me if I have misunderstood your query.
Firstly, I would like to explain that when using a list query, AppSync with DynamoDB will perform a Scan[1] operation on the table and return the items that match the filter expression from among the evaluated items. A Scan operation simply checks the first X items in the table and returns whichever of those X items match the filter expression. The value of X is determined by the limit variable passed into the query; if no limit is supplied, the default limit configured in the request mapping template is used. For an Amplify-generated API this default is 100, which means the scan will check only the first 100 items and leave the rest alone. The maximum value of the limit variable for a GraphQL API is 1000. If you run a query and the matching items are not within the first X evaluated items, the request will return either too few items, or none at all, instead of the expected result (all matching items).
Hence I would like to list the possible approaches to perform list query operation with filter expression on large amounts of data.
1) As mentioned previously, the number of items to be evaluated by DynamoDB is determined by the limit variable. It is important to remember this limit variable represents the number of items to be evaluated, not the number of items that will be returned by a given query. If you want to evaluate more items, you can simply change the default limit in the mapping template or pass a limit variable into the list query like below:
listFoods(limit:1000, filter: {postalCode: {eq: 30005}}) {
items {
id
name
}
}
2) The second option is to use pagination[2], as you mentioned. Pagination is the process of separating print or digital content into discrete pages. For AppSync this simply means breaking down our list queries into separate subsequent requests, where each request returns its own data, which we can combine to get the full result. Pagination with AppSync can be done by adding a nextToken field to your query.
listFoods(limit:1000, filter: {postalCode: {eq: 30005}}) {
items {
id
name
}
nextToken
}
The nextToken field represents the last evaluated key in the query. The nextToken value returned by the query will either be 'null' or a long string starting with "ey" and ending with "=". If the value is null this means that there are no more items to be evaluated in the table. If a non-null result is returned this means that there are more items in the table to evaluate.
This nextToken value represents where the previous query finished scanning the table; if we want to get the rest of the items in the table, we run a subsequent query passing in the nextToken as a query variable. We can continue doing this until the returned nextToken is null, meaning we have evaluated all items in the table.
listFoods(limit:1000, filter: {postalCode: {eq: 30005}}, nextToken:"ej……=") {
items {
id
name
}
nextToken
}
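The exhaustive-pagination loop this excerpt describes can be sketched as follows. This is a toy model, not a real AppSync client: `fetchPage` is a hypothetical stand-in for the `listFoods` call, using an array index in place of the opaque nextToken string.

```javascript
// Toy table standing in for the DynamoDB-backed model.
const table = [
  { id: "1", name: "Taco", postalCode: 10005 },
  { id: "2", name: "Burger", postalCode: 30005 },
  { id: "3", name: "Pasta", postalCode: 40005 },
  { id: "4", name: "Burger", postalCode: 30005 },
];

// Stand-in for one listFoods call: evaluate up to `limit` items starting
// at the token, filter that window, and return a token for the next window.
function fetchPage(nextToken, limit = 2) {
  const start = nextToken ?? 0; // the real token is an opaque string
  const page = table.slice(start, start + limit);
  const items = page.filter((item) => item.postalCode === 30005);
  const newToken = start + limit < table.length ? start + limit : null;
  return { items, nextToken: newToken };
}

// Keep requesting pages until the returned nextToken is null.
function listAllMatching() {
  const all = [];
  let nextToken = null;
  do {
    const page = fetchPage(nextToken);
    all.push(...page.items);
    nextToken = page.nextToken;
  } while (nextToken !== null);
  return all;
}

console.log(listAllMatching().map((i) => i.id)); // ["2", "4"]
```

Note how some pages contribute zero matching items but must still be fetched, which is exactly the cost the issue reporter is describing.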
3) You can also consider indexing your data to improve performance. You can create a global secondary index[3] for the field to query against. For example, in Amplify you can use the @index directive to define secondary indexes: https://docs.amplify.aws/cli/graphql/data-modeling/#configure-a-secondary-index
However, indexed queries also return a maximum of 1 MB of data at a time. For larger indexed queries, you again need to use pagination with the help of the nextToken.
4) For effective searching across data you can use @searchable. The reason the search query works is that Elasticsearch can evaluate more items than DynamoDB: 10,000 records. https://docs.amplify.aws/cli/graphql/search-and-result-aggregations/
Here is another very similar ticket that was closed without being given a good answer - https://github.com/aws-amplify/amplify-cli/issues/4166
It's been 5 days with no response on this. Can we please get a response?
@BBopanna, thank you for your patience and for providing the detailed information about your use case.
First off, I’d like to point out that the `limit` property does not have a maximum value. The only relevant limitation is the 1 MB result set per call size limit from DynamoDB (DDB). However, with the schema you provided above, I would estimate your average item size to be around 100-150 bytes, so you would be able to retrieve ~6.6-10k items in a single query operation without using pagination at all.
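A quick back-of-the-envelope check of that estimate (a sketch; the 100-150 byte average item size is the estimate from the comment above):

```javascript
// How many items fit in DynamoDB's 1 MB per-call result limit,
// at an assumed average item size of 100-150 bytes.
const PAGE_LIMIT_BYTES = 1024 * 1024;

function itemsPerPage(avgItemSizeBytes) {
  return Math.floor(PAGE_LIMIT_BYTES / avgItemSizeBytes);
}

console.log(itemsPerPage(150)); // 6990  (~6.6k items)
console.log(itemsPerPage(100)); // 10485 (~10k items)
```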
For your use case I recommend querying on your Global Secondary Index (GSI), which you’re already creating in your schema with the `@index` directive.
Amplify will generate the following 2 query statements for your schema in `src/graphql/queries.js`: `listFoodByName` and `listFoodByPostalCode`.
If you want to efficiently query for a particular food at a given postal code, you would then use
listFoodByPostalCode(postalCode: 30005, name: {eq: "Burger"}, limit: 10000) {
items {
id
name
postalCode
createdAt
updatedAt
}
}
The resolver mapping template generated for you by Amplify will produce a DDB Query mapping document from this request. The DDB Query will perform a direct lookup of your data based on the Partition Key (`postalCode`) and Sort Key (`name`).
This will return all of the records that match these parameters, up to 10k matches (the value of `limit`), in a single query operation. Even if you have millions of items in your DDB table, as long as you’re expecting fewer than 10k results back, you will get them all back with a single request, without pagination and without performing multiple calls to DDB.
Just to reiterate, using this example:
The number of records in this model-backed DynamoDB table will be 1 million+, and data will be randomly distributed by postalCode. Example: a Burger for 30005 could be the 1st record, the 500,000th record, and say the 1,000,000th record (just an example; the point is that data is stored very randomly in the table, with no ordering).
You will get all of these Burger records back with a single GraphQL Query using `listFoodByPostalCode`.
Thank you @iartemiev - on the max value for limit - why is your support team giving wrong information? (Won't give the case ID or the name of the agent here; don't want to throw them under the bus.)
So what is the purpose of the limit variable? To specify a lower limit but not an upper limit?
We will try this (with a reasonably large data set) and revert - please keep this ticket open (for at least a week, until we revert back).
So what is the purpose of the limit variable? To specify a lower limit but not an upper limit?
It's an upper limit. In the example I provided above, if we decrease the limit to 10 - `listFoodByPostalCode(postalCode: 30005, name: {eq: "Burger"}, limit: 10)` - the query will return up to 10 items.
The important distinction here is between executing a GQL query that performs a DDB Scan with a filter vs. executing a GQL query that performs a DDB Query operation. I'll provide a detailed comparison of these operations below.
(Note that the 1 MB result size limit applies to scans as well as index queries. This is a DynamoDB limit (independent of AppSync). You can find a list of all the DynamoDB service limits here and the AppSync service limits here.)
If we execute this GQL query with a filter expression
listFoods(limit: 10000, filter: {and:[
{postalCode: {eq: 30005}},
{name: {eq: "Burger"}}
]}) {
items {
id
name
}
}
The request resolver mapping template generated for you by Amplify will produce a Scan mapping document
{
"version": "2018-05-29",
"operation": "Scan",
"limit": 10000,
"filter": {
"expression": "((#name = :and_0_name_eq) AND (#postalCode = :and_1_postalCode_eq))",
"expressionNames": {
"#name": "name",
"#postalCode": "postalCode"
},
"expressionValues": {
":and_0_name_eq": {
"S": "Burger"
},
":and_1_postalCode_eq": {
"N": "30005"
}
}
}
}
In this case DynamoDB will read up to `limit` records from the table, creating a result set of 10k records first. Afterwards, it removes records from that result set that do not match the filter expression. So the number of records you get back can be anywhere from 0 to `limit`, depending on the filter expression you specified.
Essentially this operation works exactly as you described in your original post, but you can adjust the `limit` variable to any value. E.g. you could set the limit to `1000000` and DDB will scan through 1 million records before applying the filter expression to them. After applying the filter expression, if the resulting document size is over 1 MB, you will need to paginate through the results to get all of the matching items back.
Let's say we have the following contents in our table
id | name | postalCode |
---|---|---|
1 | Taco | 10005 |
2 | Burger | 70005 |
3 | Burger | 30005 |
4 | Steak | 30005 |
5 | Soup | 30005 |
6 | Salad | 30005 |
7 | Pasta | 30005 |
8 | Burger | 40005 |
9 | Rice | 30005 |
10 | Burger | 30005 |
If we execute
listFoods(limit: 5, filter: {and:[
{postalCode: {eq: 30005}},
{name: {eq: "Burger"}}
]}) {
items {
id
name
postalCode
}
nextToken
}
DDB will scan through the first 5 items, then apply our filter expression and return the matching result:

id | name | postalCode |
---|---|---|
3 | Burger | 30005 |
Along with a `nextToken` value that we can pass to the next invocation of our GQL query.
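This scan-then-filter order of operations can be modeled in a few lines of JavaScript. This is a toy model of the semantics described above, not DynamoDB itself; the data is the sample table from this comment.

```javascript
// Sample table from the comment above.
const table = [
  { id: "1", name: "Taco", postalCode: 10005 },
  { id: "2", name: "Burger", postalCode: 70005 },
  { id: "3", name: "Burger", postalCode: 30005 },
  { id: "4", name: "Steak", postalCode: 30005 },
  { id: "5", name: "Soup", postalCode: 30005 },
  { id: "6", name: "Salad", postalCode: 30005 },
  { id: "7", name: "Pasta", postalCode: 30005 },
  { id: "8", name: "Burger", postalCode: 40005 },
  { id: "9", name: "Rice", postalCode: 30005 },
  { id: "10", name: "Burger", postalCode: 30005 },
];

// Scan semantics: read up to `limit` items FIRST, then apply the filter
// to that window only. Matches outside the window are missed.
function scanWithFilter(limit, filter) {
  const evaluated = table.slice(0, limit); // result set of up to `limit` items
  return evaluated.filter(filter); // filter applied after the read
}

const matches = scanWithFilter(
  5,
  (i) => i.postalCode === 30005 && i.name === "Burger"
);
console.log(matches.map((i) => i.id)); // ["3"] - id 10 is outside the window
```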
Pros:
Cons:
Alternative:
Once we've established that a common data access pattern for our application involves fetching data by `postalCode` and by `name`, or by a combination of the two, we can create GSIs on these fields to optimize our queries. When using Amplify this is accomplished through the `@index` directive, just like you have in your schema.
This enables us to perform index queries on our table.
listFoodByPostalCode(postalCode: 30005, name: {eq: "Burger"}, limit: 10000) {
items {
id
name
postalCode
}
}
Note that we're not using a `filter` expression here.
The request resolver mapping template generated for you by Amplify will produce a Query mapping document
{
"version": "2018-05-29",
"operation": "Query",
"limit": 10000,
"query": {
"expression": "#postalCode = :postalCode AND #sortKey = :sortKey",
"expressionNames": {
"#postalCode": "postalCode",
"#sortKey": "name"
},
"expressionValues": {
":postalCode": {
"N": "30005"
},
":sortKey": {
"S": "Burger"
}
}
},
"index": "byPostalCodeByName",
"scanIndexForward": true
}
Unlike a filter expression, the query expression is used to retrieve the initial result set of up to `limit` records. So we will retrieve up to 10k items from the GSI that match on our Partition Key (`postalCode`) and our Sort Key (`name`).
Using the same simplified data set
id | name | postalCode |
---|---|---|
1 | Taco | 10005 |
2 | Burger | 70005 |
3 | Burger | 30005 |
4 | Steak | 30005 |
5 | Soup | 30005 |
6 | Salad | 30005 |
7 | Pasta | 30005 |
8 | Burger | 40005 |
9 | Rice | 30005 |
10 | Burger | 30005 |
If we execute
listFoodByPostalCode(limit: 5, postalCode: 30005, name: {eq: "Burger"}) {
items {
id
name
postalCode
}
nextToken
}
We will get back both records
id | name | postalCode |
---|---|---|
3 | Burger | 30005 |
10 | Burger | 30005 |
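The same toy-model approach shows why the index query finds both records: the key condition selects matching items first, and `limit` only caps the matches. This is an illustrative model of the semantics, not DynamoDB itself.

```javascript
const table = [
  { id: "1", name: "Taco", postalCode: 10005 },
  { id: "2", name: "Burger", postalCode: 70005 },
  { id: "3", name: "Burger", postalCode: 30005 },
  { id: "4", name: "Steak", postalCode: 30005 },
  { id: "5", name: "Soup", postalCode: 30005 },
  { id: "6", name: "Salad", postalCode: 30005 },
  { id: "7", name: "Pasta", postalCode: 30005 },
  { id: "8", name: "Burger", postalCode: 40005 },
  { id: "9", name: "Rice", postalCode: 30005 },
  { id: "10", name: "Burger", postalCode: 30005 },
];

// A GSI is (conceptually) the table's data re-sorted by
// (partition key, sort key) - here (postalCode, name).
const byPostalCodeByName = [...table].sort(
  (a, b) => a.postalCode - b.postalCode || a.name.localeCompare(b.name)
);

// Query semantics: the key condition builds the result set FIRST;
// `limit` caps the matches, not the number of items evaluated.
function queryIndex(postalCode, name, limit) {
  return byPostalCodeByName
    .filter((i) => i.postalCode === postalCode && i.name === name)
    .slice(0, limit);
}

console.log(queryIndex(30005, "Burger", 5).map((i) => i.id)); // ["3", "10"]
```

Contrast this with the scan model earlier in the thread, where `limit: 5` found only id 3.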
Pros:
Cons:
Note - you can combine query and filter expressions as well. It doesn't have to be one or the other. In that case, DynamoDB will create the initial result set based on the query expression, then apply the filter to it, removing any items that do not match the filter expression from the final result set.
E.g. if you add the field `rating` to your schema:
type Food @model
{
id: ID! @primaryKey
name: String!
@index(
name: "byNameByPostalCode"
sortKeyFields: ["postalCode"]
queryField: "listFoodByName"
)
postalCode : Int!
@index(
name: "byPostalCodeByName"
sortKeyFields: ["name"]
queryField: "listFoodByPostalCode"
)
rating: Int
}
And you wanted to retrieve all the items for a particular food at a location and filter them by rating > 3, you could use:
listFoodByPostalCode(postalCode: 30005, name: {eq: "Burger"}, filter: {rating: {gt: 3}}, limit: 10000) {
items {
id
name
postalCode
}
}
DDB would first retrieve up to 10k records by PK and SK, then remove all items from the result set where the rating is 3 or below.
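That two-step behavior (query expression first, filter expression second) can be sketched like this; the rating values are made up for illustration.

```javascript
// Toy data with the hypothetical `rating` field added.
const foods = [
  { id: "3", name: "Burger", postalCode: 30005, rating: 5 },
  { id: "10", name: "Burger", postalCode: 30005, rating: 2 },
  { id: "8", name: "Burger", postalCode: 40005, rating: 4 },
];

// Step 1: key condition + limit builds the result set.
// Step 2: filter expression prunes it.
function queryThenFilter(postalCode, name, limit, filter) {
  const keyMatches = foods
    .filter((i) => i.postalCode === postalCode && i.name === name)
    .slice(0, limit);
  return keyMatches.filter(filter);
}

const result = queryThenFilter(30005, "Burger", 10000, (i) => i.rating > 3);
console.log(result.map((i) => i.id)); // ["3"] - id 10 pruned (rating 2)
```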
Wow @iartemiev - that answer is very comprehensive - one of the best answers ever on GitHub, imo.
Very clear @iartemiev - we will try this on our app and post an update.
If there is any way to promote this answer or add this use case to your Amplify AppSync docs, many many more folks will find it very useful.
Sincere gratitude for such a detailed answer!
Now want to extend this discussion a bit further -
Let's take the order entity. Use case: a customer searches for a burger at 30005 with rating > 3, picks one of the listings, and places the order. A sample order entity is below. THIS ENTITY WILL CONTAIN THE LARGEST AMOUNT OF DATA IN OUR APP SINCE MANY, MANY ORDER RECORDS ARE CREATED DAILY - HYPOTHETICALLY 10,000+ ORDERS ADDED DAILY. In most cases the customer and business care about the upcoming order; of course, there will also be a need to show order history in a different part of the app.
What kind of entity modelling can you recommend for this case so that we have good retrieval performance for super large data sets like this? Also, we do not mind running a scheduled task to move orders from the previous date to another storage to keep this entity light. Do we need another entity, say orderHistory? If yes, then that table too will have a ginormous amount of data, since we keep moving records to it daily - we plan to keep order history for 5 years. Looking forward to your recommendation.
type Order @model
{
id: ID! @primaryKey
customerId: String!
@index(
name: "byCustomerIdByTime"
sortKeyFields: ["orderTime"]
queryField: "listOrdersByCustomeByOrderTime"
)
orderTime : AWSDateTime!
@index(
name: "byOrderTimeByCustomerId"
sortKeyFields: ["customerId"]
queryField: "listOrderTimeByCustomerId"
)
# more fields...
}
Hi @iartemiev, we tried the GSI approach and were able to retrieve records as expected. Thank you for the solution.
Can you please answer the above query?
I’m glad to hear the suggestion worked for you!
Even if you’ve got 10k+ records generated per day you will accumulate ~4 million records per year. We have customers with billions of items in their DDB tables who continue to experience consistent performance at that scale. Here are a couple of case studies to check out: https://aws.amazon.com/solutions/case-studies/duolingo-case-study-dynamodb/ and https://aws.amazon.com/solutions/case-studies/amazon-primevideo-dynamodb/. I would not worry about moving the data out of your table any time soon in order to accommodate your primary use case. As long as you have good indexes on your data and avoid unnecessary table scans for common query patterns, you should not need to do anything extra.
That being said, if there’s a business need for performing analysis and business intelligence on your Order data, you may find value in warehousing your Order data in Amazon Redshift. Check out the following guide: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/RedshiftforDynamoDB.html. I definitely don’t think this is something that you have to do early on, but it may be worth considering down the road if such a need arises.
For our primary use case we’ll start with the access patterns, as always. It appears we have the following two:
For (1) we need to have a way to track the status of an order. I’ll add a `status` field to the type along with an `enum` for order statuses.
Amplify automatically adds the `createdAt` and `updatedAt` timestamp fields to each model. The values for these fields are set server-side in the resolvers (`createdAt` and `updatedAt` are first set in the create mutation resolver, and `updatedAt` is set in the update mutation resolver). I’ll use the `createdAt` field as the Sort Key for one of our GSIs, but if you prefer to use your own timestamp field, that’s totally fine as well.
type Order @model {
id: ID! @primaryKey
customerId: String!
@index(
name: "byCustomerIdByTime"
sortKeyFields: ["createdAt"]
queryField: "listOrdersByCustomerByCreatedAt"
)
@index(
name: "byCustomerIdByStatus"
sortKeyFields: ["status"]
queryField: "listOrdersByCustomerByStatus"
)
createdAt: AWSDateTime
status: OrderStatus! @default(value: "pending") # @default isn't necessary here, just using it for demonstration
}
enum OrderStatus {
pending
completed
cancelled
}
Here are some examples of index queries we are able to perform with this schema
listOrdersByCustomerByCreatedAt(customerId: "1", createdAt: {gt: "2022-07-03"}, limit: 100) {
items {
id
customerId
status
createdAt
updatedAt
}
nextToken
}
listOrdersByCustomerByStatus(customerId: "1", status:{eq:"pending"}) {
items {
id
customerId
status
createdAt
updatedAt
}
}
Alternatively, if you’d like to treat the `status` field as a range attribute, you can type it as an `Int` instead:
type Order @model {
id: ID! @primaryKey
customerId: String!
@index(
name: "byCustomerIdByTime"
sortKeyFields: ["createdAt"]
queryField: "listOrdersByCustomerByCreatedAt"
)
@index(
name: "byCustomerIdByStatus"
sortKeyFields: ["status"]
queryField: "listOrdersByCustomerByStatus"
)
createdAt: AWSDateTime
status: Int! @default(value: 0) # @default isn't necessary here, just using it for demonstration
}
In your application you can map each value to a particular status, e.g. `0` = “pending”, `1` = “completed”, `2` = “cancelled”, etc.
This will give you the ability to query for combinations of statuses, e.g.
Retrieve all orders that are not pending
listOrdersByCustomerByStatus(customerId: "1", status:{gt:0}) {
items {
id
customerId
status
createdAt
updatedAt
}
}
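The Int-based status can be sketched client-side like this. The numeric mapping is the assumed one from above, and the query function is a toy stand-in for the generated index query, not the Amplify API.

```javascript
// Assumed mapping of status codes to meanings (0 = pending,
// 1 = completed, 2 = cancelled), as described above.
const OrderStatus = { pending: 0, completed: 1, cancelled: 2 };

const orders = [
  { id: "a", customerId: "1", status: OrderStatus.pending },
  { id: "b", customerId: "1", status: OrderStatus.completed },
  { id: "c", customerId: "1", status: OrderStatus.cancelled },
  { id: "d", customerId: "2", status: OrderStatus.pending },
];

// Toy equivalent of listOrdersByCustomerByStatus(customerId: "1",
// status: {gt: 0}) - a range condition on the numeric sort key.
function ordersByCustomerWithStatusGt(customerId, threshold) {
  return orders.filter(
    (o) => o.customerId === customerId && o.status > threshold
  );
}

console.log(ordersByCustomerWithStatusGt("1", 0).map((o) => o.id)); // ["b", "c"]
```

The string `enum` version cannot express "not pending" in a single sort-key condition, which is the trade-off being described.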
@iartemiev - please take a bow. Thank you - very clear.
So one day - say 5 years from now - if the data has grown to a couple of billion records and we want to selectively purge, what would be your recommendation?
ps - didn't know @default existed; always wanted this feature though - thanks for sharing.
So one day - say 5 years from now - if the data has grown to a couple of billion records and we want to selectively purge, what would be your recommendation?
The easiest way to automatically purge records from DDB is through configuring TTL (check out the linked overview and all the subsections).
If you need to archive the data that you're deleting, you can use TTL in concert with a data lake pipeline such as this one: https://aws.amazon.com/blogs/database/automatically-archive-items-to-s3-using-dynamodb-time-to-live-with-aws-lambda-and-amazon-kinesis-firehose/
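DynamoDB TTL expects the designated attribute to hold an epoch timestamp in seconds, after which the item becomes eligible for deletion. A minimal sketch of computing a 5-year expiry; the function name and retention period are illustrative, based on the 5-year plan described above.

```javascript
// DynamoDB TTL attributes are epoch timestamps in SECONDS (not ms).
// 5-year retention, per the order-history plan above (ignoring leap days).
const FIVE_YEARS_SECONDS = 5 * 365 * 24 * 60 * 60;

function ttlFor(createdAtMs, retentionSeconds = FIVE_YEARS_SECONDS) {
  return Math.floor(createdAtMs / 1000) + retentionSeconds;
}

// An order created at 2022-01-01T00:00:00Z (epoch 1640995200s)
// expires five years later.
console.log(ttlFor(1640995200000)); // 1798675200
```

You would store this value in the item attribute configured as the table's TTL attribute when creating the order.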
Thank you @iartemiev - closing this ticket. Resolved.
JavaScript Framework
React
Amplify APIs
GraphQL API
Amplify Categories
auth, storage, function, hosting
Describe the bug
AppSync, backed by DynamoDB, does not return all the filtered data in a single call.
AppSync APIs have filter conditions to search for specific data sets; however, AppSync GraphQL filters are applied to the first 1000 records, and any matching records in those 1000 records are returned. This way, if there were 1 million (1,000,000) records in a table, 1,000,000/1000 = 1000 calls in the worst case, pagination after pagination with nextToken!!!! NOT PERFORMANT AT ALL - no one will use this kind of API, and the app will fail badly on performance.
The solution the AWS support team offered is "use @searchable" on the entity; this will configure the OpenSearch Service, which is powered by Elasticsearch.
@AWS team, please advise on how to go about cases like these where there are 1 million+ records in different tables - this seems like a very real-world scenario. In a traditional RDBMS, good indexes get this job done without hiccups, and the search/where condition is applied to the entire data set rather than to a limited subset as in AppSync. Please recommend the best solution.
Expected behavior
Should be able to get all filter/search-condition-matched data in one or a couple of calls, to keep the performance of an application using AppSync optimal.
Reproduction steps
NA