Closed: BBopanna closed this issue 2 years ago
@ErikCH - Please note, thanks.
@BBopanna for large data sets we recommend you utilize indexes based on your data access patterns to avoid having DynamoDB perform scan operations against your entire table. Using indexes will allow AppSync to function much more efficiently.
Please see these guides https://docs.amplify.aws/cli/graphql/examples-and-solutions/ https://docs.amplify.aws/cli/graphql/data-modeling/#configure-a-primary-key
@iartemiev - we are aware of indexes. Can you comment on the issue: multiple calls will have to be made to retrieve all the data matching a filter condition, since the filter condition is applied to each page of data and not to the entire data set. Please look at the example given above in the description. As we understand it, an index will not solve this issue.
For us to better assist you, could you provide a concrete use case you want to implement with AppSync? What kind of application are you building and what are its high-level requirements? What’s the data model? What are your app’s data access patterns?
Regarding point 1. in your post:
AppSync GraphQL filters are applied to the first 1000 records and any matching records in those 1000 records are returned
Can you point me to where you're getting this information?
if there were 1 million (1,000,000) records in a table, 1,000,000/1000 = 1000 calls in the worst case, pagination after pagination with nextToken
This behavior can vary widely and depends on your data model and resolver implementation. What are you basing these calculations on? Can you share your resolver VTL or point me to the AppSync docs you’re referencing to arrive at the 1000 call figure?
@iartemiev - can we please get on a call so that we can explain the use case and also walk you through the response we received from AWS's internal support team, which led us to the above numbers? If yes, please share your email id.
@iartemiev ?
I would prefer that we communicate via GitHub. That way, other customers who have similar questions can benefit from the discussion.
Alternatively, feel free to post your questions in the #graphql-help
channel on the AWS Amplify Discord Server. It's actively monitored by experienced community members as well as AWS engineers.
Sure @iartemiev. Kindly bear with the long writeup below -
Requirement -
type Food
@model(subscriptions: { level: off })
@auth(
rules: [
{ allow: public, operations: [read], provider: iam }
{ allow: private, operations: [read], provider: iam }
{ allow: owner, ownerField: "owner", operations: [read, create, update, delete], provider: userPools }
]
)
{
id: ID!
@primaryKey
name: String! # name field for storing the food name
@index(
name: "byNameByPostalCode"
sortKeyFields: ["postalCode"]
queryField: "listFoodByName"
)
postalCode : Int # postalCode field for knowing in this locality it is available
@index(
name: "byPostalCodeByName"
sortKeyFields: ["name"]
queryField: "listFoodByPostalCode"
)
}
The number of records in this model-backed DynamoDB table will be 1 million+, and data will be randomly distributed by postalCode. Example: a Burger for 30005 could be the 1st record, the 500,000th record, and say the 1,000,000th record (just an example; the point is that data is stored very randomly in the table, with no ordering).
Options to Search for ALL burgers available at 30005 postal code :
Option A - Use the regular list query and filter by postalCode. The limit is 1000 by default, as per the AWS support ticket response we received (pasted below). We will get the 1st Burger record with a nextToken, then keep using the nextToken and making a whole lot of requests (to be specific, 499 requests) to get to the record at the 500,000th position; until then we get empty sets with a nextToken. Getting to the next matching record takes another 499 requests. So in all, 998 requests to get 3 matching records.
Option B - Use @searchable on the model. This will add Amazon OpenSearch Service (the successor to Elasticsearch), which increases the limit from 1000 to 10,000. Everything else is the same as Option A, so this reduces the problem by an order of 10: we now have to make 98 queries instead of 998.
98 queries is also too many from a performance/latency standpoint - no one will use a service with this kind of latency. Along with that, @searchable adds additional cost, so this too is not an ideal solution. What is the recommended solution for searching through large data sets like this?
Excerpts from our conversations with the AWS premium support team are below, after sanitising some content -
From the case notes, I understand that you have a large amount of data and would like to know the best way to get records with a filter expression on a list query while avoiding multiple queries with nextToken. Additionally, you would like to know alternatives for the same. Please correct me if I have misunderstood your query.
Firstly, I would like to explain that when using a list query, AppSync with DynamoDB will perform a Scan[1] operation on the table and return the items that match the filter expression from among the evaluated items. A Scan operation simply checks the first X items in the table and returns whichever of those X items match the filter expression. The value of X is determined by the limit variable passed into the query; if no limit is supplied, the default limit configured in the request mapping template is used. For an Amplify-generated API this default is 100, which means the scan will check only the first 100 items and leave the rest alone. The maximum value of the limit variable for a GraphQL API is 1000. If you run a query and the matching items are not within the first X evaluated items, the request will return either too few items, or none at all, instead of the expected result (all matching items).
Hence I would like to list the possible approaches to perform list query operation with filter expression on large amounts of data.
1) As mentioned previously, the number of items to be evaluated by DynamoDB is determined by the limit variable. It is important to remember this limit variable represents the number of items to be evaluated, not the number of items that will be returned by a given query. If you want to evaluate more items, you can simply change the default limit in the mapping template or pass a limit variable into the list query like below:
listFoods(limit:1000, filter: {postalCode: {eq: 30005}}) {
items {
id
name
}
}
2) The second option is to use pagination[2], as you mentioned. Pagination is the process of separating print or digital content into discrete pages. For AppSync this simply means breaking down our list queries into separate subsequent requests, where each request returns its own data, which we can combine to get the full result. Pagination with AppSync can be done by adding a nextToken field to your query.
listFoods(limit:1000, filter: {postalCode: {eq: 30005}}) {
items {
id
name
}
nextToken
}
The nextToken field represents the last evaluated key in the query. The nextToken value returned by the query will either be 'null' or a long string starting with "ey" and ending with "=". If the value is null this means that there are no more items to be evaluated in the table. If a non-null result is returned this means that there are more items in the table to evaluate.
This nextToken value represents where the previous query finished scanning the table; if we want to get the rest of the items in the table, we run a subsequent query passing in the nextToken as a query variable. We can continue doing this until the returned nextToken is null, meaning we have evaluated all items in the table.
listFoods(limit:1000, filter: {postalCode: {eq: 30005}}, nextToken:"ej……=") {
items {
id
name
}
nextToken
}
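The exhaustive-pagination loop this excerpt describes can be sketched as follows. This is a toy model, not a real AppSync client: `fetchPage` is a hypothetical stand-in for the `listFoods` call, using an array index in place of the opaque nextToken string.

```javascript
// Toy table standing in for the DynamoDB-backed model.
const table = [
  { id: "1", name: "Taco", postalCode: 10005 },
  { id: "2", name: "Burger", postalCode: 30005 },
  { id: "3", name: "Pasta", postalCode: 40005 },
  { id: "4", name: "Burger", postalCode: 30005 },
];

// Stand-in for one listFoods call: evaluate up to `limit` items starting
// at the token, filter that window, and return a token for the next window.
function fetchPage(nextToken, limit = 2) {
  const start = nextToken ?? 0; // the real token is an opaque string
  const page = table.slice(start, start + limit);
  const items = page.filter((item) => item.postalCode === 30005);
  const newToken = start + limit < table.length ? start + limit : null;
  return { items, nextToken: newToken };
}

// Keep requesting pages until the returned nextToken is null.
function listAllMatching() {
  const all = [];
  let nextToken = null;
  do {
    const page = fetchPage(nextToken);
    all.push(...page.items);
    nextToken = page.nextToken;
  } while (nextToken !== null);
  return all;
}

console.log(listAllMatching().map((i) => i.id)); // ["2", "4"]
```

Note how some pages contribute zero matching items but must still be fetched, which is exactly the cost the issue reporter is describing.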
3) You can also consider indexing your data to improve performance. You can create a global secondary index[3] for the field to query against. For example, in Amplify you can use the @index directive to define secondary indexes: https://docs.amplify.aws/cli/graphql/data-modeling/#configure-a-secondary-index
However, indexed queries also return a maximum of 1 MB of data at a time. For larger indexed queries, you again need to use pagination with the help of the nextToken.
4) For effective searching across data you can use @searchable. The reason the search query works is that Elasticsearch can evaluate more items than DynamoDB: 10,000 records. https://docs.amplify.aws/cli/graphql/search-and-result-aggregations/
Here is another very similar ticket that was closed without being given a good answer - https://github.com/aws-amplify/amplify-cli/issues/4166
It's been 5 days with no response on this. Can we please get a response?
@BBopanna, thank you for your patience and for providing the detailed information about your use case.
First off, I’d like to point out that the `limit` property does not have a maximum value. The only relevant limitation is the 1 MB result set per call size limit from DynamoDB (DDB). However, with the schema you provided above, I would estimate your average item size to be around 100-150 bytes, so you would be able to retrieve ~6.6-10k items in a single query operation without using pagination at all.
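A quick back-of-the-envelope check of that estimate (a sketch; the 100-150 byte average item size is the estimate from the comment above):

```javascript
// How many items fit in DynamoDB's 1 MB per-call result limit,
// at an assumed average item size of 100-150 bytes.
const PAGE_LIMIT_BYTES = 1024 * 1024;

function itemsPerPage(avgItemSizeBytes) {
  return Math.floor(PAGE_LIMIT_BYTES / avgItemSizeBytes);
}

console.log(itemsPerPage(150)); // 6990  (~6.6k items)
console.log(itemsPerPage(100)); // 10485 (~10k items)
```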
For your use case I recommend querying on your Global Secondary Index (GSI), which you’re already creating in your schema with the `@index` directive.
Amplify will generate the following 2 query statements for your schema in `src/graphql/queries.js`: `listFoodByName` and `listFoodByPostalCode`.
If you want to efficiently query for a particular food at a given postal code, you would then use
listFoodByPostalCode(postalCode: 30005, name: {eq: "Burger"}, limit: 10000) {
items {
id
name
postalCode
createdAt
updatedAt
}
}
The resolver mapping template generated for you by Amplify will produce a DDB Query mapping document from this request. The DDB Query will perform a direct lookup of your data based on the Partition Key (`postalCode`) and Sort Key (`name`).
This will return all of the records that match these parameters, up to 10k matches (the value of `limit`), in a single query operation. Even if you have millions of items in your DDB table, as long as you’re expecting fewer than 10k results back, you will get them all back with a single request, without pagination and without performing multiple calls to DDB.
Just to reiterate, using this example:
The number of records in this model-backed DynamoDB table will be 1 million+, and data will be randomly distributed by postalCode. Example: a Burger for 30005 could be the 1st record, the 500,000th record, and say the 1,000,000th record (just an example; the point is that data is stored very randomly in the table, with no ordering).
You will get all of these Burger records back with a single GraphQL Query using `listFoodByPostalCode`.
Thank you @iartemiev - on the max value for limit - why is your support team giving wrong information? (Won't give the case ID or the name of the agent here; don't want to throw them under the bus.)
So what is the purpose of the limit variable? To specify a lower limit but not an upper limit?
We will try this (with a reasonably large data set) and revert - please keep this ticket open (for at least a week, until we revert back).
So what is the purpose of the limit variable? To specify a lower limit but not an upper limit?
It's an upper limit. In the example I provided above, if we decrease the limit to 10 - `listFoodByPostalCode(postalCode: 30005, name: {eq: "Burger"}, limit: 10)` - the query will return up to 10 items.
The important distinction here is between executing a GQL query that performs a DDB Scan with a filter vs. executing a GQL query that performs a DDB Query operation. I'll provide a detailed comparison of these operations below.
(Note that the 1 MB result size limit applies to scans as well as index queries. This is a DynamoDB limit (independent of AppSync). You can find a list of all the DynamoDB service limits here and the AppSync service limits here.)
If we execute this GQL query with a filter expression
listFoods(limit: 10000, filter: {and:[
{postalCode: {eq: 30005}},
{name: {eq: "Burger"}}
]}) {
items {
id
name
}
}
The request resolver mapping template generated for you by Amplify will produce a Scan mapping document
{
"version": "2018-05-29",
"operation": "Scan",
"limit": 10000,
"filter": {
"expression": "((#name = :and_0_name_eq) AND (#postalCode = :and_1_postalCode_eq))",
"expressionNames": {
"#name": "name",
"#postalCode": "postalCode"
},
"expressionValues": {
":and_0_name_eq": {
"S": "Burger"
},
":and_1_postalCode_eq": {
"N": "30005"
}
}
}
}
In this case DynamoDB will read up to `limit` records from the table, creating a result set of 10k records first. Afterwards, it removes records from that result set that do not match the filter expression. So the number of records you get back can be anywhere from 0 to `limit`, depending on the filter expression you specified.
Essentially this operation works exactly as you described in your original post, but you can adjust the `limit` variable to any value. E.g. you could set the limit to `1000000` and DDB will scan through 1 million records before applying the filter expression to them. After applying the filter expression, if the resulting document size is over 1 MB, you will need to paginate through the results to get all of the matching items back.
Let's say we have the following contents in our table
id | name | postalCode |
---|---|---|
1 | Taco | 10005 |
2 | Burger | 70005 |
3 | Burger | 30005 |
4 | Steak | 30005 |
5 | Soup | 30005 |
6 | Salad | 30005 |
7 | Pasta | 30005 |
8 | Burger | 40005 |
9 | Rice | 30005 |
10 | Burger | 30005 |
If we execute
listFoods(limit: 5, filter: {and:[
{postalCode: {eq: 30005}},
{name: {eq: "Burger"}}
]}) {
items {
id
name
postalCode
}
nextToken
}
DDB will scan through the first 5 items, then apply our filter expression and return the matching result:

id | name | postalCode |
---|---|---|
3 | Burger | 30005 |
Along with a `nextToken` value that we can pass to the next invocation of our GQL query.
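This scan-then-filter order of operations can be modeled in a few lines of JavaScript. This is a toy model of the semantics described above, not DynamoDB itself; the data is the sample table from this comment.

```javascript
// Sample table from the comment above.
const table = [
  { id: "1", name: "Taco", postalCode: 10005 },
  { id: "2", name: "Burger", postalCode: 70005 },
  { id: "3", name: "Burger", postalCode: 30005 },
  { id: "4", name: "Steak", postalCode: 30005 },
  { id: "5", name: "Soup", postalCode: 30005 },
  { id: "6", name: "Salad", postalCode: 30005 },
  { id: "7", name: "Pasta", postalCode: 30005 },
  { id: "8", name: "Burger", postalCode: 40005 },
  { id: "9", name: "Rice", postalCode: 30005 },
  { id: "10", name: "Burger", postalCode: 30005 },
];

// Scan semantics: read up to `limit` items FIRST, then apply the filter
// to that window only. Matches outside the window are missed.
function scanWithFilter(limit, filter) {
  const evaluated = table.slice(0, limit); // result set of up to `limit` items
  return evaluated.filter(filter); // filter applied after the read
}

const matches = scanWithFilter(
  5,
  (i) => i.postalCode === 30005 && i.name === "Burger"
);
console.log(matches.map((i) => i.id)); // ["3"] - id 10 is outside the window
```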
Pros:
Cons:
Alternative:
Once we've established that a common data access pattern for our application involves fetching data by `postalCode` and by `name`, or by a combination of the two, we can create GSIs on these fields to optimize our queries. When using Amplify this is accomplished through the `@index` directive, just like you have in your schema.
This enables us to perform index queries on our table.
listFoodByPostalCode(postalCode: 30005, name: {eq: "Burger"}, limit: 10000) {
items {
id
name
postalCode
}
}
Note that we're not using a `filter` expression here.
The request resolver mapping template generated for you by Amplify will produce a Query mapping document
{
"version": "2018-05-29",
"operation": "Query",
"limit": 10000,
"query": {
"expression": "#postalCode = :postalCode AND #sortKey = :sortKey",
"expressionNames": {
"#postalCode": "postalCode",
"#sortKey": "name"
},
"expressionValues": {
":postalCode": {
"N": "30005"
},
":sortKey": {
"S": "Burger"
}
}
},
"index": "byPostalCodeByName",
"scanIndexForward": true
}
Unlike a filter expression, the query expression is used to retrieve the initial result set of up to `limit` records. So we will retrieve up to 10k items from the GSI that match on our Partition Key (`postalCode`) and our Sort Key (`name`).
Using the same simplified data set
id | name | postalCode |
---|---|---|
1 | Taco | 10005 |
2 | Burger | 70005 |
3 | Burger | 30005 |
4 | Steak | 30005 |
5 | Soup | 30005 |
6 | Salad | 30005 |
7 | Pasta | 30005 |
8 | Burger | 40005 |
9 | Rice | 30005 |
10 | Burger | 30005 |
If we execute
listFoodByPostalCode(limit: 5, postalCode: 30005, name: {eq: "Burger"}) {
items {
id
name
postalCode
}
nextToken
}
We will get back both records
id | name | postalCode |
---|---|---|
3 | Burger | 30005 |
10 | Burger | 30005 |
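The same toy-model approach shows why the index query finds both records: the key condition selects matching items first, and `limit` only caps the matches. This is an illustrative model of the semantics, not DynamoDB itself.

```javascript
const table = [
  { id: "1", name: "Taco", postalCode: 10005 },
  { id: "2", name: "Burger", postalCode: 70005 },
  { id: "3", name: "Burger", postalCode: 30005 },
  { id: "4", name: "Steak", postalCode: 30005 },
  { id: "5", name: "Soup", postalCode: 30005 },
  { id: "6", name: "Salad", postalCode: 30005 },
  { id: "7", name: "Pasta", postalCode: 30005 },
  { id: "8", name: "Burger", postalCode: 40005 },
  { id: "9", name: "Rice", postalCode: 30005 },
  { id: "10", name: "Burger", postalCode: 30005 },
];

// A GSI is (conceptually) the table's data re-sorted by
// (partition key, sort key) - here (postalCode, name).
const byPostalCodeByName = [...table].sort(
  (a, b) => a.postalCode - b.postalCode || a.name.localeCompare(b.name)
);

// Query semantics: the key condition builds the result set FIRST;
// `limit` caps the matches, not the number of items evaluated.
function queryIndex(postalCode, name, limit) {
  return byPostalCodeByName
    .filter((i) => i.postalCode === postalCode && i.name === name)
    .slice(0, limit);
}

console.log(queryIndex(30005, "Burger", 5).map((i) => i.id)); // ["3", "10"]
```

Contrast this with the scan model earlier in the thread, where `limit: 5` found only id 3.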
Pros:
Cons:
Note - you can combine query and filter expressions as well. It doesn't have to be one or the other. In that case, DynamoDB will create the initial result set based on the query expression, then apply the filter to it, removing any items that do not match the filter expression from the final result set.
E.g. if you add the field `rating` to your schema:
type Food @model
{
id: ID! @primaryKey
name: String!
@index(
name: "byNameByPostalCode"
sortKeyFields: ["postalCode"]
queryField: "listFoodByName"
)
postalCode : Int!
@index(
name: "byPostalCodeByName"
sortKeyFields: ["name"]
queryField: "listFoodByPostalCode"
)
rating: Int
}
And you wanted to retrieve all the items for a particular food at a location and filter them by rating > 3, you could use:
listFoodByPostalCode(postalCode: 30005, name: {eq: "Burger"}, filter: {rating: {gt: 3}}, limit: 10000) {
items {
id
name
postalCode
}
}
DDB would first retrieve up to 10k records by PK and SK, then remove all items from the result set where the rating is 3 or below.
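That two-step behavior (query expression first, filter expression second) can be sketched like this; the rating values are made up for illustration.

```javascript
// Toy data with the hypothetical `rating` field added.
const foods = [
  { id: "3", name: "Burger", postalCode: 30005, rating: 5 },
  { id: "10", name: "Burger", postalCode: 30005, rating: 2 },
  { id: "8", name: "Burger", postalCode: 40005, rating: 4 },
];

// Step 1: key condition + limit builds the result set.
// Step 2: filter expression prunes it.
function queryThenFilter(postalCode, name, limit, filter) {
  const keyMatches = foods
    .filter((i) => i.postalCode === postalCode && i.name === name)
    .slice(0, limit);
  return keyMatches.filter(filter);
}

const result = queryThenFilter(30005, "Burger", 10000, (i) => i.rating > 3);
console.log(result.map((i) => i.id)); // ["3"] - id 10 pruned (rating 2)
```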
Wow @iartemiev - that answer is very comprehensive - one of the best answers ever on GitHub, imo.
Very clear @iartemiev - we will try this on our app and post an update.
If there is any way to promote this answer or add this use case to your Amplify AppSync docs, many many more folks will find it very useful.
Sincere gratitude for such a detailed answer!
Now want to extend this discussion a bit further -
Let's take the order entity. Use case: a customer searches for a burger at 30005 with rating > 3, picks one of the listings, and places the order. A sample order entity is below. THIS ENTITY WILL CONTAIN THE LARGEST AMOUNT OF DATA IN OUR APP SINCE MANY, MANY ORDER RECORDS ARE CREATED DAILY - HYPOTHETICALLY 10,000+ ORDERS ADDED DAILY. In most cases the customer and business care about the upcoming order; of course, there will also be a need to show order history in a different part of the app.
What kind of entity modelling can you recommend for this case so that we have good retrieval performance for super large data sets like this? Also, we do not mind running a scheduled task to move orders from the previous date to another storage to keep this entity light. Do we need another entity, say orderHistory? If yes, then that table too will have a ginormous amount of data, since we keep moving records to it daily - we plan to keep order history for 5 years. Looking forward to your recommendation.
type Order @model
{
id: ID! @primaryKey
customerId: String!
@index(
name: "byCustomerIdByTime"
sortKeyFields: ["orderTime"]
queryField: "listOrdersByCustomeByOrderTime"
)
orderTime : AWSDateTime!
@index(
name: "byOrderTimeByCustomerId"
sortKeyFields: ["customerId"]
queryField: "listOrderTimeByCustomerId"
)
# more fields...
}
Hi @iartemiev, we tried the GSI approach and were able to retrieve records as expected. Thank you for the solution.
Can you please answer the above query?
I’m glad to hear the suggestion worked for you!
Even if you’ve got 10k+ records generated per day you will accumulate ~4 million records per year. We have customers with billions of items in their DDB tables who continue to experience consistent performance at that scale. Here are a couple of case studies to check out: https://aws.amazon.com/solutions/case-studies/duolingo-case-study-dynamodb/ and https://aws.amazon.com/solutions/case-studies/amazon-primevideo-dynamodb/. I would not worry about moving the data out of your table any time soon in order to accommodate your primary use case. As long as you have good indexes on your data and avoid unnecessary table scans for common query patterns, you should not need to do anything extra.
That being said, if there’s a business need for performing analysis and business intelligence on your Order data, you may find value in warehousing your Order data in Amazon Redshift. Check out the following guide: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/RedshiftforDynamoDB.html. I definitely don’t think this is something that you have to do early on, but it may be worth considering down the road if such a need arises.
For our primary use case we’ll start with the access patterns, as always. It appears we have the following two:
For (1) we need to have a way to track the status of an order. I’ll add a `status` field to the type along with an `enum` for order statuses.
Amplify automatically adds the `createdAt` and `updatedAt` timestamp fields to each model. The values for these fields are set server-side in the resolvers (`createdAt` and `updatedAt` are first set in the create mutation resolver, and `updatedAt` is set in the update mutation resolver). I’ll use the `createdAt` field as the Sort Key for one of our GSIs, but if you prefer to use your own timestamp field, that’s totally fine as well.
type Order @model {
id: ID! @primaryKey
customerId: String!
@index(
name: "byCustomerIdByTime"
sortKeyFields: ["createdAt"]
queryField: "listOrdersByCustomerByCreatedAt"
)
@index(
name: "byCustomerIdByStatus"
sortKeyFields: ["status"]
queryField: "listOrdersByCustomerByStatus"
)
createdAt: AWSDateTime
status: OrderStatus! @default(value: "pending") # @default isn't necessary here, just using it for demonstration
}
enum OrderStatus {
pending
completed
cancelled
}
Here are some examples of index queries we are able to perform with this schema
listOrdersByCustomerByCreatedAt(customerId: "1", createdAt: {gt: "2022-07-03"}, limit: 100) {
items {
id
customerId
status
createdAt
updatedAt
}
nextToken
}
listOrdersByCustomerByStatus(customerId: "1", status:{eq:"pending"}) {
items {
id
customerId
status
createdAt
updatedAt
}
}
Alternatively, if you’d like to treat the `status` field as a range attribute, you can type it as an `Int` instead:
type Order @model {
id: ID! @primaryKey
customerId: String!
@index(
name: "byCustomerIdByTime"
sortKeyFields: ["createdAt"]
queryField: "listOrdersByCustomerByCreatedAt"
)
@index(
name: "byCustomerIdByStatus"
sortKeyFields: ["status"]
queryField: "listOrdersByCustomerByStatus"
)
createdAt: AWSDateTime
status: Int! @default(value: 0) # @default isn't necessary here, just using it for demonstration
}
In your application you can map each value to a particular status, e.g. `0` = “pending”, `1` = “completed”, `2` = “cancelled”, etc.
This will give you the ability to query for combinations of statuses, e.g.
Retrieve all orders that are not pending
listOrdersByCustomerByStatus(customerId: "1", status:{gt:0}) {
items {
id
customerId
status
createdAt
updatedAt
}
}
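The Int-based status can be sketched client-side like this. The numeric mapping is the assumed one from above, and the query function is a toy stand-in for the generated index query, not the Amplify API.

```javascript
// Assumed mapping of status codes to meanings (0 = pending,
// 1 = completed, 2 = cancelled), as described above.
const OrderStatus = { pending: 0, completed: 1, cancelled: 2 };

const orders = [
  { id: "a", customerId: "1", status: OrderStatus.pending },
  { id: "b", customerId: "1", status: OrderStatus.completed },
  { id: "c", customerId: "1", status: OrderStatus.cancelled },
  { id: "d", customerId: "2", status: OrderStatus.pending },
];

// Toy equivalent of listOrdersByCustomerByStatus(customerId: "1",
// status: {gt: 0}) - a range condition on the numeric sort key.
function ordersByCustomerWithStatusGt(customerId, threshold) {
  return orders.filter(
    (o) => o.customerId === customerId && o.status > threshold
  );
}

console.log(ordersByCustomerWithStatusGt("1", 0).map((o) => o.id)); // ["b", "c"]
```

The string `enum` version cannot express "not pending" in a single sort-key condition, which is the trade-off being described.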
@iartemiev - please take a bow. Thank you - very clear.
So one day - say 5 years from now - if the data has grown to a couple of billion records and we want to selectively purge, what would be your recommendation?
ps - didn't know @default existed; always wanted this feature though - thanks for sharing.
So one day - say 5 years from now - if the data has grown to a couple of billion records and we want to selectively purge, what would be your recommendation?
The easiest way to automatically purge records from DDB is through configuring TTL (check out the linked overview and all the subsections).
If you need to archive the data that you're deleting, you can use TTL in concert with a data lake pipeline such as this one: https://aws.amazon.com/blogs/database/automatically-archive-items-to-s3-using-dynamodb-time-to-live-with-aws-lambda-and-amazon-kinesis-firehose/
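DynamoDB TTL expects the designated attribute to hold an epoch timestamp in seconds, after which the item becomes eligible for deletion. A minimal sketch of computing a 5-year expiry; the function name and retention period are illustrative, based on the 5-year plan described above.

```javascript
// DynamoDB TTL attributes are epoch timestamps in SECONDS (not ms).
// 5-year retention, per the order-history plan above (ignoring leap days).
const FIVE_YEARS_SECONDS = 5 * 365 * 24 * 60 * 60;

function ttlFor(createdAtMs, retentionSeconds = FIVE_YEARS_SECONDS) {
  return Math.floor(createdAtMs / 1000) + retentionSeconds;
}

// An order created at 2022-01-01T00:00:00Z (epoch 1640995200s)
// expires five years later.
console.log(ttlFor(1640995200000)); // 1798675200
```

You would store this value in the item attribute configured as the table's TTL attribute when creating the order.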
Thank you @iartemiev - closing this ticket. Resolved.
JavaScript Framework
React
Amplify APIs
GraphQL API
Amplify Categories
auth, storage, function, hosting
Describe the bug
AppSync, backed by DynamoDB, does not return all the filtered data in a single call.
AppSync APIs have filter conditions to search for specific data sets; however, AppSync GraphQL filters are applied to the first 1000 records, and any matching records in those 1000 records are returned. This way, if there were 1 million (1,000,000) records in a table, 1,000,000/1000 = 1000 calls in the worst case, pagination after pagination with nextToken!!!! NOT PERFORMANT AT ALL - no one will use this kind of API, and the app will fail badly on performance.
The solution the AWS support team offered is "use @searchable" on the entity; this will configure the OpenSearch Service, which is powered by Elasticsearch.
@AWS team, please advise on how to go about cases like these where there are 1 million+ records in different tables - this seems like a very real-world scenario. In a traditional RDBMS, good indexes get this job done without hiccups, and the search/where condition is applied to the entire data set rather than to a limited subset as in AppSync. Please recommend the best solution.
Expected behavior
Should be able to get all filter/search-condition-matched data in one or a couple of calls, to keep the performance of an application using AppSync optimal.
Reproduction steps
NA