aws-amplify / amplify-category-api

The AWS Amplify CLI is a toolchain for simplifying serverless web and mobile development. This plugin provides functionality for the API category, allowing for the creation and management of GraphQL and REST based backends for your amplify project.
https://docs.amplify.aws/
Apache License 2.0

Total counts in GraphQL queries #405

Open houmark opened 5 years ago

houmark commented 5 years ago

Is your feature request related to a problem? Please describe. I think it's way past due that Amplify supports total counts in GraphQL queries. Other GraphQL based platforms have this built in: by simply adding totalCount (or similar) to the query, no matter the limit, they get back the total count in addition to the (filtered) data.

Describe the solution you'd like This should at least work for DynamoDB backed models and, of course, also for search based queries that go through ElasticSearch.

Describe alternatives you've considered Making a Lambda function that is a field on each model using the @function directive, but since we are using both listItems and searchItems with filters added, the implementation is not simple: we would have to reapply those filters in the Lambda function to get the correct count.

Making custom resolvers seems like another not very fun route that is not very scalable or maintainable, and once again, this should be an "out of the box" one-liner available to a developer. With either a Lambda or some other custom resolver I'm looking at hours or days of development.

Additional context This is a must-have feature and there's not really any workaround for displaying total counts for systems with many items — at least none that I know of. I read several bug reports, but none of them seems to have a simple solution. That it has not yet been developed by AWS is beyond my understanding, as pulling out counts is one of the most common things to do when developing web apps.

PeteDuncanson commented 3 years ago

I've been thinking about this one as it's an area of our product development we are now getting to. I'd like to warm this up with a few options that hopefully might lead to something. Like a lot of others I'd just assumed that this would be trivial, a fault of my not understanding document databases well enough, I guess. It's a bit of a trap to fall into when so far down the line of developing; try telling your client/manager that you need 5 days to get out the total record count for something... it's not a good look.

One issue with this whole thread seems to be the assumption that whatever the solution is, it has to cope with zillions of records, when in fact a lot of apps might only be counting in the 100s or 1000s in some tables, so I don't think there needs to be a one-size-fits-all solution. I get that some of those solutions could cause some big issues if misused, but with some knowledge of the constraints or suggested working parameters that should be safely navigated.

I'll start off with the smallest (and some might say naive) solution first and build up to the bigger ones. Remember, there isn't a one size fits all solution. My focus is on trying to make it easy to add this functionality with minimal hassle from the "developer building out the schema" point of view.

Using a scan to count records

If you don't have many records (< 1,000 or so) you might be able to get away with using the scannedCount field as described here: https://www.riccardosirigu.com/blog/aws-amplify-dynamodb-total-count-graphql-queries/. Note that it can "top out" when it hits the 1MB result set limit and so report a false total. Ideally it would be nice to know if there are more records that could have been returned or if that is the total amount, i.e. was another page of results possible? That would at least give you the option of saying "record 1 of at least 246+" or something if we can't get an exact amount back.
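
For reference, a rough client-side sketch of that approach in TypeScript, assuming you have already added a scannedCount field to the model's connection type and returned $ctx.result.scannedCount from the list response template as the linked post describes (the type and field names here are illustrative, not generated code):

import { API, graphqlOperation } from "aws-amplify";

const LIST_POSTS_WITH_COUNT = /* GraphQL */ `
  query ListPosts($limit: Int) {
    listPosts(limit: $limit) {
      items { id }
      scannedCount   # only meaningful up to DynamoDB's 1MB scan limit
      nextToken
    }
  }
`;

async function getApproximateTotal(): Promise<number> {
  const result: any = await API.graphql(graphqlOperation(LIST_POSTS_WITH_COUNT, { limit: 1000 }));
  return result.data.listPosts.scannedCount;
}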

Another option here is to do a full paged scan, if I opt into it, so I can get a "true total". This would loop through all the pages of results for a scan and return the total. I keep hearing that scans are expensive, but again, if I've only got a relatively small number of records that might be a hit I'm happy taking for now, swallowing the cost and pushing off doing it "properly for epic scale" until it becomes a happy problem (i.e. we've grown so much we need to change it).

I can see this being a new directive that could be added to a model, something like:

type Posts @model {
    title: String!
    content: String!
    author: String!
    published: Boolean!
    totalPosts: Int! @scannedCount
}

This new @scannedCount directive would automatically do a scan of all the records in the Post table, paging through all the results until it finishes (thus masking from us the issues with the 1MB returned data limit). Used as shown above it would just give the total of all records for that table.

Now if we ignore the obvious performance traps that this directive sets up for us (we will discuss those next), I think we can all agree that from a user's point of view that looks like a simple and clean way of getting what we want. It's a super quick way to get a total value out; problem solved, from the user's point of view, initially. For fast prototyping or "good enough" features that might do me.

So now on to the problems this will cause or that need solving. Luckily now we have a directive we can add to it with some attributes to give us some finer grained control or indeed we could create multiple directives to give us the following functionality.

A quick note on Lambda functions: we are going to need some generating, so would it be worthwhile if these were generated and stored in the /backoffice/functions/ folder as user generated functions are? That way, should a user want to extend the function with additional logic, they can. The functions this directive might generate should be all nicely stubbed out with pointers on where and how to add additional logic, and would need pushing to deploy like any other user generated function. This directive shouldn't hide all the magic in the cloud and VTL templates; it should do the minimum to make the common case go away and then give the user an extension point to add additional logic as their needs demand. This might mean that removing the directive from the schema won't delete the resources from the cloud automagically if you've edited them. I'd be happy with a warning about "Some functions appear to have been edited so you will need to manually delete the following, or (A) delete all anyway".

Too many paged scans

If left unchecked this could get out of hand, with the number of scanned pages growing over time, so as a fail-safe we should have a default number of pages that this will loop over before returning. The default can be agreed later, but if the user wants to opt in to setting it they could do so in the directive, like so:

totalPosts: Int! @scannedCount( maxPageDepth: 5 )

The maxPageDepth would be an optional param to set how many pages you are happy to scan to get that total, to stop it all grinding to a halt in the future (but hey, we opted into this solution; it's not ideal, it's pragmatic, and we know the trade-offs).

The trouble with the above is the same as the scannedCount solution from the blog post: it doesn't tell you if it reached the end of the pages or not, so you've no way of knowing if the reported number of results is the total amount or just the amount it got before hitting maxPageDepth. But it might be good enough for the right-sized app, and it sure is simple to do from a user's point of view. As long as the limitations are documented, users should be able to make an informed choice.
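
To make the trade-off concrete, here is a minimal sketch (not an existing Amplify feature) of what a Lambda resolver behind such a directive could do, assuming the aws-sdk v2 DocumentClient; it returns both the count and whether it actually reached the end of the table, so the caller can render "at least N" when it bailed out early:

import { DocumentClient } from "aws-sdk/clients/dynamodb";

const ddb = new DocumentClient();

// Count items with a paged COUNT scan, stopping after maxPageDepth pages.
async function scannedCount(tableName: string, maxPageDepth = 5) {
  let count = 0;
  let pages = 0;
  let lastKey: DocumentClient.Key | undefined;

  do {
    const page = await ddb.scan({
      TableName: tableName,
      Select: "COUNT",          // return counts only, no item data
      ExclusiveStartKey: lastKey,
    }).promise();

    count += page.Count ?? 0;
    lastKey = page.LastEvaluatedKey;
    pages += 1;
  } while (lastKey && pages < maxPageDepth);

  // isComplete === false means we hit maxPageDepth before the end of the table,
  // so count is only a lower bound ("record 1 of at least 246+").
  return { count, isComplete: !lastKey };
}

// usage inside a resolver attached to the totalPosts field:
// const { count, isComplete } = await scannedCount(process.env.POST_TABLE!);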

Additionally maybe a trigger in CloudWatch could be created as part of using this directive to ping a notification when a field is taking too long to rescan? A nice bit of Developer Experience and another way to introduce a new feature to those of us still learning our way around what AWS has to offer.

Total results for filters, sort keys or conditions.

I want something other than the total count; I want a count matching some filter or other. This one's a bit trickier, but could we allow filters in the directive?

totalPosts: Int! @scannedCount
totalPublishedPosts: Int! @scannedCount( filter: "published=true") #something like that anyway

That would work, but it's not dynamic (I can't pass in a filter at runtime), though it would be good for various counters/stats that could be handy to have. I wonder if we could generate a GraphQL query that would allow you to pass a filter through to this directive/resolver (the lines become blurred as to which this now is; I'm guessing it's a resolver under the hood, but then I don't know enough about what a resolver does under the hood as yet, so my terms might be wrong here; feel free to enlighten me).

Sort keys could maybe be passed through too, to give you a total for a subset in a similar manner. It's not my strong point, so this part of the idea is hazy on purpose, but it's here to get you thinking.

Another issue that @houmark raises repeatedly is that the number of results that match a filter can only be worked out if we scan all the records and also it doesn't work if we have a limit. This @scannedCount would do both I think in its own brute force way.

Caching.

Do we really want or need to look this total up every time? If it's a small number of records then why not; it wouldn't be all that slow nor costly (again, we are talking about a relatively small number of records here, not eBay-products or Uber-rides sized apps), and we could cache it on the frontend if needed.

But if you want to opt in to caching we could do it one of two ways that could be set via the directive at the expense of another table and a function.

You could do this with a simple boolean or enum (to allow for more options in the future):

totalPosts: Int! @scannedCount( cached: true )

I'm not 100% sure of the internals here, so I'm going to propose what I think it could do based on my understanding of the options; there might be other, better ways, but this should give the gist. We need to store the value somewhere and then pull it back out.

One way would be to do this under the hood:

  1. Spin up a new DynamoDB table to store all the totals in, one row per requested @scannedCount directive in the schema. If another @scannedCount directive has already created this table we can just re-use it. Along with the total we would also store some other fields like model name, target field, createdAt and updatedAt.
  2. Then we automatically generate a Lambda function that would invalidate or recalculate the value for the cache. There are a few ways to do this that I'll discuss below.

Recalculate the cached value for any CRUD operation

This version uses a Lambda trigger function using DynamoDB Streams (one of the options when creating a function via the CLI so a known thing) that hooks into this models table updates and will rescan the table whenever a record is added/removed. This is a bit brute force but it would work.

The trouble with this method is it could cause needless thrashing of the tables when doing an import; we don't need that number regenerated on every entry if we are adding 100s or 1000s of records. It might be that you can live with the hit if the numbers are small, but an option to disable the caching while doing an import, yet still trigger a final count at the end, would be good. I sadly don't know how to do that nicely. You could maybe add a "disabled" boolean attribute to the directive, but that would need another push to disable it, and then how would you get it to recalculate after enabling it again without just "touching" an existing record afterwards? That might be enough?

Invalidate the cache on any CRUD operation, recalculate when requested if needed

totalPosts: Int! @scannedCount( invalidOnCRUD: true )   # open to suggestions on any of these names btw

Another option is similar to the above, except we don't do the rescan when the record changes; we just null the value in the cache table. Then the resolver that gets the total for that field will do the rescan for us if the value is null. Depending on the data model and workflow of the app this might be a better fit.
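
A minimal sketch of what that lazily rebuilt cache could look like inside a Lambda field resolver, assuming a hypothetical totals cache table and the aws-sdk v2 DocumentClient (table and key names are made up for illustration):

import { DocumentClient } from "aws-sdk/clients/dynamodb";

const ddb = new DocumentClient();
const CACHE_TABLE = process.env.TOTALS_CACHE_TABLE!; // hypothetical cache table
const POST_TABLE = process.env.POST_TABLE!;

// Resolver for totalPosts: return the cached value, or rebuild it if the
// invalidation trigger nulled it out.
export const handler = async (): Promise<number> => {
  const cached = await ddb.get({
    TableName: CACHE_TABLE,
    Key: { id: "Post#totalPosts" },
  }).promise();

  if (cached.Item && cached.Item.total != null) {
    return cached.Item.total;
  }

  // Cache miss: page through a COUNT scan to rebuild the total.
  let total = 0;
  let lastKey: DocumentClient.Key | undefined;
  do {
    const page = await ddb.scan({
      TableName: POST_TABLE,
      Select: "COUNT",
      ExclusiveStartKey: lastKey,
    }).promise();
    total += page.Count ?? 0;
    lastKey = page.LastEvaluatedKey;
  } while (lastKey);

  await ddb.put({
    TableName: CACHE_TABLE,
    Item: { id: "Post#totalPosts", total, updatedAt: new Date().toISOString() },
  }).promise();

  return total;
};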

My only worry here is that we separate out the two bits of cache work to two separate areas, the function to clear the cache and the resolver to build the cache.

Generate the totals on a recurring timespan

An alternative if we are ok with the totals potentially not being 100% accurate at any given time is to have a timed Function do the rescan repeatedly after a set timespan. Again this could be set via the directive:

totalPosts: Int! @scannedCount( refreshCountInSeconds: 300 ) # regenerate every 5 minutes

I don't know the best practice for passing timespans around, so that is up for debate, but hopefully you get the idea.

How to get the value from the cache into our model?

How we get that cached value out into the model I'm hazy on; I guess we would just have a resolver generated which goes and gets the value from the "cache table", but better minds than mine will know how best to do that.

Rather than having these fields on the model itself, do we need a PostTotals model generated instead that stores all these top-line numbers, possibly pulling them from the cache table via a resolver or when generated by amplify codegen etc.? In that case, does the directive move up to the whole table rather than the field level?

type Posts 
    @model 
    @scannedCount( fieldName: "totalPosts" ) {
    title: String!
    content: String!
    author: String!
    published: Boolean!
}

Running totals for larger record sets

What if we outgrow the @scannedCount directive above? What options do we have? Scanning is too slow due to the number of records we've got. This might be where a @runningTotal directive is the answer.

It has been suggested above that you can keep a rolling count of various totals by hooking into the "DynamoDB stream" which will fire an event every time you add/update/delete a record in DynamoDB. We could then use these to increment or decrease a totals counter.

This solution though has some instant problems:

  1. It's not all that user friendly to tell developers they have to do all this additional work for something that should, in a lot of minds, be "baked in". That is not to say that Amazon should be expected to do all the work, and I'm sure it is doable in a morning or so to wire something together, but it just kills productivity and adds doubt about "have we made the right choice here?", and Amplify is meant to be about speeding up the gluing together of these services to let you focus on building an amazing app, rather than becoming an expert in yet more Amazon services and DevOps from day one. Just testing that setup can take a while... and then how much testing do you need before you can truly trust it and sleep at night, I wonder? Better, I think, that Amplify make some of this pain go away for the obvious use cases and then give users with edge cases a place to start from.
  2. It works for the top-line numbers but not so much for edge cases. Again, these could be handled by the user, but the boilerplate for this could be put in place and the functions for it stored in the solution so they can be edited and expanded on.

I'd like to see a directive that could wire all this up for me from the schema, this would spin up the functions needed (stored locally for editing) and store the values in a separate table similar to the cache table mentioned earlier if not the same table but different records so we don't clash? Lambda trigger functions hooked into DynamoDB streams would do the adding/subtracting. As these are built to run all the time we won't have the same potential scaling/performance problems we had with the @scannedCount directive.

type Posts 
    @model 
    @runningTotal( fieldName: "totalPosts" ) {
    title: String!
    content: String!
    author: String!
    published: Boolean!
}

Now what about that idea to have multiple counts on a table? Could we do this:

type Posts 
    @model 
    @runningTotal( fieldName: "totalPosts" ) 
    @runningTotal( fieldName: "totalPublishedPosts", filter: "published=true" ) {
    title: String!
    content: String!
    author: String!
    published: Boolean!
}

I don't see why not. Need more power? Then you should be able to go and edit the function code, or copy it, and now you've got a great starting point for adding a few more lines to get more counters as needed. Ease of development continues, and I think we are still using all the tech that Amplify already has and uses, so it's not a massive stretch of knowledge or comfort zone.

I want more POWER!

Getting into wish list land with this last one, suggest the above gets done before this one gets looked at but...

The final option is something similar to what @RossWilliams suggested above: "going large" and streaming the table data from DynamoDB into a relational database so you can do full-on SQL magic on it. This one is currently beyond my pay grade; I laughed when I first read Ross' comment as it sounded crazy complicated, but after a bit of reading I can see how it all might stitch together... but again, should someone have to do all that to get some stats or counters out of their data? The answer might simply be "you are using the wrong database in the first place!" and I should have gone with a relational database from the off... but that isn't how Amplify is marketed or sold to me!

The basic idea is that by hooking into a DynamoDB Stream we can push record changes to a relational database and then from there we can pull out whatever data SQL will allow us to do. Sounds excessive but if its what we've got then lets use it. Could another directive handle this one?

type Posts 
    @model 
    @streamToSQL {  # Guess this would be RedShift?
    title: String!
    content: String!
    author: String!
    published: Boolean!
}

This would push any CRUD operations into a relational database; I guess you would need to add this database first as a resource via the CLI (amplify add sql). It would create a table named the same as the model (we could allow you to override this in the directive).
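
As a rough sketch of the plumbing such a directive would have to generate (here assuming plain Postgres via the pg client rather than Redshift, and a hand-written column mapping), a stream handler that mirrors Post changes into a relational table could look something like this:

import { DynamoDBStreamHandler } from "aws-lambda";
import DynamoDB from "aws-sdk/clients/dynamodb";
import { Client } from "pg";

// Mirror Post changes into a relational table so counts become plain SQL.
export const handler: DynamoDBStreamHandler = async (event) => {
  const pg = new Client({ connectionString: process.env.POSTGRES_URL }); // hypothetical connection secret
  await pg.connect();
  try {
    for (const record of event.Records) {
      if (record.eventName === "REMOVE") {
        const old = DynamoDB.Converter.unmarshall(record.dynamodb!.OldImage!);
        await pg.query("DELETE FROM posts WHERE id = $1", [old.id]);
      } else {
        const item = DynamoDB.Converter.unmarshall(record.dynamodb!.NewImage!);
        await pg.query(
          `INSERT INTO posts (id, title, author, published)
           VALUES ($1, $2, $3, $4)
           ON CONFLICT (id) DO UPDATE SET title = $2, author = $3, published = $4`,
          [item.id, item.title, item.author, item.published]
        );
      }
    }
  } finally {
    await pg.end();
  }
};

// ...after which SELECT COUNT(*) FROM posts WHERE published = true; is trivial on the mirror.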

What to do about relations and @connections, though? In theory it could include those, and I guess that would be the dream.

This does raise the idea that do we need a @sql resolver similar to the @http resolver so we could quickly get the results of a sql query into our GraphQL? I'll leave that one for another issue I think ;)

What about auth?

@houmark raises the tricky point about authentication earlier and, to be honest, I don't know. I'm not all that sure if this is an issue with passing @auth to Elasticsearch or just within the API. I'm hoping it's an Elasticsearch thing, which would mean that the above solutions should work, with the possible exception of the relational database idea; I can see that needing a lot more thought.

It should be possible to apply @auth to the model or the individual fields like normal, which should cover it for the @scannedCount and @runningTotal directives. Alternatively, if we use a separate table like PostTotals as discussed above, then @auth could maybe be applied to that? That should re-use the logic already in place, I think.

Summing up

Sorry this is so long; it wasn't meant to be. It's a tough problem and I've been giving it a lot of thought, as I hope you can tell. I'm hampered by only knowing some of what is needed to pick the best route, but hopefully this adds to the conversation and brings a user's point of view on how I'd personally like it to roughly work and what constraints I might be happy living with to get some of the functionality I and others say we need.

I've tried really hard to leverage what Amplify already has or uses so that it shouldn't be a massive stretch of the imagination or skillset to get it to do what it needs to.

Let's keep the conversation going: what have the Amplify devs come up with in the time this has been open, where might it go, and how can we get there (and ideally an idea of when)?

RossWilliams commented 3 years ago

tl;dr: You should create a Lambda resolver for this field and share your code for others to start from. I don't see an automated solution fitting into the project without significant effort and problems to overcome.

@PeteDuncanson you are an excellent writer and I thoroughly enjoyed reading this. My thoughts and responses are a lot more scattered, but let me try to explain my crazy ideas and somewhat address your proposals. I agree the proposals I have made in previous comments are not ideal, but still believe they are the best we have given the path Amplify has taken.

  1. I too have a love / hate relationship with DynamoDB. If I was designing Amplify I would not choose it for many of the reasons mentioned by other commenters in the issue. It is designed as a replacement for multi-node systems, and therefore has missing convenience features developers have come to expect. My main issue is the cost for developers to learn it and develop in-house patterns to use it is too high. I'm going to guess most Amplify projects plan to store under 100GB of data. At these levels the cost of an RDS solution is smaller than my developer costs. RDS will also have better performance on these lower utilisation levels. Alex DeBrie, author of The DynamoDB Book, has similar opinions. Upsides are that DynamoDB is in the free tier for developers to get started, and that has to help with adoption.
    1. Proponents of DynamoDB can be pretty condescending in their language. I've seen AWS employees talk about SQL users as getting left behind in the past, that this is the future and you are going to have to learn it anyway, that when they teach customers they get it in under a week, and that if you only store 100GB you aren't ambitious or you have a hobby rather than a business. If in my comments I suggested DynamoDB is in any way easy, I'm sorry.
    2. IMO the main weakness of Amplify as compared to Firebase is dealing with DynamoDB instead of Firestore.
  2. The limitations are not just from DynamoDB, but also from AppSync's templates. If you are going to rely on AppSync to query DynamoDB, you can't page through results. You get a single API call to DynamoDB and that's it. So any solution for counts (where tables are over 1MB) is going to need Lambda resolvers.
  3. Implementing the "simple" solution in Lambda is possible today. Attach a function resolver to a "tableCount" property, have the function read the context of the request, do an appropriate series of scans / queries, and return the result. But there are many reasons to not provide this "out-of-the-box".
    1. The support issues. Imagine a "cold" table with 20MB of data. Average cold scan time is around 60ms per 1MB page, so suddenly people start asking why such a simple GraphQL request takes over a second. I not only think this is realistic, but it is likely something that is only exposed to developers once they are in production. Pre-release there might be only 1MB in a table, but later there may be support issues of "my API is slowing down 20% a day".
    2. Your suggestion of a directive with a "max scanned count" is a good way to limit the downsides of a growing table, but gives me bad vibes from a development experience perspective. Especially since a filter could grossly undercount the true value depending on its distribution in the table. In the worst case, scanning the first 5MB might return zero results, and yet the true count could be greater than zero. Returning zero will get customer complaints if it is not accurate.
    3. Caching is a set of trade-offs where it is hard to please everyone, and again a support issue: "My users just added an item and now they can't see what they wrote because the paging is off". Caching also has a complexity explosion with authorisation (there might need to be per-user aggregate counts), and caching with filters causes another factorial explosion of combinations. The size of the undertaking to cover off the edge cases makes me think that, if you are going to keep your data in DynamoDB, a per-app solution is best. I don't know if there is much boilerplate that can be written that doesn't make certain wrong assumptions for an individual app.
    4. The pain of learning to read a DynamoDB stream and manually update counts is largely a one-time cost. I agree that there could be better boilerplate code and cfn templates to assist here. I'd go even further and say that the Javascript / Typescript Document Client is severely lacking in developer ergonomics. But once learned, on a fresh project it can be re-created in under an hour.
  4. One of the reasons I suggested using another system for total counts with filters is that DynamoDB is really best used when you can minimise the variability in the requests. If you know the access patterns then creating aggregates is annoying, but not weeks of effort. But the earlier commenters were trying to solve a solution with high variability in requests - accepting arbitrary filters and getting aggregates there.
    1. Side note - I don't think Amplify should have filters, or they should be opt-in and not enabled by default. There are too many issues created in GitHub where users treat filters as a SQL WHERE clause.
  5. Arbitrary filters and total counts can be done in Elasticsearch today, and there is a directive to get data streamed into Elasticsearch today. If there is a need for arbitrary filters and total counts, this is my default recommendation. There have been two issues pointed out.
    1. The counts may not be accurate because of authorisation rules. For my own work I'm not concerned about this because I don't have a project with 3rd party clients. I control the requests and I don't rely on authorisation rules to provide filtering, I only use authorisation rules to prevent bad actors.
      1. Auth-as-filter is a general bad pattern that should be stated better in the documentation
    2. Amplify doesn't expose the total hits count. First, users can write their own AppSync templates to expose this field. For offset pagination, I've also submitted a PR, aws-amplify/amplify-cli#5098 to better support these users. And my memory is foggy here, but I believe the total count was added by swaysway a few months ago.
  6. Reporting and analytics within DynamoDB is extremely hard for "exploratory" work. With RDS I could have a secondary node to run my arbitrary SQL without impacting users, but DynamoDB doesn't have that nicety. Having Firehose, Glue, and Athena is a big help to me, but keeping it maintained is not fun. I use streams to aggregate data for known reporting metrics. But some days I get some very interesting requests to look into patterns, or see if a metric is useful to add into the reporting suite. Streams and Glue gives me my data as parquet files in S3 so I can then run arbitrary queries on data that is relatively fresh.
  7. If DynamoDB Streams -> Firehose -> Glue were a built-in pattern in Amplify, moving data to other data sources would be a lot easier. When users run into problems such as needing aggregates, timeseries formats, graphs, and reporting, they won't feel so stuck and frustrated. It's an overwhelming complexity of tools to manage and configure on your own, which is why I think Amplify needs to bring it into the CLI.
  8. At a higher level, I'd like to see Amplify documentation encourage users to branch out beyond the CLI tool and configure more with cfn, Lambdas, or SAM tools. A set of nice sample Lambdas for how to handle common use cases, along with the cfn templates needed would both solve these user issues and help as a teaching tool for further custom work. Relying on third parties to publish blog posts has so far not yielded much fruit.

ilyagru commented 3 years ago

At an even higher level, imagine a developer (e.g. me, or a start-up; this is the target audience of AWS Amplify, right?) who creates an app which is basically a social network. People can follow each other, create posts, and like and comment on those posts. This is a pretty basic scenario, and for this kind of app displaying follower, like and comment counts is quite crucial. Of course, at launch the numbers are not so big and fall under 1MB, but with growth they get bigger: the more users, the bigger those numbers are. The developer would like to choose AWS Amplify because of the infrastructure, ease, integrity and velocity, which are so needed at the early stages of new projects (the Free Tier is also much appreciated). But suddenly he has a dilemma: why doesn't such a robust tool support such a basic feature? Should he fight these limitations (and then maintain the workarounds), or just search for a tool which supports that, or at least where it could be done without hacking?

duwerq commented 3 years ago

@ilyagru this scenario is easily solved with a pipeline resolver or a DynamoDB stream. After adding a like, update the like count on the post. Where it gets tricky is paginated (and even more so filtered) results.
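
For anyone who wants to see that spelled out, a minimal sketch of the stream variant, assuming a Like table whose items carry a postID, a likesCount attribute on the Post item, and the aws-sdk v2 DocumentClient (all names are illustrative):

import { DynamoDBStreamHandler } from "aws-lambda";
import DynamoDB, { DocumentClient } from "aws-sdk/clients/dynamodb";

const ddb = new DocumentClient();
const POST_TABLE = process.env.POST_TABLE!; // hypothetical env var with the Post table name

// Triggered by the Like table's stream: bump or drop likesCount on the parent Post.
export const handler: DynamoDBStreamHandler = async (event) => {
  for (const record of event.Records) {
    const image = record.dynamodb?.NewImage ?? record.dynamodb?.OldImage;
    if (!image) continue;

    const like = DynamoDB.Converter.unmarshall(image);
    const delta =
      record.eventName === "INSERT" ? 1 :
      record.eventName === "REMOVE" ? -1 : 0; // MODIFY doesn't change the count

    if (delta === 0) continue;

    await ddb.update({
      TableName: POST_TABLE,
      Key: { id: like.postID },
      // ADD is atomic and creates the attribute if it doesn't exist yet
      UpdateExpression: "ADD likesCount :delta",
      ExpressionAttributeValues: { ":delta": delta },
    }).promise();
  }
};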

andre-scheffer commented 3 years ago

Does anyone on the AWS team even read this? So far I am missing an official statement on this topic. We have currently spent a lot of money on development and we are still faced with such a trivial problem.

I think the Amplify team only has two options. Either solve the problem or find an alternative to DynamoDB. It's not ok that nothing has happened on this subject for 13 months. Not even a responsible person has been assigned to this issue.

@Amplify Team, @aws-amplify-ops Please give us a statement.

undefobj commented 3 years ago

@andre-scheffer (and others) this is good feedback, and myself and @kaustavghosh06 have responded on this thread several times. We are both members of the Amplify team. The problem is non-trivial, as others have noted in their descriptions above of potential implementations, all of which we are looking into in the future, as was noted earlier. As such, we'll keep this thread open to continue to measure interest as well as potential thoughts on solutions such as those outlined by @PeteDuncanson, @duwerq, and @RossWilliams. Further open dialogue here on requirements, DX options, and suggestions for implementation is welcome.

andre-scheffer commented 3 years ago

@undefobj, thank you for your answer.

The problem is non-trivial

I definitely believe you, as far as your part is concerned.

As such we'll keep this thread open to continue to measure interest as well as potential thoughts on solutions such as those outlined by @PeteDuncanson, @duwerq , and @RossWilliams.

  1. What time frame do you set yourselves to collect enough feedback? Don't you think 16 months is enough?
  2. For me Amplify is a black box. I just want to use Amplify as a customer of AWS, and I'm not interested in the solutions inside the black box, even if they are technically sophisticated, because they are not in my business focus. Anyway, counting, and also sorting and other stuff like dynamic filtering, is just a regular requirement for a modern application. Don't you agree?

and suggestions for implementation are welcome.

Did I get something wrong, and Amplify is an open source project? We as paying customers of AWS should only have to supply the requirements. Sorry if that sounds arrogant, but this is an Amazon project and not some little community project.

So I have to ask again. What is your timeline on this issue? What is the Roadmap? Why is nobody assigned to this issue in GitHub?

PeteDuncanson commented 3 years ago

Did I get something wrong, and Amplify is an open source project? We as paying customers of AWS should only have to supply the requirements. Sorry if that sounds arrogant, but this is an Amazon project and not some little community project.

It is open source, but it is also quite a small team. Each team in Amazon is different and runs their own way. Let's be grateful we have a team on this who are engaging and open to ideas, and not tar them with the "big evil rich company" brush.

Yes, at the end of the day this tech exists to get us to sign up and spend money with them, but then so do the apps we build with it. In the meantime they are trying to build the best stack they can, though there are pressures on doing that which no doubt need some consideration.

For my part I'm slowly getting up to speed on the problem and seeing if I can't come up with a @runningCount directive of my own; it's a steep learning curve though. Hoping to get this up and running so I can dog-food it a bit and then offer it up here for discussion.

andre-scheffer commented 3 years ago

It is open source but it is also quite a small team

But someone at AWS needs to be responsible for this project? This someone needs to address issues like this one. I don't understand why such fundamental problems have existed for 16 months. It's not good to see such a long feedback loop without a final result like 'Yes, we will fix that' or 'No, we won't implement it'.

open to ideas and not tar them with the "big evil rich company" brush

I fully agree, but it's still AWS, with tons of resources. But yeah, I'll stop complaining about it and try to find another solution, like using an extra mirror database to do the math.

Hoping to get this up and running so I can dog food it a bit then offer it up here for discussion.

That would be awesome.

PeteDuncanson commented 3 years ago

Found aws-amplify/amplify-cli#2901 referenced in the docs (https://docs.amplify.aws/lib/graphqlapi/query-data/q/platform/js#frequently-asked-questions) so thought I'd link them together for anyone following the breadcrumbs

duwerq commented 3 years ago

The thing I don't understand here is: if you don't like DynamoDB or its lack of counts, you have the option to use another database. Nader is about to put out a video on using a Postgres DB. I highly recommend it for anyone not interested in DynamoDB: https://twitter.com/dabit3/status/1328710735619268608?s=21

simpson commented 3 years ago

+1 @duwerq, there are definitely other things i'd rather have prioritized on their roadmap over this. There are plenty of other good db options.

snahammond commented 3 years ago

@RossWilliams could you give an example of how to do a table stream lambda to implement the count. I am very new to amplify but really need this feature. Thanks

RossWilliams commented 3 years ago

@snahammond Here are the steps. Send me an email, LAST_NAME.FIRST_NAME@gmail.com if you get stuck.

At a high level, you are creating a lambda, connecting a lambda to the table change stream, adding permissions to allow the lambda to connect to the stream, and writing a lambda to handle create, update, and delete events. You are also going to create environment variables so you have the table names you need inside the lambda.

  1. Create a new lambda function. You can use the CLI to create it. Have the function depend on your graphQL API. This means your backend-config.json for this lambda should have in its "dependsOn"

    {
      "category": "api",
      "resourceName": "XXX_APP_NAME",
      "attributes": ["GraphQLAPIIdOutput", "GraphQLAPIEndpointOutput"]
    },
  2. Open the function's audit-cloudformation-template.json file, add an item to "Resources" to attach the lambda to the Dynamo Table:

    "DynamoTableXXX_TABLE_NAMEStream": {
    "Type": "AWS::Lambda::EventSourceMapping",
    "Properties": {
        "BatchSize": 20,
        "MaximumBatchingWindowInSeconds": 100,
        "BisectBatchOnFunctionError": true,
        "StartingPosition": "LATEST",
        "Enabled": true,
        "EventSourceArn": {
            "Fn::ImportValue": {
                "Fn::Join": [
                    ":",
                    [
                        {
                            "Ref": "apiXXX_APP_NAMEGraphQLAPIIdOutput"
                        },
                        "GetAtt",
                        "XXX_TABLENAME",
                        "StreamArn"
                    ]
                ]
            }
        },
        "FunctionName": {
            "Fn::GetAtt": [
                "LambdaFunction",
                "Arn"
            ]
        }
    },
    "DependsOn": [
        "LambdaExecutionRole"
    ]
    },
  3. Your lambda function policy needs the following statement for permissions. you will need to add permissions for any additional operations you use. Keep the existing permissions for log groups that Amplify generates:

    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeStream",
        "dynamodb:GetRecords",
        "dynamodb:GetShardIterator",
        "dynamodb:ListStreams"
      ],
      "Resource": {
        "Fn::Sub": [
          "arn:aws:dynamodb:${region}:${account}:table/XXX_TABLENAME-${apiId}",
          {
            "region": { "Ref": "AWS::Region" },
            "account": { "Ref": "AWS::AccountId" },
            "apiId": { "Ref": "apiXXX_APP_NAMEGraphQLAPIIdOutput" }
          }
        ]
      }
    }

    You can adjust the batch size and batch window I provided above.

  4. Add an environment variable for any tables you need to have access to inside the Lambda:

    "LambdaFunction": {
    "Type": "AWS::Lambda::Function",
    "OtherItems": "existing items here",
    "Properties": {
       "OtherItems": "existing items here",
       "Environment": {
          "Variables": {
          "ENV": {
              "Ref": "env"
          },
          "REGION": {
              "Ref": "AWS::Region"
          },
         "XXX_DYNAMO_TABLE_FOR_UPDATES": {
        "Fn::Join": [
            "-",
            [
                "XXX_DYNAMO_TABLE_NAME",
                {
                    "Ref": "apiXXX_APP_IDGraphQLAPIIdOutput"
                },
                {
                    "Ref": "env"
                }
            ]
        ]
    },
    }
  5. Now you have a Lambda that can connect to the table stream. Here is a very minimal Lambda function (forgive any mistakes, I'm writing in GitHub) to read the stream information and update an item (this requires adding "dynamodb:UpdateItem" to the permissions above):

    
import { DynamoDBStreamHandler, DynamoDBStreamEvent } from "aws-lambda";
import DynamoDB, { DocumentClient } from "aws-sdk/clients/dynamodb";

const documentClient = new DocumentClient();
const TABLE_NAME = process.env.XXX_DYNAMO_TABLE_FOR_UPDATES;

export const handler: DynamoDBStreamHandler = async (event: DynamoDBStreamEvent) => {
  for (const record of event.Records) {
    if (record.dynamodb.NewImage && !record.dynamodb.OldImage) {
      // new image only, meaning a record has been created
      const newItem = DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
      if (newItem.XXX_INTERSTING_FIELD) {
        await documentClient.update({
          TableName: TABLE_NAME,
          Key: {
            table_pk_attribute: XXX_A_PARTITION_KEY,
            table_sk_attribute_if_exists: XXX_AN_OPTIONAL_SORT_KEY,
          },
          UpdateExpression: "SET #xxx_my_counter_item = :xxx_my_counter_increase",
          ExpressionAttributeNames: { "#xxx_my_counter_item": "xxx_my_counter_item" },
          ExpressionAttributeValues: { ":xxx_my_counter_increase": newItem.XXX_INTERSTING_FIELD },
        }).promise(); // TODO: handle failures
      }
    } else if (record.dynamodb.NewImage && record.dynamodb.OldImage) {
      // old and new image, meaning the record has been updated; check if you need to adjust a count
    } else {
      // no new image, meaning the record has been deleted; you may need to decrement a count
    }
  }
};

nubpro commented 3 years ago

My use case for this is to show the total number of records on that particular day.


My current way of dealing with the limitation is to show a + icon beside the number if nextToken exists.

loganpowell commented 3 years ago

Just adding another voice of support for this. Would love a way to reduce my Appsync bill 🗡️ ❤️

EDIT: It looks like someone has found an override that might be easy to implement:

https://www.riccardosirigu.com/blog/aws-amplify-dynamodb-total-count-graphql-queries/

Can someone from the Amplify team confirm this?

PeteDuncanson commented 3 years ago

Just wanted to share an update on another method we've found of late. As usual it might not be suitable for everyone, depending on how many records you have, but if you enable DataStore (which by default can store/handle 10k records locally) then you can get the count of various queries, totals etc. by checking the number of results returned when you query it:

// Gets ALL Todos that we have stored in DataStore
const allTodos = await DataStore.query(Todo);

// Get all active Todos
const allActiveTodos = (await DataStore.query(Todo)).filter(todo => todo.status === "active");

console.info("You've got " + allActiveTodos.length + " active todos out of " + allTodos.length + " todos total");
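
As an aside, DataStore can also apply the predicate for you rather than filtering in JavaScript; a small sketch using the older predicate syntax (assuming the generated Todo model has a status field):

import { DataStore } from "aws-amplify";
import { Todo } from "./models"; // generated by `amplify codegen models`

// Same count, but letting DataStore evaluate the predicate
const activeTodos = await DataStore.query(Todo, t => t.status("eq", "active"));
console.info(`You've got ${activeTodos.length} active todos`);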

It might be a bit wasteful, and it might not be enough for what you want to do, but for us it has made this problem go away for now, and we can rework it when we have more time or if it becomes a problem. For now though it's workable! Hope that helps someone.

You can enable DataStore via the CLI (amplify update api), but be careful if you have existing data in there; be aware of the need to back-fill some of the new fields it adds: aws-amplify/amplify-category-api#227

See https://docs.amplify.aws/lib/datastore/data-access/q/platform/js#query-data for more details.

xitanggg commented 3 years ago

Has anyone used TransactWriteItems to keep track of counts? Say your app lets people create blog posts and you want to keep track of posts per user. You create a custom resolver that uses TransactWriteItems to 1. create a blog post and 2. add 1 to the user's total post count atomically; it returns success only if both actions go through, thanks to the transaction's all-or-nothing design. This seems like a better alternative to me than aggregating counts off a DynamoDB stream.
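
For illustration, the same idea expressed as a Lambda-backed mutation rather than a VTL resolver, using the aws-sdk v2 DocumentClient's transactWrite; the table names and the totalPosts attribute are assumptions:

import { DocumentClient } from "aws-sdk/clients/dynamodb";

const ddb = new DocumentClient();
const POST_TABLE = process.env.POST_TABLE!;   // hypothetical table names
const USER_TABLE = process.env.USER_TABLE!;

// Create a post and bump the author's post count in one all-or-nothing transaction.
async function createPostWithCount(post: { id: string; authorID: string; title: string }) {
  await ddb.transactWrite({
    TransactItems: [
      { Put: { TableName: POST_TABLE, Item: post } },
      {
        Update: {
          TableName: USER_TABLE,
          Key: { id: post.authorID },
          UpdateExpression: "ADD totalPosts :one",
          ExpressionAttributeValues: { ":one": 1 },
        },
      },
    ],
  }).promise();
}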

chrisyeung1121 commented 3 years ago

Has anyone used TransactWriteItems to keep track of counts? Say your app lets people create blog posts and you want to keep track of posts per user. You create a custom resolver that uses TransactWriteItems to 1. create a blog post and 2. add 1 to the user's total post count atomically; it returns success only if both actions go through, thanks to the transaction's all-or-nothing design. This seems like a better alternative to me than aggregating counts off a DynamoDB stream.

Hi @xitanggg - I am actually implementing it the same way too. Some issues I encountered are:

  1. I have to hard-code the TableName and cannot reference it from elsewhere, so if I use Amplify and have multiple environments, it becomes very troublesome.
  2. The documentation for resolver templates is very limited. I basically have to infer from various places here and there what functions there are.
  3. It does not solve the issue of getting the count for a filtered set of results.

Anyway, a snapshot of what I did in the request template is:

{
  "version": "2018-05-29",
  "operation": "TransactWriteItems",
  "transactItems": [
    {
      "table": "Todo-jwr---ygm4-dev",
      "operation": "UpdateItem",
      "key": {
        "id": {
          "S": $util.toJson($context.args.input.todoID)
        }
      },
      "update": {
        "expression": ""
      }
    },
    {
      "table": "Comment-jwr---clygm4-dev",
      "operation": "PutItem",
      "key": #if( $modelObjectKey ) $util.toJson($modelObjectKey) #else { "id": $util.dynamodb.toDynamoDBJson($ctx.args.input.id) } #end,
      "attributeValues": $util.dynamodb.toMapValuesJson($context.args.input),
      "condition": $util.toJson($condition)
    }
  ]
}

Do you have any suggested solutions?

Edit 2021-01-18 I just checked with the amplify-cli, and it seems TransactWriteItems is not supported in cli yet....

xitanggg commented 3 years ago

I implemented TransactWriteItems and it works very well for my purpose. Under CustomResource.json, I basically created a customCreateBlogResolver that is attached to a CustomDataSource data source, which has a CustomDataSourceIAMRole that gives it access to the different tables it needs.

In regards to your questions,

  1. I do have to quote the table name and env name in the resolver templates. I just set them up as variables instead of hardcoding everything, as shown below. (I believe there is still no way for AppSync to get the env name natively; it is like a half-baked product.) To automate it for multiple environments (I haven't implemented this yet), I think it is a simple bash script to grab and insert those values from one of the Amplify JSON files. (There are similar discussions for folks implementing batch operations.)
    
    #set($AppSyncApiId="xxxxx")
    #set($currentEnv="dev")

    {
      ...
      "transactItems": [
        {
          "table": "Todo-${AppSyncApiId}-${currentEnv}",
          ...
        },
        {
          "table": "Comment-${AppSyncApiId}-${currentEnv}",
          ...
        }
      ]
    }



  2. The documentation has a lot floating around, but it is actually decent; you just have to dig for it. [This is the doc I found particularly helpful.](https://docs.aws.amazon.com/appsync/latest/devguide/resolver-mapping-template-reference-dynamodb.html)

  3. Yeah, I don't need to get a filtered set of results, so I don't have much to share there. The thing about designing for DynamoDB is that you are designing for the common access patterns, due to the key-value nature of NoSQL. You need to spend some time thinking about them before designing it. It is not as flexible to query as SQL.

  4. When you say `cli`, do you mean `mock api`? If so, yes, `transact` is not supported in the simulator, and neither is `batch`. It is no big deal though: just push it to the cloud and you can test it in the AppSync console. You can edit the resolver templates in the console as well, so you don't have to make changes locally and push to the cloud (just beware that sometimes there is latency before the template gets updated once you click save). Also, debugging VTL can be painful because there is no `console.log`. I sometimes `$util.toJson(xxx)` the result and set it on a GraphQL return field to see it, instead of using CloudWatch. Hope this helps.

chrisyeung1121 commented 3 years ago

@xitanggg - Have you taken a look at amplify/backend/tags.json? I am wondering if that would serve as a place where you can declare some variables. I am still not familiar with the whole CloudFormation setup, but would it be a place to put ${currentEnv} and ${AppSyncApiId} for you?

xitanggg commented 3 years ago

As mentioned, I put it in the resolver template, i.e. the .vtl file (as shown below; I also updated the code in my comment above to help illustrate). I haven't looked into tags.json, but it wouldn't be the place. Think of a resolver template like a Lambda function, except it is written in VTL. It is a piece of a function that doesn't have the concept of variables outside of its scope. And unlike Lambda functions, I don't think AppSync supports inserting environment variables through CloudFormation yet.

.
└── amplify/
    └── backend/
        └── api/
            └── xxxGraphQL/
                └── resolvers/
                    └── Mutation.customCreateBlog.req.vtl (This file)

Similar to other languages, VTL can also do an import or include, e.g. #include( "myVariables.txt" ), so potentially there could be a centralized file to store variables that other .vtl files just include. Yet unfortunately, AppSync doesn't seem to support it at the moment: https://github.com/aws/aws-appsync-community/issues/101. That leaves the option of declaring those variables in every .vtl file you need.

pagameba commented 3 years ago

I'd like to chime in with support for having some kind of support for this in Amplify. In my particular use case, I have a multi-tenant setup where I need to keep track of counts of variable numbers of things per tenant. My current solution took quite a bit of research to arrive at, here is what I am doing:

Stats Table - this is a dynamodb table for just storing accumulated statistics. There is one entry per tenant with an id that can be used to retrieve that record. New entries are created when new tenants are created and all the stats are initialized to 0.

I added a custom mutation for this table (I found this somewhere, didn't create it, and this was the hard part for me) with the following request template, which atomically increments or decrements one of the fields in the stats table for a specific record:

#set( $expNames = {} )
#set( $expValues = {} )
#set( $expSet = {} )
#set( $expAdd = {} )
#set( $expRemove = [] )
#set( $expNumbers = [] )
#if( $modelObjectKey )
  #set( $keyFields = [] )
  #foreach( $entry in $modelObjectKey.entrySet() )
    $util.qr($keyFields.add("$entry.key"))
  #end
#else
  #set( $keyFields = ["id"] )
#end
#foreach( $entry in $util.map.copyAndRemoveAllKeys($context.args.input, $keyFields).entrySet() )
  #if( !$util.isNull($dynamodbNameOverrideMap) && $dynamodbNameOverrideMap.containsKey("$entry.key") )
    #set( $entryKeyAttributeName = $dynamodbNameOverrideMap.get("$entry.key") )
  #else
    #set( $entryKeyAttributeName = $entry.key )
  #end
  #if( $util.isNull($entry.value) )
    #set( $discard = $expRemove.add("#$entryKeyAttributeName") )
    $util.qr($expNames.put("#$entryKeyAttributeName", "$entry.key"))
  #else
    $util.qr($expSet.put("#$entryKeyAttributeName", ":$entryKeyAttributeName"))
    $util.qr($expNames.put("#$entryKeyAttributeName", "$entry.key"))
    #if ( $util.isNumber($entry.value) )
      $util.qr($expNumbers.add("#$entryKeyAttributeName"))
    #end
    $util.qr($expValues.put(":$entryKeyAttributeName", $util.dynamodb.toDynamoDB($entry.value)))
  #end
#end
#set( $expression = "" )
#if( !$expSet.isEmpty() )
  #set( $expression = "SET" )
  #foreach( $entry in $expSet.entrySet() )
    #if ( $expNumbers.contains($entry.key) )
      #set( $expression = "$expression $entry.key = $entry.key + $entry.value" )
    #else
      #set( $expression = "$expression $entry.key = $entry.value" )
    #end
    #if( $foreach.hasNext() )
      #set( $expression = "$expression," )
    #end
  #end
#end
#if( !$expAdd.isEmpty() )
  #set( $expression = "$expression ADD" )
  #foreach( $entry in $expAdd.entrySet() )
    #set( $expression = "$expression $entry.key $entry.value" )
    #if( $foreach.hasNext() )
      #set( $expression = "$expression," )
    #end
  #end
#end
#if( !$expRemove.isEmpty() )
  #set( $expression = "$expression REMOVE" )
  #foreach( $entry in $expRemove )
    #set( $expression = "$expression $entry" )
    #if( $foreach.hasNext() )
      #set( $expression = "$expression," )
    #end
  #end
#end
#set( $update = {} )
$util.qr($update.put("expression", "$expression"))
#if( !$expNames.isEmpty() )
  $util.qr($update.put("expressionNames", $expNames))
#end
#if( !$expValues.isEmpty() )
  $util.qr($update.put("expressionValues", $expValues))
#end
{
  "version": "2017-02-28",
  "operation": "UpdateItem",
  "key": #if( $modelObjectKey ) $util.toJson($modelObjectKey) #else {
  "id": {
      "S": $util.toJson($context.args.input.id)
  }
} #end,
  "update": $util.toJson($update),
  "condition": $util.toJson($condition)
}

I then added a new function triggered by DynamoDB streams which checks for INSERT/MODIFY, and set it up to watch the tables I am interested in tracking stats on. If in these tables there is a change in the field I am interested in (I have a status field that changes, and I want to track how many items are in each status for each tenant), then I invoke the GraphQL mutation to update the statistics, passing the field I want to update and the amount to update it by (usually 1 or -1).
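
For anyone wanting to copy this pattern, here is a hedged sketch of the "invoke the GraphQL mutation from the stream Lambda" step. It assumes an API-key-authorised AppSync endpoint, a Node 18+ runtime (global fetch), and a hypothetical updateStats mutation; in production you would more likely sign the request with IAM:

// GRAPHQL_ENDPOINT and GRAPHQL_API_KEY are hypothetical environment variables.
const UPDATE_STATS = /* GraphQL */ `
  mutation UpdateStats($input: UpdateStatsInput!) {
    updateStats(input: $input) { id }
  }
`;

export async function bumpStat(tenantId: string, field: string, delta: number): Promise<void> {
  const res = await fetch(process.env.GRAPHQL_ENDPOINT!, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": process.env.GRAPHQL_API_KEY!,
    },
    body: JSON.stringify({
      query: UPDATE_STATS,
      variables: { input: { id: tenantId, [field]: delta } },
    }),
  });
  const { errors } = (await res.json()) as { errors?: unknown };
  if (errors) throw new Error(JSON.stringify(errors));
}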

A side effect of this is that I can use a subscription on the stats table with a filter to the current tenant to get live updates for the counts of things I'm interested in.

My current limitation and problem is that I need to know ahead of time what stats I have to track, so the schema knows about the columns to track. The new requirement for the platform is to have a tenant-configurable set of things that need tracking. I have no idea what I can do with VTL in this regard; I'm hoping I can store an AWSJSON field, parse it, update a key inside it, then serialize it back, all atomically so I don't miss any updates, although I am not holding my breath on this :)
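
One possible way around that, if the stats can live in a DynamoDB map attribute rather than an AWSJSON string: a single UpdateItem can bump one key inside the map atomically, so there is no read-modify-write race. A sketch with the aws-sdk v2 DocumentClient (the table and attribute names are assumptions, and it relies on the stats map being initialised to an empty map when the tenant record is created):

import { DocumentClient } from "aws-sdk/clients/dynamodb";

const ddb = new DocumentClient();

async function bumpTenantStat(tenantId: string, statKey: string, delta: number) {
  await ddb.update({
    TableName: process.env.STATS_TABLE!,
    Key: { id: tenantId },
    // Document path into the "stats" map; if_not_exists keeps the first increment from failing
    UpdateExpression: "SET #stats.#k = if_not_exists(#stats.#k, :zero) + :delta",
    ExpressionAttributeNames: { "#stats": "stats", "#k": statKey },
    ExpressionAttributeValues: { ":zero": 0, ":delta": delta },
  }).promise();
}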

Any feedback on this approach would be appreciated.

I understand how complicated an integrated solution would be, I'm wondering if documented patterns would be enough to help people solve specific use cases.

GeorgeBellTMH commented 3 years ago

Why doesn't someone just make a VTL template that runs the list query to completion (including however many pages there are and whatever filters there are), and then have it just return a number? Sure, you have to scan the whole table, and it might be slow or costly etc., but add that information to the docs and let people have a way to get the total number of rows without having to return all the data. On the front end, use a standard naming scheme and let it take the same params as the query:

ListXxxxxCount ListXxxxxQuery

SalahAdDin commented 3 years ago

What about using aggregation queries?

sacrampton commented 3 years ago

Contributing this in case it's helpful to anyone. I implemented totals for my solution some time back. I did it with ElasticSearch and it was quite easy. Early on I discovered that there are a lot of things that are very difficult with DynamoDB alone, but with ElasticSearch you can get to where you need to be without too much drama. I've created highly customized ElasticSearch resolvers that give me powerful searches, so I'd recommend ElasticSearch be part of a standard landscape.

To start with, you need to modify the response like the example below. My response also includes geodistance (https://docs.amplify.aws/cli/graphql-transformer/resolvers#add-a-custom-geolocation-search-resolver-that-targets-an-elasticsearch-domain-created-by-searchable) to return geospatial queries. I also modified it to include _version when you are using conflict detection for DataStore.

#set( $items = [] )
#set( $nextToken = [] )
#set( $geodistance = [] )
#foreach( $entry in $context.result.hits.hits )
  #if( !$foreach.hasNext )
    $util.qr($nextToken.add($entry.get("sort")))
  #end
  #set( $row = $entry.get("_source") )
  $util.qr($row.put("_version", $entry.get("_version")))
  $util.qr($items.add($row))
  $util.qr($geodistance.add($entry.get("sort")))
#end
$util.toJson({
  "items": $items,
  "total": $ctx.result.hits.total,
  "nextToken": $nextToken,
  "geodistance": $geodistance
})

Then for the searchable connection you need to modify the expected response to include total and geodistance

type Searchable<Model>Connection {
  items: [<Model>]
  nextToken: String
  geodistance: [String]
  total: Int
}

Then the GraphQL query is simply

query getTotal{
  search<Model>s(filter:{
    shortDescription:{eq: "MOTOR"}
  })
  {total}
}

Which returns the following response

{
  "data": {
    "search<Model>s": {
      "total": 39
    }
  }
}

houmark commented 3 years ago

@sacrampton good feedback, but your implementation seems to suffer from the same issue every other similar approach has: if you return 39 results from ES but the user doesn't have access to some of them due to access configurations in AppSync / DynamoDB, then your count will be wrong.

sacrampton commented 3 years ago

Hi @houmark - for the standard AppSync resolvers that is correct. However, in my customized resolvers I've implemented the security access in the request resolver to ensure that the query only touches the records you have access to and thus returns the correct number of records.

My application requires dynamic authorization. Each record has a group field and each user has access to certain groups. However, dynamic @auth seems to create all sorts of issues and is not supported in a lot of instances (subscriptions, etc.), so I implement that in the request resolver to build my ElasticSearch query to only include the groups we should have access to.

This is a different approach to the DynamoDB queries - there the security filtering happens in the response - here I do it in the request.

ataibarkai commented 3 years ago

If I may add my 2 cents to this discussion:

There's an argument to be made that the difficulty in retrieving DynamoDB record counts should be viewed primarily as an inherent/essential component of the design problem, rather than as an incidental/accidental component that can be 'hacked' away or abstracted away (incidental = 'programmer's fault', essential = 'not programmer's fault').

The NoSQL database paradigm (of which DynamoDB is an instance) makes some things (like scaling horizontally) very easy, and some things (like counting/querying) more difficult.

Sure, from the perspective of the DynamoDB team, you could argue this is a problem worth attacking since it is very common and it does have viable (yet non-trivial) technical solutions (e.g. using hybrid NoSQL/relational approaches) -- but certainly from the perspective of Amplify users, for the foreseeable future this is just an inherent part of the design problem that must be contended with and solved for.

Luckily, there are solutions to many common instances of the counting problem in the NoSQL world, but they have to be explicitly designed for (e.g. counting likes/comments by incrementing/decrementing a counter on entry insertion/deletion, as has been suggested elsewhere in this thread).

I think what is fair to ask of the Amplify team -- especially since this functionality really is very basic in many real-world contexts -- is to make solving this problem as easy as possible given the fundamental design constraints, and remove those obstacles which are incidental.

For starters, the problem and its common solutions should be documented in the official Amplify docs. It's too important a problem to relegate to an 80-comment discussion thread on a 2-year-old GitHub issue.

But more importantly, amplify should make implementing the viable solutions to this problem a lot easier.

For the most part, there are 2 bottom-up solutions to this problem, both already referenced in this thread:

  1. If an approximate count is all that is required, incrementing relevant counters on creation/deletion/update is all that is needed. This can only give an approximate count because the 2 DB operations are decoupled and the counter update can fail. I just opened another issue (aws-amplify/amplify-category-api#184) since doing this is very difficult if you want to use a non-JS lambda function for this purpose.
  2. If a precise count is required, then the best way to go is a dedicated mutation which uses DynamoDB transactions to create entries and update the relevant counters in unison (i.e. both operations succeed or fail together), as was already suggested above (see the sketch after this list). Amplify could make this flow significantly easier to implement.
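
To make option 2 concrete, here is a minimal sketch of a Lambda-backed mutation that creates an item and bumps a counter in the same call using DynamoDB transactions (Node.js, aws-sdk v2). The table name, key shape, and the "PostStats" counter id are assumptions for illustration, not part of any existing Amplify feature:

// Hypothetical sketch: create a Post and increment a counter atomically.
// TABLE_NAME, the key shape, and the "PostStats" id are assumptions.
const AWS = require("aws-sdk");
const db = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  const post = { id: event.arguments.id, __typename: "Post", title: event.arguments.title };

  await db.transactWrite({
    TransactItems: [
      { Put: { TableName: process.env.TABLE_NAME, Item: post } },
      {
        Update: {
          TableName: process.env.TABLE_NAME,
          Key: { id: "PostStats" },
          UpdateExpression: "ADD #total :one",
          ExpressionAttributeNames: { "#total": "totalPosts" },
          ExpressionAttributeValues: { ":one": 1 },
        },
      },
    ],
  }).promise();

  return post;
};

Because both writes sit in one transaction they succeed or fail together, so the counter cannot drift from the real number of items the way a decoupled counter update can.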

As a bonus, these 2 capabilities would carry over to lots of other contexts that have nothing to do with counting.

An official short-term plan on this from the Amplify team would be much appreciated, given that the problem is crucial to many folks using Amplify for real-world (rather than demo-app) purposes, and that the underlying complexities are not likely to change any time soon.


And while I'm at this, to add my own ad-hoc documentation to this issue, if you would like to include the relevant counters as parameters on your standard @model GraphQL types (rather than in some dedicated stats table), here's the approach I have arrived at:

It uses two third-party Amplify/AppSync directives:

type Post
  @model(subscriptions: null)
  # ...
{

  # ...
  likes: [PostLike!]! @connection(keyName: "byPostLiked", fields: ["id"])
  likesCount: Int! @auto @default(value: "0")
    @auth(rules: [
       # include rules to make sure this parameter cannot be changed by any public API
    ])
}
hackrx commented 3 years ago

We want this feature - any update on this from the AWS Amplify team? Thanks.

duwerq commented 3 years ago

This is almost exactly like @RossWilliams' example, but I pulled it from a project I implemented it on too and thought I'd share. For anyone looking to add an increment/decrement count off a DynamoDB stream: first, create a lambda with the CLI and name it something like updateCounts. Then drop this into the lambda:

const aws = require("aws-sdk"); // provides both the DocumentClient and the stream record Converter used below
const db = new aws.DynamoDB.DocumentClient();

const updateCounts = async ({ increase }) => {
  const updateParams = {
    TableName: process.env.TABLENAME,
    Key: {
      id: "PK of counts record",
    },
    UpdateExpression: `SET #counts = if_not_exists(#counts, :defaultCount) ${
      increase ? "+" : "-"
    } :inc`,
    ExpressionAttributeNames: {
      "#counts": "counts",
    },
    ExpressionAttributeValues: {
      ":inc": 1,
      ":defaultCount": 0,
    },
  };
  try {
    const Items = await db.update(updateParams).promise();
  } catch (error) {
     console.log("do some error handling")
 }
};

exports.handler = async (event, context, callback) => {
  console.log("event record", JSON.stringify({ event }));
  await Promise.all(
    event.Records.map(async (record) => {
      if (record.eventName === "INSERT") {
        const dbRecord = aws.DynamoDB.Converter.unmarshall(
          record.dynamodb.NewImage
        );
        if (dbRecord.__typename === "typeName of record that should update count record") {
          await updateCounts({ increase: true });
        }
      }
      if (record.eventName === "REMOVE") {
        const dbRecord = aws.DynamoDB.Converter.unmarshall(
          record.dynamodb.OldImage
        );
        if (dbRecord.__typename === "typeName of record that should update count record") {
          await updateCounts({ increase: false });
        }
      }
    })
  );
};

Then go to amplify/backend/function/updateCounts/updateCounts-cloudformation-template.json and add the following:

"Resources": {
....
 "DynamoDBStreamAccess": {
      "DependsOn": [
        "LambdaExecutionRole"
      ],
      "Type": "AWS::IAM::Policy",
      "Properties": {
        "PolicyName": "Lambda-DynamoDBStreamAccess-Policy",
        "Roles": [
          {
            "Ref": "LambdaExecutionRole"
          }
        ],
        "PolicyDocument": {
          "Version": "2012-10-17",
          "Statement": [
            {
              "Effect": "Allow",
              "Action": [
                "dynamodb:DescribeStream",
                "dynamodb:GetRecords",
                "dynamodb:GetShardIterator",
                "dynamodb:ListStreams"
              ],
              "Resource": [
                "arn:aws:dynamodb:<region>:<your-account-id>:table/<TableName>/stream/2020-10-18T22:08:08.367" // Insert your DynamoDB stream ARN (can be found in the DynamoDB console)
              ]
            }
          ]
        }
      }
    },
 "DynamodbstreamEventSourceMappingCounts": {
      "Type": "AWS::Lambda::EventSourceMapping",
      "DependsOn": "DynamoDBStreamAccess",
      "Properties": {
        "BatchSize": 10,
        "EventSourceArn": "arn:aws:dynamodb:<region>:<your-account-id>:table/<TableName>/stream/2020-10-18T22:08:08.367", // Insert your dynamodb stream arn (same as above)
        "FunctionName": {
          "Fn::GetAtt": [
            "LambdaFunction",
            "Arn"
          ]
        },
        "StartingPosition": "TRIM_HORIZON",
        "Enabled": "True"
      }
    },
 }

Run amplify push and you should be good to go. You can test by adding a record in DynamoDB in the console. It will trigger the lambda; from there you can console.log inside the lambda and check CloudWatch for any values you need.

On a side note, make sure you've created the counts record in DDB so you can reference it in the updateCounts function
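
For reference, seeding that counts record could look something like the snippet below (a one-off sketch; it assumes the same TABLENAME environment variable and key used in the lambda above):

// One-off seed sketch - TABLENAME and the key are assumptions matching the lambda above
const aws = require("aws-sdk");
const db = new aws.DynamoDB.DocumentClient();

db.put({
  TableName: process.env.TABLENAME,
  Item: { id: "PK of counts record", counts: 0 },
  ConditionExpression: "attribute_not_exists(id)", // don't clobber an existing counter
})
  .promise()
  .then(() => console.log("counts record created"))
  .catch((err) => console.error(err));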

zkwentz commented 3 years ago

Careful with lambdas that update the count. I currently do this, but you run into problems at scale. Specifically, these counting lambdas are not idempotent: every time they run they increase or decrease the count (unless your logic guards against it).

The problem is that lambdas are not guaranteed to run exactly once when a record is updated. If a table has a bunch of updates on the same or different records, you could get multiple batches of events sent to the lambda containing the same DynamoDB change - meaning you could increment the count too many times.

My approach to deal with this has been to add a LambdaLocks table that I check to see if I have seen an event record with this ID before. If I have, I stop work on that record; otherwise, I write the function name and record ID to LambdaLocks and then do my counting.
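
A minimal sketch of that lock, assuming a LambdaLocks table keyed on id: write the stream record's eventID with a conditional put, and treat a conditional-check failure as "already processed":

// Hypothetical sketch of the LambdaLocks check - the table name and key shape are assumptions
const aws = require("aws-sdk");
const db = new aws.DynamoDB.DocumentClient();

const claimRecord = async (record) => {
  try {
    await db.put({
      TableName: "LambdaLocks",
      Item: { id: `updateCounts#${record.eventID}`, createdAt: Date.now() },
      ConditionExpression: "attribute_not_exists(id)", // fails if this event was seen before
    }).promise();
    return true; // first delivery of this event - safe to count it
  } catch (err) {
    if (err.code === "ConditionalCheckFailedException") return false; // duplicate delivery - skip
    throw err;
  }
};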

duwerq commented 3 years ago

@zkwentz but if you rely on INSERT or REMOVE and not MODIFY, this shouldn't be a problem, correct?

zkwentz commented 3 years ago

@duwerq unfortunately no, even inserts and removes can be sent multiple times in my experience. I think it has something to do with event batching when a table is under heavy load? I'd love to hear from the Amplify / DynamoDB team whether that is the case.

duwerq commented 3 years ago

from the Docs:

"DynamoDB Streams helps ensure the following:

Each stream record appears exactly once in the stream.

For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item. "

More here https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html

I can't speak to your experience but it sounds like that shouldn't be possible

zkwentz commented 3 years ago

I can't recall specifically why this was happening, but my understanding was that it was occurring on insert and remove.

With no ability to reproduce the load that caused it now, I'll hold off on pushing this point further and defer to the docs, adding only my own note: make sure you don't have strict idempotency requirements when counting in a lambda.

RossWilliams commented 3 years ago

The Lambda function reading from the stream needs to send a response back to the stream to acknowledge the batch was processed. If the function returns an error, the items in the batch will be processed a second time (and potentially by a new instance of the function). See this guide from AWS: https://aws.amazon.com/blogs/compute/optimizing-batch-processing-with-custom-checkpoints-in-aws-lambda/

  1. Minimise your chance of throwing.
  2. Tell the stream which item failed in a batch (see the sketch after this list).
  3. Send failures to a DLQ. Manually re-drive the items when you get alerted to items in the queue
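
For point 2, Lambda's DynamoDB stream integration supports partial batch responses: set FunctionResponseTypes to ["ReportBatchItemFailures"] on the event source mapping and return the sequence numbers of the records that failed, so only those get retried. A rough sketch (processRecord stands in for the counting logic above):

// Sketch: report only the failed items back to the stream.
// Requires "FunctionResponseTypes": ["ReportBatchItemFailures"] on the event source mapping.
exports.handler = async (event) => {
  const batchItemFailures = [];

  for (const record of event.Records) {
    try {
      await processRecord(record); // your counting logic (assumed)
    } catch (err) {
      console.error("failed record", record.dynamodb.SequenceNumber, err);
      batchItemFailures.push({ itemIdentifier: record.dynamodb.SequenceNumber });
    }
  }

  // Lambda retries only the reported items instead of the whole batch
  return { batchItemFailures };
};
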
houmark commented 3 years ago

While I appreciate all the alternative approaches shared here, I don't see a solution, short of a lot of complexity, that can return counts for filtered queries (including when the query is sent to Elasticsearch). That is really what's needed in order to show a total result count even when only listing, say, 20 items in a result set. A global count is not very useful in many cases, and attempting to pre-count the results for every possible filtered query would require quite a lot of work.

The AWS Amplify team has created an RFC in relation to @searchable, and in a comment I requested that counts be included in those improvements. If you support that, please voice your opinion here: https://github.com/aws-amplify/amplify-cli/issues/7546#issuecomment-873368552

sacrampton commented 3 years ago

Hi Everyone - whilst I agree that totals should indeed be available out of the box, our solution is built around ElasticSearch, and I thought I'd share what we did because custom searchable resolvers have been really important for us to query flexibly, and totals are just one benefit.

Firstly, we implement the security model in the resolver itself. For example, we use dynamic auth, but do it in the resolver. In the resolver below, the user is given a group in Cognito equal to the Plant ID. The SHOULD query at the bottom means that whatever people query gets an additional filter that excludes anything they don't have access to. By including the security model in the query resolver, the TOTAL we get from the query is accurate and only reflects records we are authorized to see.

In the truncated resolver below you will see we include a huge number of query options - MUST MATCH, MUST MATCH KEYWORD, MUST MATCH QUERY STRING, MUST EXISTS - then "must not" versions of those, and "or" / "should" versions of all of them. There is also the ability to query ranges (i.e. date/number greater than, date/number between values) with "not"/"or" variants, plus geospatial queries (e.g. find all assets within 100m of this asset). The majority of these queries are not even remotely possible with standard DynamoDB, and a counter stored in a DynamoDB table could not produce these values dynamically. We also use search_after to paginate really large data sets, which from/size cannot.

Once you have built the resolver, the totals are output dynamically and reflect really powerful, flexible queries that are not possible elsewhere. The current searchable resolvers are really bare-bones, minimal capability - our experience is that replacing most of the content of the existing resolvers is required. It has been really worthwhile to build out that capability for our solution. Totals just happen to be a nice by-product of powerful, flexible queries.

Just including this to share in more detail how we went about it.

## The resolver is proprietary to our data model, and is very large, so I have stripped out a lot of the content to give you an idea of the overall logic we use to build custom searchable resolvers.

## Get variables input from query and count everything that was submitted.

#set( $indexPath = "/asset/doc/_search" )
{ "version": "2017-02-28", "operation": "GET", "path": "$indexPath","params": {"body": {

  "search_after": [ 
  #set( $saloop = 0 )
  #foreach( $userSearchAfter in $userSearchAfters )
        "${userSearchAfter}" 
        #set( $saloop = $saloop + 1 )
        #if( $saloop < $numberOfSearchAfters ) , #end
  #end 
  ],
#else
"from": #if( $context.args.from ) $from #else 0 #end,
#end
"size": #if( $context.args.limit ) $size #else 10 #end,
"sort": #if( $numberOfGeoFilters > 0 )
            [{  "_geo_distance": {
                "order": "asc",
                "unit": "${unit}",
                "mode": "min",
                "distance_type": "arc",
                "geolocation": $util.toJson($ctx.args.geoFilter.location)
            }}],
        #else
          [
          #if( $context.args.sort )
              #foreach( $userSort in $userSorts )
                #if( !$util.isNullOrEmpty($userSort.field) && !$util.isNullOrEmpty($userSort.direction) ) {
                  #if( $userSort.string == "keyword" ) "${userSort.field}.keyword" 
                  #else "${userSort.field}" 
                  #end : { "order": "$userSort.direction" } },
                #end              
              #end
          #end    
          {"id.keyword" : { "order": "asc" } }],
        #end
"version": true,
"query":

    ## Enter MUSTMATCH Values
        #if( $numberOfMustMatch > 0 )
          #foreach( $userMustMatch in $userMustMatchs )
            ,{ "match" : { "${userMustMatch.field}": "${userMustMatch.query}"}}
          #end
        #end

    ## Enter MUSTMATCHKEYWORD Values

    ## Enter MUSTEXISTS Values

    ## Enter RANGESINGLE Values

    ## Enter RANGEDUAL Values

    ## Enter MUSTQUERYSTRING Values

## --------------------------- OR QUERIES 
    ## Enter ORMUSTMATCH Values

    ## Enter ORMUSTMATCHKEYWORD Values

    ## Enter ORMUSTEXISTS Values

    ## Enter ORRANGESINGLE Values

    ## Enter ORRANGEDUAL Values

    ## Enter ORMUSTQUERYSTRING Values

## --------------------------- MUST NOT MATCH ------------
    ## Enter MUSTNOTMATCH Values

    ## Enter MUSTNOTMATCHKEYWORD Values

    ## Enter MUSTNOTEXISTS Values

    ## Enter MUSTNOTQUERYSTRING Values

## ---------OR MUST NOT QUERIES -------------------------------------------
      ## Enter ORMUSTNOTMATCH Values

      ## Enter ORMUSTNOTMATCHKEYWORD Values

      ## Enter ORMUSTNOTEXISTS Values

      ## Enter ORMUSTNOTQUERYSTRING Values

## ---------------------------------------------------

    ## Enter GEOFILTER values
        #if( $numberOfGeoFilters > 0 )
          ,{"bool": {"filter": {
            "geo_distance": {
              "distance": "${distance}${unit}",
              "geolocation": $util.toJson($ctx.args.geoFilter.location)
          }}}}
        #end

    ## Enter SHOULD values for GROUPS
        #if( $XXXXGroup == false )
            #if( $numberOfGroups > 0 )
                ,{"bool": { "minimum_should_match": 1, "should" : 
            #end
            #if( $numberOfGroups == 1 )
                {"match": {"assetPlantId.keyword": "${userGroup}"}}
            #end
            #if( $numberOfGroups > 1 ) [
                #set ( $i = 0 )
                #foreach( $userGroup in $userGroups )
                    #if( $i > 0 ) , #end
                    #set ( $i = $i + 1 )
                    {"match": {"assetPlantId.keyword": "${userGroup}"}}
                #end ]
            #end }}
        #end
#end
          ]
        }
      }
    }
  }
}
houmark commented 3 years ago

@sacrampton Now that looks like something useful and also something that has the potential to make it into the Amplify core. Thanks a lot for sharing!

kgoyal98 commented 3 years ago

Amplify could support this by automatically adding a count field whenever a list is defined, and internally updating it whenever an element is inserted or deleted.

chris-lee-code commented 3 years ago

It's been a long discussion here, but why is this feature not officially there yet?

matart15 commented 3 years ago

So no Amplify user ever implements this, or a "go to last page" button?

[screenshot of a pagination control with a go-to-last-page button]

It should be an official feature.

If the total count is a heavy load, how about a calculateTotalCount: Boolean option with a default value of false?

warlord987 commented 3 years ago

This is very frustrating. A lot of our clients complain about pressing the next button multiple times to reach their page. I cannot implement go-to-page without knowing the number of pages. Any update on this?

biller-aivy commented 3 years ago

Since there is obviously no real support for this from DynamoDB, which is fine, you generally have to take a different approach. We have the @searchable directive - what if Amplify also offered an @sql or @athena directive?

My current setup: DynamoDB is crawled twice a day (AWS Glue), and the new entries are stored in a SQL-queryable copy on S3 in Parquet format.

So for tables I can get additional information like counts, and "jump to page" would also work.
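
As a rough illustration of that setup, the count can come from a plain SQL query over the Parquet copy issued through the Athena SDK. The database, table, and output location below are assumptions:

// Hypothetical sketch: COUNT(*) over the Glue/Parquet copy via Athena (aws-sdk v2).
// Database, table, and output bucket names are assumptions.
const AWS = require("aws-sdk");
const athena = new AWS.Athena();

const startCountQuery = () =>
  athena.startQueryExecution({
    QueryString: "SELECT COUNT(*) AS total FROM posts_copy",
    QueryExecutionContext: { Database: "amplify_copy" },
    ResultConfiguration: { OutputLocation: "s3://my-athena-results/" },
  }).promise();

// Poll getQueryExecution until the state is SUCCEEDED, then read the
// count from getQueryResults.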

matart15 commented 3 years ago

The current problem I am facing is search + pagination.

A user can search by title_start_with + age_between + status.

Now I have to make my model @searchable to jump to specific pages, plus I have to calculate the total count with a separate query - which amounts to implementing DB search from scratch. And what if a user can search by a combination of 5 fields?

If I have to implement this manually, there is no merit in using @model, which is almost the whole reason I chose Amplify over AppSync + CDK + serverless MySQL.

duwerq commented 3 years ago

If you have records in the 15,000-or-less range, a potential solution is to grab the first set of results and run a separate "pagination" query against a dedicated GSI covering the whole set of records.

Won't that query for the whole set of records be expensive? No, it doesn't have to be. Give the new GSI a KEYS_ONLY attribute projection, so it only stores the PK/SK.

Then do one query for all the results. One record is only about 70-75 bytes, and one query can return up to 1 MB.

1 MB / ~70 bytes is roughly 14-15K records. Now you have the full set of keys with their indexes, and when you want a specific page, you just grab the PK/SK at that page's index from your "pagination" query.
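
A sketch of that idea, assuming a "byType" GSI with a KEYS_ONLY projection (the index name and key attributes are assumptions): fetch all keys once, remember the key at each page boundary, and use it as ExclusiveStartKey when the client jumps to a page.

// Hypothetical sketch: build a page index from a KEYS_ONLY GSI.
const aws = require("aws-sdk");
const db = new aws.DynamoDB.DocumentClient();
const PAGE_SIZE = 20;

const buildPageIndex = async () => {
  const keys = [];
  let ExclusiveStartKey;
  do {
    const res = await db.query({
      TableName: process.env.TABLE_NAME,
      IndexName: "byType", // KEYS_ONLY projection, so items contain only key attributes
      KeyConditionExpression: "#t = :post",
      ExpressionAttributeNames: { "#t": "__typename" },
      ExpressionAttributeValues: { ":post": "Post" },
      ExclusiveStartKey,
    }).promise();
    keys.push(...res.Items);
    ExclusiveStartKey = res.LastEvaluatedKey;
  } while (ExclusiveStartKey);

  return {
    totalCount: keys.length,
    // the last key of page n is the ExclusiveStartKey for page n + 1
    pageStartKeys: keys.filter((_, i) => i % PAGE_SIZE === PAGE_SIZE - 1),
  };
};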

@naseer036 could this work for you?

duwerq commented 3 years ago

The current problem I am facing is search + pagination.

A user can search by title_start_with + age_between + status.

Now I have to make my model @searchable to jump to specific pages, plus I have to calculate the total count with a separate query - which amounts to implementing DB search from scratch. And what if a user can search by a combination of 5 fields?

If I have to implement this manually, there is no merit in using @model, which is almost the whole reason I chose Amplify over AppSync + CDK + serverless MySQL.

I could be mistaken, but I thought Elasticsearch returns hits.total for any query - couldn't you use that? This is an old resolver template from Amplify's early days:

#if (! $util.isNull($context.args.paging))
  #set($Integer = 0)
  #set($from = $util.defaultIfNull($Integer.parseInt($context.args.paging.nextToken), 0))
#else
  #set($from = 0)
#end

#set($result = { "items": [] })
#foreach($entry in $context.result.hits.hits)
  $util.qr($result.items.add($entry.get("_source")))
#end

#set($last = $from + $result.items.size())
#if($context.result.hits.total > $last)
  $util.qr($result.put("nextToken", $last))
#end

$util.qr($result.put("totalCount", $context.result.hits.total))

$util.toJson($result)
sacrampton commented 3 years ago

Yes @duwerq - it does and that's what I've been using for a couple of years - works brilliantly. But as I said before, you need to implement your security (auth) model in the request resolver to ensure your response resolver is giving accurate totals.

duwerq commented 3 years ago

Yes @duwerq - it does and that's what I've been using for a couple of years - works brilliantly. But as I said before, you need to implement your security (auth) model in the request resolver to ensure your response resolver is giving accurate totals.

Okay, so you're saying that if you rely on auth to filter records only in the VTL response, it will give an inaccurate total. Are you suggesting any auth params also be passed into the must/match clauses of the Elasticsearch query?

@sacrampton