aws-amplify / amplify-category-api

The AWS Amplify CLI is a toolchain for simplifying serverless web and mobile development. This plugin provides functionality for the API category, allowing for the creation and management of GraphQL and REST based backends for your amplify project.
https://docs.amplify.aws/
Apache License 2.0
87 stars 75 forks

Total counts in GraphQL queries #405

Open houmark opened 5 years ago

houmark commented 5 years ago

Is your feature request related to a problem? Please describe. I think it's way past due that Amplify supports total counts in GraphQL queries. Other GraphQL-based platforms have this built in: by simply adding totalCount (or similar) to the query, no matter the limit, you get back the total count in addition to the (filtered) data.

Describe the solution you'd like This should at least work for DynamoDB-backed models, and of course also for search-based queries that pass through Elasticsearch.

Describe alternatives you've considered Making a Lambda function that is a field on each model using the @function directive. But since we use both listItems and searchItems with filters added, the implementation is not simple: we would have to reapply those filters in the Lambda function to get the correct count.

Making custom resolvers seems like another not very fun route, and not very scalable or maintainable. Once again, this should be an "out of the box" one-liner for a developer. With either a Lambda or some other custom resolver, I'm looking at hours or days of development.

Additional context This is a must-have feature, and there's not really any workaround for displaying total counts in systems with many items, at least none that I know of. I read several bug reports, but none of them seems to have a simple solution. That it has not yet been developed by AWS is beyond my understanding, as pulling out counts is one of the most common things to do when developing web apps.

RossWilliams commented 5 years ago

There is a workaround for this and a reason why it is not automatically included as part of amplify - DynamoDB does not have an efficient way to count items, and can't in general perform aggregate operations. If you want to count items, write a custom lambda to read from the dynamo table stream and conditionally increment or decrement a counter on the table. Add a lambda resolver to read this counter value from the table, and add it to your schema.

Because DynamoDB can't perform aggregate operations, and you potentially want item counts after a filter expression is applied, having the CLI provide this feature is a tall order.
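As a rough illustration of the stream-counter workaround described above (hypothetical Node.js code; the table name, key shape, and attribute names are invented, and the DocumentClient is injected rather than constructed):

```javascript
// Sketch of the stream-based counter idea. All names (CounterTable,
// itemCount, the key shape) are invented for illustration.

// Pure helper: net change in item count for a batch of stream records.
function computeCounterDelta(records) {
  return records.reduce((delta, record) => {
    if (record.eventName === "INSERT") return delta + 1;
    if (record.eventName === "REMOVE") return delta - 1;
    return delta; // MODIFY events do not change the item count
  }, 0);
}

// Stream handler: apply the delta with an atomic ADD, so concurrent
// invocations cannot lose increments. `docClient` would be an AWS SDK
// DynamoDB DocumentClient in real code.
async function updateCounter(event, docClient) {
  const delta = computeCounterDelta(event.Records);
  if (delta === 0) return delta;
  await docClient
    .update({
      TableName: "CounterTable",
      Key: { pk: "Item#count" },
      UpdateExpression: "ADD #c :d",
      ExpressionAttributeNames: { "#c": "itemCount" },
      ExpressionAttributeValues: { ":d": delta },
    })
    .promise();
  return delta;
}
```

A Lambda resolver (or an @function field) can then read this single counter row back out for the schema.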

houmark commented 5 years ago

Yeah, I'm aware of DynamoDB's limitations, but asking the developer to implement correct counts is also a tall order. Since I can have any combination of filters, I would need to count based on each of them, and that seems like a huge implementation.

A count of items is something most apps use, so while I may be able to develop something that works decently, this feature request really is about either DynamoDB fixing this or Amplify creating some kind of sensible workaround that does not put all the workload on the developer.

Thanks for your feedback. The things you outline have been in my thoughts as a workaround for this, but coming up with a sensible workaround that respects any type of filter on any table seems like something I could spend a week or two developing and may still be prone to wrong counts in edge cases.

Maybe Elasticsearch has counts of total results even when a limit is applied, and this could be included in the results as a decent workaround? We use Elasticsearch for most queries, as the default filters are too limited anyway.

RossWilliams commented 5 years ago

Elastic does provide this information, you can access it in the resolver template with: $ctx.result.hits.total.

There is a difficult disconnect between the GraphQL and DynamoDB development mindsets. DynamoDB requires planning most access patterns in the design phase, while GraphQL makes an iterative approach seem straightforward. While Amplify backed by DynamoDB does offer some flexibility, it requires more ahead-of-time planning than other platforms, in this case with aggregates. If you know your aggregates now, and know they are stable, developing the Lambdas is doable. Solving the general case sounds much more difficult as long as your data is backed by DynamoDB.

I agree that it would be helpful to have templates to speed up creation of lambda aggregate creators (including backfilling data).

houmark commented 5 years ago

Elastic does provide this information, you can access it in the resolver template with: $ctx.result.hits.total.

That sounds promising. Any hints on how I could create a @function directive to map to a resolver returning the total hits, without having to overwrite my default resolvers? I have not yet done any custom resolvers or overwritten in this project and I'm trying to avoid that. I have done a few @function directives that run a Lambda for other things (and that works well).

RossWilliams commented 5 years ago

Elasticsearch by default returns the total hit count, and by default it's accurate up to around 10,000. So if you make a REST API call in your Lambda, you will get this back in the response. Just update your schema to include it as a property. You will likely need a new type to hold the items, nextToken, and total properties.
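A minimal sketch of reshaping an Elasticsearch response into such a type (the function and field names here are illustrative, not an official API):

```javascript
// Map a raw Elasticsearch search response into the connection shape
// discussed above: { items, nextToken, total }. In ES 6.x hits.total is a
// number; in ES 7+ it is an object like { value, relation }, handled below.
function toConnection(esResponse, nextToken) {
  const hits = esResponse.hits || { hits: [], total: 0 };
  const total = typeof hits.total === "object" ? hits.total.value : hits.total;
  return {
    items: hits.hits.map((h) => h._source),
    nextToken: nextToken || null,
    total,
  };
}
```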

jonmifsud commented 5 years ago

I do think that $ctx.result.hits.total should be exposed by default since it is readily available.

In terms of DynamoDB support, I'm not too sure. I mean, it would be nice to have, but if you have a large dataset, getting the count on every request will be expensive, because every time you go over 1 MB of data you have to paginate through to get totals. In reality, you could be scanning through a table of millions of records before getting a final result, which obviously makes no sense in terms of performance.
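To make that cost concrete, counting via Scan means one round trip per roughly 1 MB page. A sketch of the loop (the page-fetching function is injected here so the pagination logic is visible; in real code it would wrap DocumentClient.scan with Select: "COUNT"):

```javascript
// Count every item by paging through Scan results until no
// LastEvaluatedKey remains. Each call to scanPage represents one request
// that reads up to 1 MB of data, so large tables mean many round trips.
async function countAll(scanPage) {
  let total = 0;
  let startKey;
  do {
    const page = await scanPage(startKey); // -> { Count, LastEvaluatedKey? }
    total += page.Count;
    startKey = page.LastEvaluatedKey;
  } while (startKey !== undefined);
  return total;
}
```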

undefobj commented 5 years ago

If we were to build this into Amplify in the future with DynamoDB, could you give a bit more detail on some things:

Any other specifics on your requirements or use cases would help us look into this in the future.

jonmifsud commented 5 years ago

@undefobj

I would say it would have to be the total count of items matching a query. This would be primarily needed for pagination purposes, it's very hard to give a good user experience if we are supposedly showing X results at a time out of Y number of records. This would also need to work with custom indexes, as most likely that's the most efficient setup when listing tables unless we are using searchable.

I wouldn't mind if additional Lambda functions were added to the GraphQL query; however, I think it would be wise if we were able to choose whether to request/run this or not. If we are paginating across 1000 results, 10 results at a time, I don't think we need to run the total-length query 10 times; it could be that the first request is enough. Also, there could be instances where I wouldn't care much about the count and as a result wouldn't want to run the Lambda function for no reason. Potentially another way to get around this could be a "cache"/DynamoDB table store for counts, with rules for how long count results remain valid.
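The count-cache idea could look something like this (all names are hypothetical; a Map stands in for the DynamoDB counter table, and the clock is injected so staleness is visible):

```javascript
// Cache computed counts and only recompute after maxAgeMs, so the
// expensive count query does not run on every paginated request.
function makeCountCache(computeCount, maxAgeMs, now = Date.now) {
  const cache = new Map();
  return async function getCount(key) {
    const hit = cache.get(key);
    if (hit && now() - hit.at < maxAgeMs) return hit.value; // still fresh
    const value = await computeCount(key); // expensive path (Scan/aggregate)
    cache.set(key, { value, at: now() });
    return value;
  };
}
```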

I hope that makes sense.

houmark commented 5 years ago

I agree with @jonmifsud — the main interest for us is to have a count of the filtered query, so we can do a proper pagination and show the total results of that query.

A full count would be nice, but if you do a normal query, shouldn't that result in a count of all items? Also, a way to just return the count, instead of having it as part of the result dataset, would be nice.

I also would like to be able to choose when to use this and when to not, to avoid extra processes running with load time and costs overhead.

undefobj commented 5 years ago

You should be able to do pagination using nextToken without needing to know the count. However, what you propose above, a count on the query results, would be difficult. While not impossible, it would need a lot of design and thought. I'll keep the issue open so that we can get other community feedback and measure interest, but for transparency, it wouldn't be delivered anytime soon.

houmark commented 5 years ago

Associating the word "impossible" with a query count in 2019 makes me cringe a bit. More than that, it makes me wonder whether selecting Amplify and all its (current) dependencies was a very wrong choice.

The fact that DynamoDB does not do counts for its queries (besides a full-table estimated count every ~6 hours) is simply a limitation the team working on DynamoDB should solve. Every single competitor to DynamoDB handles this without issues, so I'm sure those smart people can also come up with a solution that benefits not just AppSync and Amplify users but also people using DynamoDB directly. Maybe counts would be approximate in the millions and more precise in the thousands, like MySQL / InnoDB, and that would be way better than having no clue whatsoever.

I am aware that using nextToken I can build pagination, but that paginator is somewhat less cool to look at from a UX perspective, as I won't be able to show 1, 2, 3, 4 ... 12 because I don't know how many pages I have. When someone wants to know how many items fit this filter, it cannot be that I have to pull them all out (only the id field) in the leanest way and then count the array client-side?

I'm sure AWS compares themselves in some ways with other GraphQL services like Prisma etc. and they don't seem to have a problem supporting this.

This is a DynamoDB limitation. Attacking a solution on top of that for AppSync is the wrong angle; this needs to end up on the table of the AWS DynamoDB developers so they can come up with a sensible solution nearer to the root of the problem. Everything else is a hack. Asking me to keep counts in a model/table myself when things update is even worse, and not what you'd expect of a platform with an otherwise impressive feature set.

And if it's not possible for DynamoDB to solve this, then the Amplify / AppSync team should start considering built-in support for other major databases such as MongoDB, MySQL, Postgres, etc., so they are not held back by a half-baked database backing the entire thing. But once that is considered, I am sure it looks far more interesting to just figure out a solution to counts and the other minor limitations DynamoDB currently has.

RossWilliams commented 5 years ago

@undefobj What I would like to see to support this type of feature is to utilise Kinesis Firehose and Glue to send data into Redshift or S3 Parquet. Then I could connect Athena as a serverless analytics system, query Redshift for a non-serverless solution, or have another Lambda pick up the objects from S3 and send them into Postgres. Amplify is well placed here: the api category can pick up on schema changes to re-run Glue, Amplify can make life easier by creating a new Firehose for each table, and it can set up the Lambdas to put DynamoDB stream data into Firehose. I realise it's a big ask, but the question can be generalised into "how can Amplify better support analysis workloads?"

I see a smaller and faster win in just providing an aggregation template Lambda to deal with uncomplicated counters. I would not like aggregation counters enabled by default, as I don't want to pay for what I don't use.

undefobj commented 5 years ago

@undefobj What I would like to see to support this type of feature is to utilise Kinesis Firehose and Glue to send data into Redshift or S3 Parquet. Then I could connect Athena as a serverless analytics system, query Redshift for a non-serverless solution, or have another Lambda pick up the objects from S3 and send them into Postgres. Amplify is well placed here: the api category can pick up on schema changes to re-run Glue, Amplify can make life easier by creating a new Firehose for each table, and it can set up the Lambdas to put DynamoDB stream data into Firehose. I realise it's a big ask, but the question can be generalised into "how can Amplify better support analysis workloads?"

We do support Kinesis Streams in the Analytics category today: https://aws-amplify.github.io/docs/js/analytics#using-amazon-kinesis I have to check on the PR for supporting Firehose but I know it's in progress.

This seems to be an ask independent of the issue in this thread, though, as it relates to ingest. Total counts in a decoupled system wouldn't be accurate for pagination against DynamoDB, as you'd run into consistency issues. For the analytics requests, I think this would be a good ask in a separate feature request issue.

RossWilliams commented 5 years ago

I wasn't as interested in strong consistency or in using it for pagination as in the total count calculation and getting aggregations in general. The ask is about bringing other database systems' capabilities into Amplify, which is where DDB stream -> Firehose would come in handy as a building block.

houmark commented 5 years ago

This got sidetracked by different ideas, but I'd like to know if we can expect total counts for at least Elasticsearch based queries soon?

@undefobj $ctx.result.hits.total is in the resolver response template already, so there should be very little code needed to get this wired all the way through. This would be a decent workaround for overall total counts for now. It seems doing this for normal list queries against DynamoDB is not anything to expect soon.

houmark commented 4 years ago

It seems like aws-amplify/amplify-cli#2600 made aws-amplify/amplify-cli#2602 happen, so here's to hoping that PR gets accepted fast.

sallar commented 4 years ago

@houmark isn't that PR related to the @searchable attribute? I want to know how many items are in a collection in total. I don't know if that PR solves this issue.

For example, a user has many books in their library, and I want to show on their profile page how many books are in their library. I don't want to load all of their books just to count them. There might be thousands.

houmark commented 4 years ago

Yeah, I first thought that PR would solve it, due to the original code that was halfway baked in at the time. But that PR changed to just return the total of the returned items, which is more or less useless, because it's very easy to count the number of results returned client-side.

I don't think the Amplify team is working on a total results value due to the limitations and complications of passing through @auth.

kaustavghosh06 commented 4 years ago

@houmark I wouldn't call it "useless", since if you're getting paginated results, say 10 at a time, you might want the total hits (if you have more than 10 results) to display information like the number of pages in your UI.

houmark commented 4 years ago

@kaustavghosh06 But how can I do that? I get 10 when I limit to 10 in the query. If I limit to 20, the total is 20? If I am on the "last page" and limit to 20 but there are only 5 results left, I get 5. But all of that I can do with result.whatever.items.length client-side.

If I try to pull out 1,000, then I hit another limitation: the result set being too large (I don't remember the exact error, but it errors). And if I query a lot, but not enough to hit the limit, then it of course takes many seconds to get a result, which in a UI leads to racing situations, and I cannot cancel queries due to another limitation (which has a PR, but that PR has been stuck for a good long time now). Anyway, if I have 3k items, it will still give me 1,000 in total, so that does not really help me with pagination, does it?

Am I missing something here?

kaustavghosh06 commented 4 years ago

@houmark Do you have @auth on @searchable?

kaustavghosh06 commented 4 years ago

I'll speak to @SwaySway about this behavior. I would expect the total hits to be available to the client for pagination purposes.

houmark commented 4 years ago

That was what I expected also, but this was changed in the PR because it would leak data about the total results when @auth is filtering out results due to "access control".

See his comment here: https://github.com/aws-amplify/amplify-cli/pull/2602#issuecomment-543844821

houmark commented 4 years ago

For us, first of all, we would not have a leak, because we don't have advanced access control in our models, so we would get the right number of results. But even if we did, I would have liked at least the flexibility to allow "leaking" and get the result as a developer. This could come with a warning by Amplify, but as there's no alternative way to get totals in Amplify (besides maybe writing and maintaining resolvers ourselves, which seems like a bad path), it would be nice to have some indicator to show in the UI (for example, "around 2,604 results").

Of course, the optimal solution would be for @auth to hook into the result and add in the total after filtering, but I understand that's a major undertaking and would potentially slow down all queries.

sallar commented 4 years ago

I'm flabbergasted. How can a database system have support for pagination but give no information about the total results? How do you build the pagination UI then? Sure, you can show "Next" until there isn't a nextToken anymore, or do infinite scrolling, but those don't work for everyone. I almost can't believe that DynamoDB can't give you a simple count.

psporysz commented 4 years ago

Too bad I wasn't aware of that limitation when the project started. Looks like I have to switch over to an RDBMS like Aurora and do all the hard work of custom resolvers.

RossWilliams commented 4 years ago

@psporysz DynamoDB doesn't prevent this UI pattern, it only requires that you do additional work. You can set up a streams trigger to keep a record count yourself. You can then fetch the count alongside the fetch for data to display a total count.

Or, you can sort the items with a monotonically increasing key and have a stream trigger insert a whole number count on the item, this would then allow a skip and limit pattern on a secondary index.

Unless there are other significant reasons, DynamoDB can still handle many patterns with a bit more work.
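The monotonically increasing key idea can be sketched as a helper that turns a page number into query parameters, assuming a stream trigger stamps each item with a 1-based seq attribute on a secondary index (all names here are invented):

```javascript
// Derive Query parameters for "page N of size S" over items stamped with
// a 1-based sequence number `seq`. Items on page N have seq values
// (N-1)*S + 1 .. N*S, so the key condition starts just past the prior page.
function pageBounds(pageNumber, pageSize) {
  const from = (pageNumber - 1) * pageSize; // seq of the last item on the prior page
  return {
    KeyConditionExpression: "gsiPk = :pk AND seq > :from",
    ExpressionAttributeValues: { ":pk": "ALL", ":from": from },
    Limit: pageSize,
  };
}
```

One caveat: deletes leave holes in seq unless the trigger renumbers, so this fits mostly append-only data.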

psporysz commented 4 years ago

Or, you can sort the items with a monotonically increasing key and have a stream trigger insert a whole number count on the item, this would then allow a skip and limit pattern on a secondary index.

Let's say I have a secondary index on a date field and I would like to know the number of items within some date range. Can you explain how to apply the above technique in this case?

davidbiller commented 4 years ago

@psporysz DynamoDB doesn't prevent this UI pattern, it only requires that you do additional work. You can set up a streams trigger to keep a record count yourself. You can then fetch the count alongside the fetch for data to display a total count.

Or, you can sort the items with a monotonically increasing key and have a stream trigger insert a whole number count on the item, this would then allow a skip and limit pattern on a secondary index.

Unless there are other significant reasons, DynamoDB can still handle many patterns with a bit more work.

What happens with 2 simultaneous posts? Is this safe in DynamoDB?

odiaseo commented 4 years ago

Can anyone just give an example of how to get the total hits from Elasticsearch?

houmark commented 4 years ago

You can't get totals from Elasticsearch; Amplify/AppSync strips that information, as the @auth directive can filter the final result of items if the user does not have access.

Effectively, at this time, there's no out of the box way to get even close to a precise count of a query result.

ghost commented 4 years ago

Is there any update on this issue? I'm just trying to get the total count from Elasticsearch using @searchable, but the service is not returning the total count; it just returns the paginated count. If I send no limit, the total count becomes 10, since that seems to be the default limit... I'm just trying to count the number of items hit by some filter...

danielcai1987 commented 4 years ago

In the AppSync console, I've tried adding a total to the schema

type ModelUserConnection {
    items: [User]
    nextToken: String
    total: Int
}

And attach a Resolver to the ModelUserConnection -> total

In "Configure the request mapping template." section

{
    "version" : "2017-02-28",
    "operation" : "Scan",
    "filter" : {
        "expression" : "#return( $util.toJson($ctx.source.scannedCount) )"
    }
}

In "Configure the response mapping template." section

$util.toJson($ctx.result)

then I can query the total

query userList{
  listUsers{
    total
  }
}

result would be

{
  "data": {
    "listUsers": {
      "total": 6
    }
  }
}

I'm new to Amplify and AppSync, so I'm not sure whether this is the correct way to do this.

Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.Count https://docs.aws.amazon.com/appsync/latest/devguide/resolver-mapping-template-reference-dynamodb.html#aws-appsync-resolver-mapping-template-reference-dynamodb-filter https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LegacyConditionalParameters.ScanFilter.html

davidbiller commented 4 years ago

I've tried to add a total in the schema

type User @model {
  id: ID!
  name: String!
  total: Int
}

In AppSync Console, attach a Resolver to the ModelUserConnection -> total

In "Configure the request mapping template." section

{
    "version" : "2017-02-28",
    "operation" : "Scan",
    "filter" : {
        "expression" : "#return( $util.toJson($ctx.source.scannedCount) )"
    }
}

In "Configure the response mapping template." section

$util.toJson($ctx.result)

then I can query the total

query userList{
  listUsers{
    total
  }
}

result would be

{
  "data": {
    "listUsers": {
      "total": 6
    }
  }
}

I'm new to Amplify and AppSync, so I'm not sure whether this is the correct way to do this.

Reference: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Scan.html#Scan.Count https://docs.aws.amazon.com/appsync/latest/devguide/resolver-mapping-template-reference-dynamodb.html#aws-appsync-resolver-mapping-template-reference-dynamodb-filter https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LegacyConditionalParameters.ScanFilter.html

It is a solution, yes. But a Scan operation is expensive and slow. If you have 6 items, it's okay, but if you have 100k, it's not good to use this option.

danielcai1987 commented 4 years ago

@davidbiller Thanks for the reply. In your opinion, what is the good option? Thanks!

RossWilliams commented 4 years ago

@danielcai1987 Scan is limited to reading 1 MB of stored data at a time; if you ever expect this table to hold more data, the above solution will not work.

davidbiller commented 4 years ago

@danielcai1987 https://medium.com/signiant-engineering/real-time-aggregation-with-dynamodb-streams-f93547cfb244

samstr commented 4 years ago

I can't believe this is still unresolved

davidbiller commented 4 years ago

This is a technical limitation. Don't think anything's gonna happen here anytime soon.

simpson commented 4 years ago

Anything on this?

narayanncube commented 4 years ago

Yeah.

Any kind of app will require this, unless the solution is for playground use only.

ricsirigu commented 4 years ago

It's possible to get scannedCount and count as results from the GraphQL query

read here

sallar commented 4 years ago

Thanks. I wonder why there is no official documentation about this.

davidbiller commented 4 years ago

Thanks. I wonder why there is no official documentation about this.

Because of the 1 MB limit. So it's not really an option.

lexfaraday commented 4 years ago

I hope this technical limit in DynamoDB gets solved, but modifying the GraphQL schema is not an option for me. In the meanwhile, I've created a new DynamoDB table with owner and count fields; this table is updated via a Lambda trigger after each insert/delete in the other DynamoDB table. Then, when I need to know the total count by owner, I query the counter table and the data from the other DynamoDB table (limit of 10 results) in parallel. With this, I have isolated the counts by owner without doing a full scan, and I have the total pages and total elements available for a good UX.

P.S.: You could use @function in the GraphQL schema to do that.

Maybe this solution helps you.

tkjaergaard commented 4 years ago

Just finished reading through this stream of comments. Many of you have done a ton of research, thank you! It really saved me a lot of time!

I'm gonna join the choir here and ask that this issue be recognised soon. It's really a pain to not have this functionality.

I've chosen to follow the path with a separate table where I save the aggregated (incremented) results based on the triggers. Not perfect, and it doesn't solve totalCount with filters, as others have mentioned.

Hernanm0g commented 4 years ago

I've tried @ricsirigu's suggestion here, but when limit = 10 I have:

{
  items: [10 items],
  nextToken: "asdasdasd",
  scannedCount: 10,
  count: null
}

The length of my result set (without filtering) is about 200 rows, and the full table cardinality is >200k rows.

Any suggestions?

My query is based on a secondary index named "byMerchant", and the query is "transactionsByMerchant".

My schema.graphql

type ModelTransactionConnection {
  items: [Transaction]
  nextToken: String
  scannedCount: Int
  count: Int
}

type Transaction @model 
@key(name: "byMerchant", fields: ["merchant"], queryField: "transactionsByMerchant")
{
  id: ID!
  merchant: ID
  ...
}

RossWilliams commented 4 years ago

@Hernanm0g limit is applied before the filter in DynamoDB; this is the opposite of other databases, so using the blog post's approach for pagination will not work. https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Query.html

Some more information: DynamoDB will only ever read 1 MB of data from disk for a query or scan, so you cannot find the total count by reading the items from disk. If this were not the case, you could potentially ask DynamoDB to read 1 PB of data and degrade performance for other users.

Counts are a type of database aggregation that you must manage manually with DynamoDB. Consider using a table stream lambda to manage the counts manually.
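The limit-before-filter behaviour described above can be illustrated with a tiny simulation (plain JavaScript, no DynamoDB involved; the function is only a model of the semantics):

```javascript
// DynamoDB applies Limit to the items it *reads*, then runs the filter,
// which is why `count` can come back far smaller than the limit.
// Simulated here with plain arrays.
function simulateQuery(items, limit, filter) {
  const read = items.slice(0, limit); // Limit caps items read, not matched
  const matched = read.filter(filter); // filter runs on the read page only
  return { scannedCount: read.length, count: matched.length };
}
```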

mbappai commented 4 years ago

I hope this technical limit in DynamoDB gets solved, but modifying the GraphQL schema is not an option for me. In the meanwhile, I've created a new DynamoDB table with owner and count fields; this table is updated via a Lambda trigger after each insert/delete in the other DynamoDB table. Then, when I need to know the total count by owner, I query the counter table and the data from the other DynamoDB table (limit of 10 results) in parallel. With this, I have isolated the counts by owner without doing a full scan, and I have the total pages and total elements available for a good UX.

P.S.: You could use @function in the GraphQL schema to do that.

Maybe this solution helps you.

Glad to see a lot of research has already been done on this issue; I really hope to have it resolved soon. In the meantime, the solution you provided seems ideal for my current use case. However, I am quite new to AWS. If it's not much trouble, could you elaborate on the hint you provided about how to make a trigger out of the @function directive, which I presume will update the count table by listening to insert/delete events on the source table?

esteban-uo commented 4 years ago

Is this already being considered by any AWS team? Some words about it would give us hope regarding this limitation. Thanks!