aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0
3.93k stars, 698 forks

Pagination for TimestreamDB #838

Closed sutiv closed 3 years ago

sutiv commented 3 years ago

Queries can take a long time and the AWS API gateway times out.

Is there a possibility to use pagination as offered by boto3? https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/timestream-query.html#TimestreamQuery.Client.get_paginator

kukushking commented 3 years ago

Hi @sutiv, pagination is implemented, but it appears to use the default config, which essentially means no pagination. I'll add a parameter for pagination details.

sutiv commented 3 years ago

thanks a lot!

sutiv commented 3 years ago

Almost perfect, but isn't there something missing?! Something like next_token = page.get('NextToken').

paginator.paginate() will return a 'NextToken' if a 'StartingToken' was given AND the page isn't the last page. This NextToken isn't returned by your solution, and thereby I can't call the next page.

kukushking commented 3 years ago

Actually, the current implementation iterates through all pages and collects the results so you don't have to:

    for page in paginator.paginate(QueryString=sql, PaginationConfig=pagination_config or {}):
        if not schema:
            schema = _process_schema(page=page)
        for row in page["Rows"]:
            rows.append(_process_row(schema=schema, row=row))

Although, now that I think about it, it would be useful to be able to iterate through pages in case the result set is too big. I'll add this.

jaidisido commented 3 years ago

Available in 2.11.0

jeffngo commented 2 years ago

Almost perfect, but isn't there something missing?! Something like next_token = page.get('NextToken').

paginator.paginate() will return a 'NextToken' if a 'StartingToken' was given AND the page isn't the last page. This NextToken isn't returned by your solution, and thereby I can't call the next page.

There is still a problem with the current implementation. 'NextToken' is still not part of the return value. What's the point of adding support for pagination when you don't return the pagination token?

kukushking commented 2 years ago

@jeffngo you don't have to retrieve the next page manually using a token. Pass chunked=True and wrangler will return an iterator of data frames, one per page of the result set, which you can iterate over lazily.

jeffngo commented 2 years ago

@kukushking Does that mean that awswrangler returns the full result set along with an iterator? If that's true, this solution will not scale well for a large dataset. AWS Timestream has implemented token-based pagination so that end-users can fetch a smaller subset of the full dataset within each request, and use NextToken in the next request to fetch the next page of results. Is there any chance we can return the NextToken as part of the response?
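For reference, the token-based flow described above can be sketched against the boto3 Timestream Query client, whose query call accepts QueryString and an optional NextToken. This is only a minimal sketch and is not part of awswrangler: fetch_page is a hypothetical helper name, and the client is passed in so the snippet itself needs no AWS credentials.

```python
from typing import Any, Dict, List, Optional, Tuple


def fetch_page(
    client: Any,
    sql: str,
    next_token: Optional[str] = None,
) -> Tuple[List[Dict[str, Any]], Optional[str]]:
    """Fetch a single page of results and return (rows, next_token).

    Hypothetical helper: `client` is expected to behave like
    boto3.client("timestream-query").
    """
    kwargs: Dict[str, Any] = {"QueryString": sql}
    if next_token is not None:
        # Resume from where the previous request stopped.
        kwargs["NextToken"] = next_token
    response = client.query(**kwargs)
    # The response carries no "NextToken" once the last page is reached.
    return response["Rows"], response.get("NextToken")


# Usage (requires boto3 and AWS credentials, so it is left commented out):
# import boto3
# client = boto3.client("timestream-query")
# rows, token = fetch_page(client, "SELECT ...")
# rows, token = fetch_page(client, "SELECT ...", next_token=token)
```

The returned token can be handed to a caller, who sends it back together with the same query string to request the following page.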

kukushking commented 2 years ago

@jeffngo no, if you pass chunked=True it will not read the full result set at once. It retrieves only the current page and fetches the next one when you ask the iterator for it.

    import awswrangler as wr

    dfs = wr.timestream.query(sql="...", chunked=True)  # returns an iterator, does not retrieve any results
    for df in dfs:
        print(df)  # retrieves and returns the df for the current page only

Just make sure you pass chunked=True to enable this behavior, otherwise it will indeed retrieve the full result set.
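The lazy behavior described here can be illustrated with plain Python generators. This is only an illustration of the idea, not awswrangler's actual implementation; fake_pages, iter_chunks and the fetched list are hypothetical names made up for the sketch.

```python
from typing import Any, Dict, Iterable, Iterator, List

fetched: List[int] = []  # records which pages have actually been retrieved


def fake_pages() -> Iterator[Dict[str, Any]]:
    """Stand-in for a paginator: a page is 'retrieved' only when requested."""
    for page_number in range(3):
        fetched.append(page_number)
        yield {"Rows": [f"row-{page_number}-{i}" for i in range(2)]}


def iter_chunks(pages: Iterable[Dict[str, Any]]) -> Iterator[List[str]]:
    """Yield the rows of one page at a time instead of collecting them all."""
    for page in pages:
        yield page["Rows"]


chunks = iter_chunks(fake_pages())  # creating the iterator retrieves nothing
first = next(chunks)                # retrieves the first page only
```

After the single next() call, only page 0 has been recorded in fetched; the remaining pages are retrieved one by one as the iterator is advanced.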

jeffngo commented 2 years ago

@kukushking I see. In my application, we want to return a pagination token to the client so that the client can decide when to go to the next/previous page. Is there a way to pull the next pagination token out of dfs in your example above?