aws / aws-sdk-pandas

pandas on AWS - Easy integration with Athena, Glue, Redshift, Timestream, Neptune, OpenSearch, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).
https://aws-sdk-pandas.readthedocs.io
Apache License 2.0

wr.timestream.query is very slow due to large number of pagination requests #1308

Closed sirgarfieldc closed 2 years ago

sirgarfieldc commented 2 years ago

My call to the query API

wr.timestream.query(query_str)

took 50 seconds to return a dataframe of about 6 MB containing 80K records.

I can see that the majority of the 50 seconds is taken up by dispatching a large number of pagination requests one after another.

I have tried changing the pagination config from the default values below.

{ "MaxItems": 10, "PageSize": 10, "StartingToken": "…"}

But I am unable to work out what value "StartingToken" should be.

I tried setting "MaxItems" = 100,000 and "PageSize" = 1000 without specifying "StartingToken", but only 20K records were returned and the entire call still took close to 40 seconds.
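
For reference, the attempt looked roughly like this (just a sketch; query_str is the query string from above):

import awswrangler as wr

# Sketch of the attempted call with a custom pagination config.
df = wr.timestream.query(
    query_str,
    pagination_config={"MaxItems": 100000, "PageSize": 1000},
)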

Question:

How can I make a large query faster by i) correctly increasing the page size, or ii) parallelizing requests for different pages?

kukushking commented 2 years ago

@NakedKoala StartingToken is required only if you wish to start from a specific page; you get it in the response of the request for the previous page. The pagination config you are using is correct; just note this:

Note
Services may choose to return more or fewer items than specified in the PageSize argument depending on the service, the operation, or the resource you are paginating.

In addition, there are limits on the size of the response:

If the size of the row (including the query metadata) exceeds 1 MB, then the query will fail with the following error message:

Query aborted as max page response size has been exceeded by the output result row

Not exactly sure why you're only getting 20K; perhaps a single page is over 1 MB, or you're being throttled by Timestream. Try smaller page sizes. Also note that you can iterate over each page if you pass chunked=True, so you don't need to wait for all requests to complete and can start processing the results partially:

import awswrangler as wr

for df in wr.timestream.query(query, chunked=True, pagination_config={"MaxItems": 100000, "PageSize": 500}):
    df.head()  # process each page here
sirgarfieldc commented 2 years ago

@kukushking

My use case (some global analytics) can't work with partial results.

It looks like the underlying service, Timestream, is not suitable for this query pattern (where I need to query hundreds of MBs' worth of data quickly).

I am considering periodically copying the raw data from Timestream to some other AWS data store that supports fast and efficient dumping of large amounts of data. Do you have any suggestions for this?

The dumbest idea would be to just pull the raw data from Timestream, dump it to CSV, and put it on S3.
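
Something like this, just as a sketch (the bucket path is a placeholder):

import awswrangler as wr

# Pull the full query result, then dump it to CSV on S3.
# "s3://my-bucket/timestream-export/data.csv" is a placeholder path.
df = wr.timestream.query(query_str)
wr.s3.to_csv(df=df, path="s3://my-bucket/timestream-export/data.csv", index=False)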

kukushking commented 2 years ago

@NakedKoala If you can't work with partial results, try retrieving the data page by page and then pd.concat the pages. Hundreds of MBs should still be OK with the API; that is only my hypothesis regarding the limits (also, it's the row size that is limited to 1 MB, not the entire response size).
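
Roughly like this (a sketch using your query string; the PageSize of 500 is just an example value):

import awswrangler as wr
import pandas as pd

# Retrieve the result page by page (chunked=True yields one DataFrame per page)
# and concatenate everything into a single DataFrame at the end.
frames = wr.timestream.query(query_str, chunked=True, pagination_config={"PageSize": 500})
df = pd.concat(frames, ignore_index=True)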