alpacahq / alpaca-py

The Official Python SDK for Alpaca API
https://alpaca.markets/sdks/python/getting_started.html
Apache License 2.0
605 stars 150 forks

Client should *not* un-paginate large results. Should return a `generator` that does this for you. #433

Open tnixon opened 7 months ago

tnixon commented 7 months ago

Is there an existing issue for this?

Is your feature request related to a problem? Please describe.

When fetching historical data, even simple queries (fetching all trades / quotes for a single symbol on a single day) can have very large result sets, which the API paginates. The data client attempts to un-paginate these and load them all into a single return structure. This is very slow (probably the main cause of #204). It also means that the consumer of the client has no choice but to let this process run (single-threaded) until it completes, or potentially fails with an out-of-memory (OOM) error or similar.

Describe the solution you'd like.

The client should return a data structure that gives easy access to the paginated results, without actually loading them. The consumer can then decide how to access these results - possibly by looping through them in a single-threaded manner, but potentially also by parallelizing this data-loading to make it more efficient. It would also give the consumer the option of serializing each page of results and so avoid the OOM issue of building very large data structures in memory.

A Python generator seems a natural way to provide this functionality. The client can return an object that contains a generator which will (when accessed) fetch the appropriate pages of data from the API. This might look something like:

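from alpaca.data.historical import StockHistoricalDataClient
from alpaca.data.requests import StockTradesRequest

client = StockHistoricalDataClient(...)

trades_request = StockTradesRequest(symbol_or_symbols='NVDA')
# get_stock_trades_resultset is the proposed method; it does not exist yet
trades_resultset = client.get_stock_trades_resultset(trades_request)

for page_data in trades_resultset:
    # do something with the data (summarize it, serialize it, etc.)
    ...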
Note: here I'm assuming that `trades_resultset` is a generator.
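To make the generator idea more concrete, here is a minimal, self-contained sketch. It does not call the Alpaca API: `fetch_page` is a hypothetical stand-in for a single paginated request that returns one page of records plus a next-page token, `iter_pages` is an illustrative name rather than an existing alpaca-py function, and the records are dummy values. The point it shows is that no page is fetched until the consumer asks for it.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Page = List[Dict]
FetchPage = Callable[[Optional[str]], Tuple[Page, Optional[str]]]

# Hypothetical in-memory "API": maps a page token to (records, next_token).
# A real implementation would issue one HTTP request per page instead.
FAKE_PAGES = {
    None: ([{"id": 1}, {"id": 2}], "t1"),
    "t1": ([{"id": 3}, {"id": 4}], "t2"),
    "t2": ([{"id": 5}], None),
}

calls: List[Optional[str]] = []  # records which tokens were actually requested


def fetch_page(token: Optional[str]) -> Tuple[Page, Optional[str]]:
    """Stand-in for one paginated request (assumption, not the real client)."""
    calls.append(token)
    return FAKE_PAGES[token]


def iter_pages(fetch: FetchPage) -> Iterator[Page]:
    """Yield one page at a time; the next page is fetched only on demand."""
    token: Optional[str] = None
    while True:
        records, token = fetch(token)
        yield records
        if token is None:  # no further pages
            return


# The consumer decides what to do with each page; nothing is accumulated here.
total = 0
for page in iter_pages(fetch_page):
    total += len(page)
```

Because `iter_pages` is a generator, the consumer can stop early, write each page to disk, or hand pages to worker threads without the client ever holding the full result set in memory.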

Describe an alternate solution.

Another way to address this is to provide a client method for fetching an individual page of results, something like:

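from alpaca.data.historical import StockHistoricalDataClient
from alpaca.data.requests import StockTradesRequest

client = StockHistoricalDataClient(...)

trades_request = StockTradesRequest(symbol_or_symbols='NVDA')
# both get_stock_trades_resultset and get_stock_trades_data are proposed methods
trades_resultset = client.get_stock_trades_resultset(trades_request)

for page in trades_resultset.pages:
    page_data = client.get_stock_trades_data(trades_request, page)
    # do something with the data (summarize it, serialize it, etc.)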

Note: in this example I'm assuming that `trades_resultset` is an object that holds an iterator over page identifiers.
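As a rough illustration of the page-at-a-time alternative, the sketch below fetches one page per call and serializes it immediately, so only a single page is held in memory at any time. Everything here is an assumption made for the example: `get_page` is a hypothetical single-page fetch that returns the records and a next-page token, `dump_all_pages` is an illustrative helper, and the records are dummy placeholder values, not real market data.

```python
import io
import json
from typing import Dict, List, Optional, TextIO, Tuple

# Hypothetical stand-in for a paginated API: token -> (records, next_token).
_PAGES = {
    None: ([{"symbol": "NVDA", "price": 1.0}], "a"),
    "a": ([{"symbol": "NVDA", "price": 2.0}], None),
}


def get_page(page_token: Optional[str]) -> Tuple[List[Dict], Optional[str]]:
    """Fetch exactly one page (assumption, not an existing client method)."""
    return _PAGES[page_token]


def dump_all_pages(out: TextIO) -> int:
    """Write every record as one JSON line; return the number written."""
    token: Optional[str] = None
    written = 0
    while True:
        records, token = get_page(token)
        for record in records:
            out.write(json.dumps(record) + "\n")
            written += 1
        if token is None:  # last page reached
            return written


# An in-memory buffer keeps the example self-contained; a file works the same.
buffer = io.StringIO()
count = dump_all_pages(buffer)
```

With an explicit per-page call like this, the consumer controls the loop and could also checkpoint the last token it processed, so an interrupted download could resume instead of starting over.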

Anything else? (Additional Context)

Give the user a choice in how to handle fetching large data sets. Don't force them to wait on a single-threaded and potentially failure-prone process.

tnixon commented 7 months ago

PS - I am willing to prepare a PR on this (as soon as I can carve out some time).