Unified property for the type of pagination per endpoint or exchange

xmatthias commented 4 years ago

OS: linux
Programming Language version: python
CCXT version: 1.18.1060
Exchange: binance / kraken ...
Method: fetchTrades, others...

inspired by https://github.com/ccxt/ccxt/issues/5683#issuecomment-521753472 (but i think this deserves its own issue, and i could not find any place where this was being worked on).

Problem

The current way of pagination is very inconsistent / not well unified (which is also documented...). It's up to the user to detect which methods for which exchanges require date-based (since=) pagination, and which require id-based pagination.

Ideally, users could supply an argument next_page=True, which would handle pagination internally, but i don't think that's possible, since it would require keeping the last pagination id per pair(?).

Assuming the above is not possible, an attribute (without going to the exchange's api documentation) expressing which method needs to be used for which exchange (and endpoint) would be highly appreciated.

As the documentation states: 'from_id': from_id, # exchange-specific non-unified parameter name

Possible solution

Unifying this should be "as easy" (it probably isn't, i've never done such unifying) as adding a from_id parameter to the header of the method, which defaults to None and is not used in case of date-based pagination, but is used for both ID and cursor based pagination.

# Current function header
    def fetch_trades(self, symbol, since=None, limit=None, params={}):
# possible future function header
    def fetch_trades(self, symbol, since=None, limit=None, from_id=None, params={}):

This would move "defining" the correct additional parameter to ccxt instead of the users.

Combined with something like exchange.describe()['options']['fetchTradesType'] == "page" and exchange.describe()['options']['fetchTradesPageStart'] == "0" (probably not the location you'd like to have this) - should allow flexible usage of this method.

Sample of the problem:

import ccxt
from datetime import datetime, timedelta, timezone
ct = ccxt.binance()

pair = 'ETH/BTC'
since = datetime.now(tz=timezone.utc) - timedelta(hours=2)
until = int(datetime.now().timestamp() * 1000)

trades = []
since_ms = int(since.timestamp() * 1000)
# Initial call - made using since to get the initial ID
t = ct.fetch_trades(pair, since=since_ms)
from_id = t[-1]['id']
trades.extend(t)
while True:
    t = ct.fetch_trades(pair, params={'fromId':from_id}, limit=1000)
    # For kraken use the following:
    #t = ct.fetch_trades(pair, params={'since':from_id}, limit=1000)

    if len(t):
        from_id = t[-1]['id']
        trades.extend(t)
        print(from_id, len(t))
        if until and t[-1]['timestamp'] > until:
            break
    else:
        break

Problematic part in my eyes:

 t = kraken.fetch_trades(pair, params={'since':from_id}, limit=1000)
 t = binance.fetch_trades(pair, params={'fromId':from_id}, limit=1000)
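Until such a property exists, the difference can be kept in a small user-side lookup table. The sketch below is only an illustration: `PAGINATION_PARAM` and `trades_page_params` are hypothetical names, not part of ccxt, and the table contains just the two exchanges shown in this issue.

```python
# Hypothetical user-side table (not a ccxt API): which non-unified parameter
# carries the pagination cursor for fetch_trades on each exchange.
PAGINATION_PARAM = {
    'binance': 'fromId',  # id-based pagination
    'kraken': 'since',    # cursor passed through params['since']
}


def trades_page_params(exchange_id, cursor):
    """Build the `params` dict for the next fetch_trades page."""
    try:
        name = PAGINATION_PARAM[exchange_id]
    except KeyError:
        raise ValueError(f'no pagination info for {exchange_id!r}') from None
    return {name: cursor}
```

With this in place, the loop body could read `t = ct.fetch_trades(pair, params=trades_page_params(ct.id, from_id), limit=1000)` for either exchange.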

kroitor commented 4 years ago

@xmatthias in short, we are aware and we totally agree with you. Because there are exotic exchanges that do both id-based and time-based pagination depending on the endpoint, this has to be defined either as an exchange-wide property or as metainfo per each unified method. Your suggestions on a proper unification scheme are welcome )

Unifying this should be "as easy" (it probably isn't, i've never done such unifying) as adding a from_id parameter to the header of the method.

Unfortunately, this is not as easy, because that won't work with other languages. But we agree in general. We will address this issue in the near future, hopefully.

eabrouwer3 commented 4 years ago

There are generators in JS, PHP and Python. Could we use that somehow? Each time we do the equivalent of next(get_trades()), it goes and gets the next "page". Or, if it's symbol based, the next "symbol". Whatever it does, it just keeps returning another list/array with a yield command. Or it loops through everything that came back and yields the next one. That way it's easier to use in a loop/map/filter/reduce thing. Thoughts on this? Or something similar? That way it's also lazy, so it'll only get the next one if you need it/ask for it. This is a very high level idea, but I hope it makes sense.
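A minimal Python sketch of that idea, under stated assumptions: `iter_trade_pages` and the `fetch_page` callback are hypothetical names (nothing like this exists in ccxt), and the cursor is assumed to be the id of the last trade received.

```python
def iter_trade_pages(fetch_page, first_cursor=None):
    """Lazily yield pages of trades.

    `fetch_page(cursor)` is a user-supplied callable (hypothetical) that
    returns a list of trade dicts starting at `cursor`.
    """
    cursor = first_cursor
    while True:
        page = fetch_page(cursor)
        if not page:
            return  # the exchange has nothing more to give
        last_id = page[-1]['id']
        if last_id == cursor:
            return  # only the cursor trade came back: no progress
        yield page
        cursor = last_id
```

For Binance, `fetch_page` could be `lambda cursor: ct.fetch_trades(pair, params={'fromId': cursor} if cursor else {}, limit=1000)`. Nothing is requested until the caller asks for the next page.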

kroitor commented 4 years ago

@eabrouwer3

Could we use that some how?

Yes, however, this particular issue is more about unifying the data format that would be used to build more complex traversing algorithms on top of that data. So, the question is less about building the generators themselves and more about unifying the properties for all types of pagination, including limits, maximums, minimums, date-based pagination, id-based pagination, restrictions on how far back into the past you can go, etc, etc. Due to the differences between exchanges, the unification for the pagination metadata (for all methods) is not as easy as it seems at first, but we think we will be able to roll out a good proposal soon. Thx!

eabrouwer3 commented 4 years ago

Ahhh. I see. Thanks @kroitor.

YuriyTigiev commented 4 years ago

I have a question regarding a sample from this post

should we add one to the from_id before using it?

Old:

from_id = t[-1]['id']
t = ct.fetch_trades(pair, params={'fromId':from_id}, limit=1000)

New:

from_id = t[-1]['id'] + 1
t = ct.fetch_trades(pair, params={'fromId':from_id}, limit=1000)

kroitor commented 4 years ago

@YuriyTigiev in general the id of a trade is a string (it can take any form like '123456789' or 'abcdef-foo-bar'), so we can't do arithmetic with it. Instead, we should set the "from-id" to the last received id, and then filter out duplicates by id.
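A sketch of that approach, treating ids as opaque strings. `fetch_all_trades`, `fetch_page` and `max_pages` are hypothetical names introduced for illustration only.

```python
def fetch_all_trades(fetch_page, first_cursor=None, max_pages=1000):
    """Page by the last received id and drop duplicates by id.

    `fetch_page(cursor)` is a user-supplied callable (hypothetical) that
    returns a list of trade dicts; ids are never used in arithmetic.
    """
    seen = set()
    trades = []
    cursor = first_cursor
    for _ in range(max_pages):  # safety cap against endless loops
        page = fetch_page(cursor)
        added = 0
        for trade in page:
            if trade['id'] not in seen:
                seen.add(trade['id'])
                trades.append(trade)
                added += 1
        if added == 0:
            break  # empty page or nothing but duplicates
        cursor = page[-1]['id']
    return trades
```

Because the last id is reused as the next cursor, the boundary trade may come back twice; the `seen` set removes it without assuming anything about the id format.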

YuriyTigiev commented 4 years ago

I'm sorting all trades by id (fromId) to analyze historical data step by step. If fromId is not a number, how can I sort all trades in the right historical order? The timestamp is not unique.

kroitor commented 4 years ago

@YuriyTigiev in some cases you can sort by timestamp+id if you know for sure that the ids are numeric. However, that will be exchange-specific since it won't work for the exchanges that use hashes as ids. In a general case, you should sort by timestamp even if it is not unique. In some cases you may need to look into the info of every trade for more clues on the ordering.
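Both variants can be sketched as follows. `sort_trades` and its `numeric_ids` flag are hypothetical names; whether the ids of a given exchange are numeric is something the caller has to know.

```python
def sort_trades(trades, numeric_ids=False):
    """Sort trades chronologically.

    With numeric_ids=True the id breaks timestamp ties (exchange-specific).
    Otherwise only the timestamp is used; Python's sort is stable, so trades
    with equal timestamps keep the order in which they were received.
    """
    if numeric_ids:
        return sorted(trades, key=lambda t: (t['timestamp'], int(t['id'])))
    return sorted(trades, key=lambda t: t['timestamp'])
```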

YuriyTigiev commented 4 years ago

@YuriyTigiev in some cases you can sort by timestamp+id if you know for sure that the ids are numeric. However, that will be exchange-specific since it won't work for the exchanges that use hashes as ids. In a general case, you should sort by timestamp even if it is not unique. In some cases you may need to look into the info of every trade for more clues on the ordering.

In cases when fromId is a hash we can't use it for pagination (t = ct.fetch_trades(pair, params={'fromId':from_id}, limit=1000)) because order can be incorrect.

kroitor commented 4 years ago

@YuriyTigiev yes, that is correct. Also, sometimes, the exchange may provide pagination hints in the fetch_trades response, which is accessible in the .last_json_response property after the call, however, that is also exchange-specific.

YuriyTigiev commented 4 years ago

How will the method work if I pass both parameters, "since" and "fromId"?

await exchange.fetchTrades(symbol = pair, since = current, params={'fromId':prev_id}, limit = limit)

kroitor commented 4 years ago

@YuriyTigiev it will send both params and will filter the results for timestamp > since.

YuriyTigiev commented 3 years ago

Why does fetchTrades for binance return a random number of records? For one pair it could return 107, 23, 1000, 2, 50.

await exchange.fetchTrades(symbol = pair, params={'fromId':id}, limit = 1000)

kroitor commented 3 years ago

@YuriyTigiev it's hard to answer without your code and verbose output.

https://github.com/ccxt/ccxt/wiki/FAQ#what-is-required-to-get-help (if you ever have any question or issue – we always ask for that info, namely the code and the verbose request/response sequence).

In general, if you're watching the most recent trades this way, you will get all new trades starting after the specified id. Because trades happen at random on the exchange (depending on the activity of the users and pairs), any number of new trades could come back. The number of new trades since your previous request varies over time; this comes naturally from the definition of free market trading.

In other words, it could be pretty much the expected behavior.

kroitor commented 3 years ago

@YuriyTigiev you may also want to look through these issues carefully:

That could shed some light on your question. Yet still we will need you to follow the FAQ and paste the code and a complete verbose output in order to investigate.

YuriyTigiev commented 3 years ago

Last question

https://github.com/binance-exchange/binance-official-api-docs/blob/master/rest-api.md#old-trade-lookup-market_data GET /api/v3/historicalTrades

How does fetch_trades calculate the first fromId based on the parameter since? The binance historicalTrades method doesn't have the parameter "since", it has the parameter fromId only.

I have copied part of code from the first post.

t = ct.fetch_trades(pair, since=int(since.timestamp() * 1000))
from_id = t[-1]['id']
trades.extend(t)
while True:
    t = ct.fetch_trades(pair, params={'fromId':from_id}, limit=1000)

kroitor commented 3 years ago

The binance historicalTrades method doesn't have the parameter "since", it has the parameter fromId only.

Exactly, and this is why...

How does fetch_trades calculate the first fromId based on the parameter since?

... it does not. If the underlying endpoint does not accept a specific parameter – that parameter is simply ignored or not sent towards the exchange. So, with the historicalTrades endpoint the since argument is ignored by the exchange. If you're using the historicalTrades endpoint, Binance returns the most recent trades or the trades with fromId. The since argument is irrelevant at the moment of your request. And upon receiving the reply with the set of trades from Binance, the CCXT library filters them by since on the user side (whatever set it received).
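The user-side filtering step can be pictured roughly like this. `filter_by_since` is written here as a standalone illustration, not as the library's actual code, and the inclusive comparison is an assumption.

```python
def filter_by_since(trades, since=None):
    """Keep only the trades at or after `since` (milliseconds).

    Illustrative only: the exchange already chose which trades to return,
    so this can shrink the result but never add the trades around `since`.
    """
    if since is None:
        return list(trades)
    return [t for t in trades if t['timestamp'] >= since]
```

This is why a `since` far in the past combined with an endpoint that ignores it still yields only recent trades: the filter runs on whatever set was received.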

However, that is just half of the story. If you've read the above links carefully, you've probably noticed that Binance provides more than one endpoint for public trades:

trades – https://binance-docs.github.io/apidocs/spot/en/#recent-trades-list
historicalTrades – https://binance-docs.github.io/apidocs/spot/en/#old-trade-lookup
aggTrades – https://binance-docs.github.io/apidocs/spot/en/#compressed-aggregate-trades-list

The aggTrades is the default endpoint in CCXT. But you can choose which of the three endpoints you want to use and configure that with the exchange-specific option named exchange.options['fetchTradesMethod'], as shown here:

https://github.com/ccxt/ccxt/blob/master/js/binance.js#L310

For example:

import ccxt

exchange = ccxt.binance({
    'enableRateLimit': True,
    'options': {
        'fetchTradesMethod': 'publicGetHistoricalTrades',  # or publicGetTrades or publicGetAggTrades (default)
    }
})

# your code here...

Configuring the exchange-specific options is documented in the CCXT Manual:

So, depending on which endpoint you choose, this or that argument or parameter is used to paginate over trades according to Binance API docs, as linked above.

Let me know if that does not answer your question.

YuriyTigiev commented 3 years ago

Hi,

I had a problem with downloading historical data from Binance when copying data day by day. I used fetchTrades and the parameter "since". In this case, the function works with low accuracy and can skip data for a period. I wrote my own function, FindNearestFromId, which uses the parameter "since" to find the nearest fromId satisfying the condition "found timestamp" >= "parameter since".


import ccxt

API_KEY = ''
SECRET_KEY = ''

exchange_class = getattr(ccxt, 'binance')
exchange = exchange_class({
    'apiKey': API_KEY,
    'secret': SECRET_KEY,
    'timeout': 30000,
    'options': {'defaultType': 'spot'},
    'enableRateLimit': True
})

dt = '2020-06-02T10:14:15.568Z'
since = exchange.parse8601(dt)
pair = 'ETH/BTC'

def FindNearestFromId(pair, since):
    # Binary search over trade ids for an id whose timestamp is near `since`.
    # Assumes numeric ids that increase with time.

    # The oldest available trade and the most recent trade bound the search
    s = exchange.fetch_trades(pair, params={'fromId': '1'}, limit=1)
    e = exchange.fetch_trades(pair, limit=1)

    sts = int(s[0]['timestamp'])
    ets = int(e[0]['timestamp'])

    sid = int(s[0]['id'])
    eid = int(e[0]['id'])

    # `since` lies outside the available history
    if not (sts <= since <= ets):
        return None

    while True:

        # The bounds have met or are adjacent. The original three branches
        # returned sid, sid + 1 or eid, all of which equal eid at this point.
        if sid == eid or sid == eid - 1:
            return eid

        # Probe the trade in the middle of the current id range
        cid = (eid + sid) // 2
        c = exchange.fetch_trades(pair, params={'fromId': f'{cid}'}, limit=1)

        cts = int(c[0]['timestamp'])
        cid = int(c[0]['id'])

        # Keep the half of the range that still contains `since`
        if (sts < since <= cts) or (sts <= since < cts):
            eid = cid
            ets = cts
        elif (cts < since <= ets) or (cts <= since < ets):
            sid = cid
            sts = cts

fromId = FindNearestFromId(pair, since)
f0 = exchange.fetch_trades(pair, params={'fromId':f'{fromId}'}, limit=1)
f1 = exchange.fetch_trades(pair, params={'fromId':f'{fromId-1}'}, limit=1)
f2 = exchange.fetch_trades(pair, params={'fromId':f'{fromId+1}'}, limit=1)

original =  exchange.fetch_trades(pair, since=since, limit=1)

print(f"source:\t\t{dt}, {since}")
print(f"found:\t\t{f0[0]['datetime']}, {f0[0]['timestamp']}, delta(ts) = {f0[0]['timestamp'] - since}")
print(f"found-1:\t{f1[0]['datetime']}, {f1[0]['timestamp']}, delta(ts) = {f1[0]['timestamp'] - since}")
print(f"found+1:\t{f2[0]['datetime']}, {f2[0]['timestamp']}, delta(ts) = {f2[0]['timestamp'] - since}")
print(f"original:\t{original[0]['datetime']}, {original[0]['timestamp']}, delta(ts) = {original[0]['timestamp'] - since}")

Result:

source:     2020-06-02T10:14:15.568Z, 1591092855568
found:      2020-06-02T10:14:15.636Z, 1591092855636, delta(ts) = 68
found-1:    2020-06-02T10:14:15.533Z, 1591092855533, delta(ts) = -35
found+1:    2020-06-02T10:14:15.742Z, 1591092855742, delta(ts) = 174
original:   2020-06-02T11:14:15.292Z, 1591096455292, delta(ts) = 3599724

kroitor commented 3 years ago

@YuriyTigiev you might also want to check these examples with deduplication by id. Instead of fetching by the last timestamp, you can fetch (time window / 2) and then drop the duplicates, which may be easier to handle and implement:

(It's a different exchange, but the concept is similar across all exchanges)
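One possible reading of that suggestion, as a rough sketch: advance `since` by only half of the time span covered by the previous page, so that consecutive pages overlap, and drop the repeated trades by id. `fetch_overlapping`, `fetch_since` and the half-window step are assumptions made for illustration, not code from the linked examples.

```python
def fetch_overlapping(fetch_since, since, until):
    """Collect trades in [since, until) using overlapping time windows.

    `fetch_since(since)` is a user-supplied callable (hypothetical) that
    returns trades with timestamp >= since, oldest first.
    """
    seen = set()
    trades = []
    while since < until:
        page = fetch_since(since)
        if not page:
            break
        for trade in page:
            if trade['id'] not in seen and trade['timestamp'] < until:
                seen.add(trade['id'])
                trades.append(trade)
        # Step forward by half of the covered span, at least 1 ms,
        # so the next page overlaps this one and no trade is skipped.
        covered = page[-1]['timestamp'] - since
        since += max(covered // 2, 1)
    return trades
```

The overlap costs extra requests, but it removes the need to reason about which trades share the boundary timestamp.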

YuriyTigiev commented 3 years ago

I saw the code but don't understand how those examples work. Can I use it to copy data from Binance starting from 2019-06-01 00:00:00?

kroitor commented 3 years ago

@YuriyTigiev i'll post an example for fetching the trade history from Binance as soon as I can.

YuriyTigiev commented 3 years ago

is rateLimits: 350 optimal for fetching the tradeHistory? should be - enableRateLimit = False ?

kroitor commented 3 years ago

@YuriyTigiev

is rateLimits: 350 optimal for fetching the tradeHistory?

The optimal setting depends on the exchange, since every exchange has varying rate limits for this or that endpoint.

should be - enableRateLimit = False ?

Nope, you should leave it on (True), unless you implement your own custom rate limiter.
