erdewit / ib_insync

Python sync/async framework for Interactive Brokers API
BSD 2-Clause "Simplified" License

Why is a TCP data arrival datetime set on TickData and other data types? #272

Closed goodboy closed 4 years ago

goodboy commented 4 years ago

I'm trying to understand why exactly the `ib_insync.objects.TickerData` type has its `.time` attribute set to `datetime.now()` inside the `wrapper.tcpDataProcessed()` callback. I realize this time is captured when the data arrives (before processing) and is used for internal timeout tracking, but why is it then also set on the ticker after processing?

Why is it a `datetime.datetime` type at all? The internal timeout tracking converts it to seconds for use anyway.

Reasons this seems not useful:

- Converting the `datetime` back to an epoch float roughly doubles the per-call cost:

        [ins] In [10]: timeit datetime.datetime.now().timestamp()
        911 ns ± 42.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

        [ins] In [11]: timeit datetime.datetime.now()
        448 ns ± 14.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

- There is already a variety of `genericTickType`s that can be chosen which make IB provide the timestamp of the actual `TickData` fill or book state change at the source exchange:
  - [RT Volume](https://interactivebrokers.github.io/tws-api/tick_types.html#rt_volume) provides timestamps of the actual book state changes in ms
  - *Last Timestamp*: time of the last trade (in UNIX time)
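A sketch of what consuming the RT Volume generic tick looks like. The semicolon-separated field layout (with a millisecond exchange-side timestamp) follows IB's tick-types documentation; the sample string and the `parse_rt_volume` helper name are illustrative, not ib_insync code:

```python
def parse_rt_volume(raw: str) -> dict:
    # RT Volume fields: last price; last size; time (ms since epoch);
    # total volume; VWAP; single-trade flag
    price, size, t_ms, volume, vwap, single = raw.split(';')
    return {
        'price': float(price) if price else None,
        'size': int(size) if size else 0,
        'time_ms': int(t_ms),   # exchange-side timestamp, not TCP arrival time
        'volume': int(volume),
        'vwap': float(vwap),
        'single_trade': single == 'true',
    }

sample = '701.28;1;1348075471534;67854;701.46918464;true'
tick = parse_rt_volume(sample)
```

The point being: the timestamp a strategy actually wants is already in the feed, stamped at the source.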

As far as I can tell the current timestamp is mostly useless information from a `TickData` perspective, since likely no trading system cares about when `ib_insync` receives data over the local TCP socket (at least one would hope :wink:). Even if a user needed this stamp they can easily capture it in their own code, which is probably more useful for latency measurements anyway.

I propose that this `Wrapper.lastTime` stamp be converted into a simple `time.time()` (which would make it strictly less work to do the [internal time keeping anyway](https://github.com/erdewit/ib_insync/blob/81f9f6c34e6f9632ba1ea16864f6774975986109/ib_insync/wrapper.py#L190)), **and** that (one of) the generic tick types above always be subscribed to and appended to `TickData` types and friends (or equivalents where possible), if a timestamp is a default attribute that you want to keep on the type.
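For reference, the comparison this proposal rests on can be reproduced with the stdlib `timeit` module (absolute numbers will vary by machine):

```python
import timeit

n = 100_000
t_time = timeit.timeit('time.time()', setup='import time', number=n)
t_now = timeit.timeit('datetime.datetime.now()', setup='import datetime', number=n)
t_ts = timeit.timeit('datetime.datetime.now().timestamp()',
                     setup='import datetime', number=n)

# on CPython, time.time() is typically the cheapest of the three and
# now().timestamp() the most expensive, since the datetime round-trip
# does strictly more work than returning the float directly
print(f'time.time(): {t_time:.3f}s  now(): {t_now:.3f}s  now().timestamp(): {t_ts:.3f}s')
```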

drusakov778 commented 4 years ago

hi @goodboy , as always it is a good and very informative read. I think I am learning a lot from your approach.

If I can be a proponent for the ib_insync approach here, I would highlight the following points:

Thank you again for your comments. Hopefully we do not distract @erdewit too much with a discussion here.

goodboy commented 4 years ago

> Contrary to what you are saying, I do see it important to know exactly when the tick arrives to my system. As this is the only practical time for me and not IB-time, over which I do not have control.

Right. I'm not saying don't track internal time; I'm just saying you don't need (and actually aren't using) a datetime to do it. Furthermore this time is not useful (and is unexpected) to keep on `TickData` types. You can already ask the `Wrapper` directly for the last-data-received timestamp via `.lastTime` any time you'd like.

> and all this trading via IB is retail-level anyway, and not low-latency/high-freq type, so there's a tradeoff between speed and convenience

Define "low latency" :wink:. I mean, IB is one of the few platforms where you can get high-frequency-enough data for automated strat purposes. My crew uses it more than happily for low-Hz "high frequency" trading, so.

> another thought is again on backward compatibility, and changing ticker.time type can be potentially disruptive,

This is fair, but I would be fairly surprised to see this field used for anything critical other than indexing. In that case a new field, say `api_receive_time`, could be added as a `time.time()` output and the old `.time` deprecated in a new major release. User code only needs to call `datetime.datetime.fromtimestamp()` on the new value to get the old format.
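The migration for user code would be a one-liner (`api_receive_time` is the hypothetical new field name from above):

```python
import time
import datetime

api_receive_time = time.time()   # what the hypothetical new float field would hold

# user code that wants the old .time datetime format just converts at the edge:
as_datetime = datetime.datetime.fromtimestamp(api_receive_time)
```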

> so there's a tradeoff between speed and convenience

For me, you could probably make ib_insync's core TCP loop quite performant if it was simplified and run on uvloop (I know this for a fact having built such production grade systems before). If the core was simpler and more about streaming low level data then you can make this project get that much closer to being seriously considered for higher frequency strats.

Imo the core could be kept simple and built for streaming, and then all the extra syntactic sugar can be added in higher-level APIs that a user can opt into. I don't think your low-level interfaces should add unnecessary (and, in the case of using datetimes, unconventional) latency. The same goes for the dom tracking stuff as per #270.

drusakov778 commented 4 years ago

agree with your views @goodboy, I am becoming quite curious to know which trading team you guys are from ; )

I guess it is more a question to @erdewit where he wants to take this project, as he is basically doing it for free for the benefit of all of us.

goodboy commented 4 years ago

> As this is the only practical time for me and not IB-time, over which I do not have control.

@drusakov778 I'm also not advocating tracking any IB time at all. I'm saying API times are irrelevant to market data points. What is actually useful is exchange time. I already pointed to the relevant tick data types.

goodboy commented 4 years ago

> as he is basically doing it for free for the benefit of all of us.

Indeed, which is why I'm trying to help make it better :smile_cat:. You'll probably notice that none of the issues I've created are without technical merit :wink:

erdewit commented 4 years ago

> I propose either this Wrapper.lastTime stamp be converted into a simple time.time()

That would mean time as a float. I've considered that at the beginning of the project but decided that it is too unreadable and low-level. It also brings in the 0.1 + 0.1 + 0.1 floating point issues that complicate time arithmetic and makes the distinction between closed-open and closed-closed time intervals impossible. asyncio does use float internally and has to add a fudge factor to solve this.
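The floating point concern is easy to demonstrate: accumulating ticks as floats drifts, while the same bookkeeping in integer units stays exact.

```python
# accumulate ten 0.1-second ticks as floats
t = 0.0
for _ in range(10):
    t += 0.1
print(t == 1.0)   # False: t is 0.9999999999999999

# the same bookkeeping in integer microseconds is exact
t_us = 0
for _ in range(10):
    t_us += 100_000
print(t_us == 1_000_000)   # True
```

This is the same accumulation error asyncio's internal float clock has to paper over with a fudge factor.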

Personally, I use int64 for time and it works great in our cython-optimized evaluation and backtesting. Pandas and numpy also have int64 (ns or us since epoch). But it's too low-level for general use.

> There is latency added to the system by almost an order of magnitude if a user now wants to convert back to epoch:

Not really. The Wrapper.lastTime datetime is created only once for all data in a TCP packet. It's then used for order updates, logs, executions, live ticks, etc, which makes it very efficient and keeps things nice and causal. Note that instead of micro-benchmarking it's more useful to profile a running application and to look at the top lines; you will see no datetime-related methods there.
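The pattern described here, one timestamp per TCP packet shared by everything decoded from it, can be sketched as follows (class and method names are illustrative, not ib_insync's actual internals):

```python
import datetime

class MiniWrapper:
    def __init__(self):
        self.lastTime = None
        self.pendingTickers = []

    def tcp_data_arrived(self):
        # one datetime.now() per TCP packet, however many messages it holds
        self.lastTime = datetime.datetime.now(datetime.timezone.utc)

    def tcp_data_processed(self):
        # every ticker updated by this packet shares the same timestamp object,
        # which keeps updates cheap and mutually consistent (causal ordering)
        for ticker in self.pendingTickers:
            ticker['time'] = self.lastTime
        self.pendingTickers.clear()

w = MiniWrapper()
t1, t2 = {}, {}
w.pendingTickers = [t1, t2]
w.tcp_data_arrived()
w.tcp_data_processed()
```

Since the datetime is created once per packet rather than once per tick, its construction cost is amortized across everything in the packet.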

> For me, you could probably make ib_insync's core TCP loop quite performant if it was simplified and run on uvloop (I know this for a fact having built such production grade systems before). If the core was simpler and more about streaming low level data then you can make this project get that much closer to being seriously considered for higher frequency strats.

It runs fine with uvloop but that won't speed it up much. Uvloop starts to shine with 1000s of requests per second, not with the max 50 that IB allows. ib_insync is already vast overkill for anything that TWS or gateway can throw at it. Factor in IB's use of the Nagle algorithm and snapshotted data and the whole thing becomes moot.

So, to summarize: I'd like to keep datetimes throughout.

drusakov778 commented 4 years ago

gentlemen, thank you both for an informative discussion. I am learning a lot.

goodboy commented 4 years ago

> That would mean time as a float. I've considered that at the beginning of the project but decided that it is too unreadable and low-level. It also brings in the 0.1 + 0.1 + 0.1 floating point issues that complicate time arithmetic and makes the distinction between closed-open and closed-closed time intervals impossible.

This is fair; fp has issues, though (given the aforementioned 50 Hz restriction from IB) it's strange to be concerned about resolution on the femtosecond scale :smile:. Also, why would you be adding timestamps when it's an epoch value :shrug:?

Anyway, the stdlib has a better solution to sidestep floats, namely the new `time.time_ns()`. This would also pair well with numpy's `datetime64`/`int64`, which is what most of the systems I use rely on.
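A minimal sketch of what the `time.time_ns()` route looks like (stdlib only; on the numpy side the int would feed straight into `np.datetime64(t_ns, 'ns')`):

```python
import time
import datetime

t_ns = time.time_ns()        # integer nanoseconds since the epoch, no float involved
assert isinstance(t_ns, int)

# downsampling to coarser units is exact integer arithmetic, no rounding drift
t_us = t_ns // 1_000
t_s = t_ns // 1_000_000_000

# a datetime can still be produced at the display edge when a human wants one
dt_view = datetime.datetime.fromtimestamp(t_s, datetime.timezone.utc)
```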

> Personally, I use int64 for time and it works great in our cython-optimized evaluation and backtesting. Pandas and numpy also have int64 (ns or us since epoch). But it's too low-level for general use.

Good to know you're familiar with cython :surfer:.

Yes, my infra relies on streaming numpy arrays (and/or apache arrow equivalents), so something closer to that (or at least lower level than that) is ideal. Being forced to use the stdlib's `datetime` is also somewhat clunky when there are better 3rd party libraries like arrow (note: not apache arrow) or pendulum. IMO users should be able to choose whatever datetime type they prefer instead of one being baked into the core. This also goes back to my point about serialization.

> Not really. The Wrapper.lastTime datetime is created only once for all data in a TCP packet.

Yes, but every time a user wants to get the epoch, that is where the cost is. Either way this point matters less than serialization: primitive data types should not have embedded language-specific subtypes.

> It's then used for order updates, logs, executions, live ticks, etc, which makes it very efficient and keeps things nice and causal.

Yeah, I honestly can't find these use cases in the code base, and especially not in a way that requires a datetime specifically. Maybe I'm missing something?

> It runs fine with uvloop but that won't speed it up much. Uvloop starts to shine with 1000s of requests per second, not with the max 50 that IB allows.

Very fair. I was actually going to suggest instead cythonizing the protocol parser to squeeze some more latency out of the ib_insync pipeline.

erdewit commented 4 years ago

> I was actually going to suggest instead cythonizing the protocol parser to squeeze some more latency out of the ib_insync pipeline.

Why don't you try PyPy? It has Python 3.6 support and ib_insync runs great with it. Perhaps it can resolve your obsession with how long it takes to call `.timestamp()` on a datetime:

With regular Python:

import datetime as dt
now = dt.datetime.now(dt.timezone.utc)

%timeit now.timestamp()
228 ns ± 1.26 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

With PyPy:

%timeit now.timestamp()

17.9 ns ± 0.0283 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

With a custom function under PyPy:

ord0 = dt.datetime.fromtimestamp(0).toordinal()

def timestamp(d):
    return (d.toordinal() - ord0) * 86400 + d.hour * 3600 + d.minute * 60 + \
        d.second + d.microsecond * 1e-6

%timeit timestamp(now)

5.7 ns ± 0.0277 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

That's about 22 CPU cycles on my laptop.
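One caveat with the snippet above: `ord0` is derived from `fromtimestamp(0)`, which is a local date, and the function ignores `tzinfo`, so it bakes in the machine's UTC offset. Pinning `ord0` to the UTC epoch makes it agree with `.timestamp()` for aware-UTC datetimes like the `now` used here (a sketch, not a drop-in replacement):

```python
import datetime as dt

ORD0 = dt.datetime(1970, 1, 1).toordinal()   # proleptic ordinal of the UTC epoch

def timestamp_utc(d):
    # assumes d is timezone-aware UTC, e.g. dt.datetime.now(dt.timezone.utc)
    return ((d.toordinal() - ORD0) * 86400 + d.hour * 3600
            + d.minute * 60 + d.second + d.microsecond * 1e-6)

now = dt.datetime.now(dt.timezone.utc)
```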

PyPy is also really great for speeding up the asyncio event loop. It helps slightly but not much with decoding though; that is mostly parsing numbers, which is slow in any language. For fun, try parsing IB's favourite floating point number:

Python:

%timeit float('1.7976931348623157e+308')

426 ns ± 2.78 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

PyPy:

%timeit float('1.7976931348623157e+308')

391 ns ± 1.11 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
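For what it's worth, that "favourite" number is DBL_MAX, which the IB API uses as its unset-value sentinel, so a decoder can skip `float()` entirely for this very common case with a plain string compare (a hypothetical fast path, not ib_insync code):

```python
import sys

UNSET_DOUBLE = sys.float_info.max   # 1.7976931348623157e+308, IB's "no value" sentinel

def parse_ib_double(raw: str) -> float:
    # exact string match avoids the slow float() parse for the sentinel
    if raw == '1.7976931348623157e+308':
        return UNSET_DOUBLE
    return float(raw) if raw else 0.0
```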