ctubio / tribeca

Self-hosted crypto trading bot (automated high frequency market making) in node.js, angular, typescript and c++

Bug Bitfinex: Min Tick and Restart on markets #84

Closed Camille92 closed 7 years ago

Camille92 commented 7 years ago

Hello Carles,

I think there is a little bug with Bitfinex: Tribeca does not always choose the correct min tick for a market and consequently restarts from time to time.

When the restart happens, the min tick may be corrected or the situation may get worse (I've seen both cases).

I think it is due to the way the min tick is determined: if the price happens to be "round" when the bot starts, it will use that rounded price to derive the min tick.

The LTC/USD min tick is 0.001, but if the price is at 25.1 when the bot starts, for instance, it will only count 0.1 as the min tick.

At least that's my theory! Maybe a solution is to make sure there are always 5 digits in the price, or something like that :)
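
To illustrate the suspected failure mode, here is a small hypothetical sketch (not the actual gateway code) of what deriving the min tick from the decimals visible in the ticker price would do:

```typescript
// Hypothetical illustration only: guessing the min tick from however many
// decimals the ticker price happens to show at boot.
function naiveMinTick(lastPrice: string): number {
  const decimals = (lastPrice.split('.')[1] || '').length;
  return Math.pow(10, -decimals);
}

console.log(naiveMinTick('25.123')); // 0.001 -> correct for LTC/USD
console.log(naiveMinTick('25.1'));   // 0.1   -> wrong: the trailing zeros were dropped
```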

Screenshot 1: LTC/USD Exchange with wrong min tick


Screenshot 2: DASH/USD, ETC/USD, LTC/USD have been restarted by forever in the last 10 minutes.

Camille92 commented 7 years ago

Update: it can happen with BTC-denominated markets as well!

Camille92 commented 7 years ago

I've just seen this as well, so I'm posting it here too.

Tribeca gets disconnected from the exchange (maybe because of too many requests to the API?).


beegmon commented 7 years ago

Per my understanding, Tribeca uses the Websocket API to talk to the Bitfinex exchange. As far as I can tell, there is nothing in their Websocket API docs about request limits for Websocket connections. There are request limits on the REST side (90 requests per second, per IP, per account), but Tribeca doesn't use REST with Bitfinex; it appears to use the v2 Websocket API instead.

That still doesn't mean Bitfinex doesn't have some undocumented limit on the Websocket side of things, though.

If you're running multiple instances against Bitfinex, I would try the following to help narrow it down.

1) Start with 1 instance and work the number of instances up until you see a disconnect on one of the Tribeca instances. Then back it down by 1 instance and see if you continue getting disconnects. You may be hitting an undocumented limit on the number of connections or requests/s on the Websocket side.

2) Check the network stats on the VPS/VM/server/computer you are running Tribeca on. See if there is a significant rise in packet loss, packet resends, buffer growth, or latency. Running too many instances over a single network interface that has to deal with a high number of packets per second can cause issues. You can either tune the network stack to be more efficient and handle more traffic (increase buffers, reduce timeouts, change kernel timing/IO schedulers, etc.), or add another network interface and split the traffic over multiple interfaces to keep things from getting overloaded.

3) It may be that the Bitfinex Websocket API endpoint is just kind of flaky. Tribeca appears to use the v2 version of the Websocket API, and it's still marked as beta, so there could be issues that crop up from time to time with the v2 endpoint.

4) If you really want to dig in, add some logging around the Websocket connection messages in the gateway code for Bitfinex (see the sketch below). You might be able to better pinpoint whether the issue is on the Tribeca side or the exchange side; just don't run that logging all the time, as it will likely slow down Tribeca significantly. You should be looking for error codes from Bitfinex -- they are described here -- http://docs.bitfinex.com/v2/docs/ws-general .

Also keep an eye out for socket disconnects. These happen a lot on every exchange, and you may just be taking longer than expected to reconnect after the socket has been closed on you.
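
Something like this standalone sketch is what I mean by logging around the Websocket connection -- it uses the ws package against the public v2 endpoint and just timestamps the connection events; the actual gateway code would need its own adaptation:

```typescript
// Standalone sketch (not the actual gateway code): timestamp every connection
// event coming from the Bitfinex v2 Websocket endpoint.
import WebSocket = require('ws');

const ws = new WebSocket('wss://api.bitfinex.com/ws/2');

ws.on('open', () => console.log(new Date().toISOString(), 'ws open'));
ws.on('close', (code, reason) => console.log(new Date().toISOString(), 'ws close', code, reason));
ws.on('error', (err) => console.error(new Date().toISOString(), 'ws error', err));
ws.on('message', (data) => {
  const msg = JSON.parse(data.toString());
  // Bitfinex sends event objects (info/error/subscribed) alongside the channel
  // arrays; log only the event objects so the output stays readable.
  if (!Array.isArray(msg) && msg.event) console.log(new Date().toISOString(), msg);
});
```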

Camille92 commented 7 years ago

Yes, I'm running 15 instances on one VPS (2 on OkCoin and 13 on Bitfinex). Maybe that's the cause of the disconnections; I will probably have to split them up, but I thought it would be enough.

Running on AWS with 4 GB of RAM and 2 vCPUs.

Maybe it can be linked to the first issue of restarting because of the min tick and then failing to reconnect to the network :/

Thank you for your comment!

ctubio commented 7 years ago

i found and already fixed a situation where bitfinex reported price values ending with 0 and we were calculating a wrong minTick.

let me try to track down this new situation (including having so many instances that some fail to read the minTick) where the Bitfinex gateway also fails to correctly calculate the minTick.

ctubio commented 7 years ago

as a note: the Bitfinex gateway uses Websocket v2 for market data and order actions/events, but uses HTTP calls for wallet positions every 15 seconds, and there is also an HTTP call for the minTick once at boot.

beegmon commented 7 years ago

WARNING SUPER LONG BUT PROBABLY WORTH A READ IF YOU ARE SEEKING MAXIMAL PERFORMANCE

What instance type are you running (t2.medium?)

There are a couple of things to keep in mind when selecting an AWS instance type for Tribeca, or any low-latency + high-throughput or distributed application, IMHO.

1) vCPU count (but more importantly, how many effective CPU cycles you actually get for a given instance class)
2) RAM
3) Network connectivity
4) Disk IOPS

If you are using a t2.medium (or any type of T instance) I can see a couple of issues with that.

1) T instance types are highly fractional in terms of the CPU cycles they provide. To think about it another way, it's a way for AWS to sell the crumbs left over from the larger hunk of CPU bread that larger instance types don't use. T instance types also run on highly oversubscribed hardware, which means there is tons of competition for CPU cycle "crumbs". They are cheap, but in reality you rarely get what you pay for in terms of CPU cycles that you can actually use.

You can check this yourself by watching the steal (st) column in top or vmstat (see the sketch after this list).

What is the steal (st) percentage? To put it simply, it is the percentage of time that the kernel had work to do, but CPU cycles weren't available for it: they were stolen from the scheduler by another process running on that CPU.

Remember, the kernel thinks it is in 100% control of the CPU(s) it sees. But that is not really the case: other processes, which the kernel cannot see and thus cannot schedule around, are running on the same CPU. So when the kernel tries to submit work to the CPU, it gets told "no, I am working on something else that you don't know about". The kernel sees this as CPU cycles being stolen from it, because it assumes it has full control of all the cycles available on that CPU. It then tries to schedule the work again, and it is either accepted or told "no" again.

In general, T-type instances, when under any sort of load, will see a very high st percentage, meaning they have a lot of work to do but no CPU cycles to do it with. Once an instance continually sees 10% to 20% (sometimes as low as 8%, depending on the workload), performance suffers, and the system begins to micro-pause as it attempts to reschedule the processes it manages around the stolen CPU cycles it thought it had control over. In short, T2 types are great for POCs, testing, running SSH bastion hosts, or things that don't require much CPU or don't need very tight timing in their processes. Tribeca needs very tight timing and a stable supply of CPU cycles to function at its best.

2) While Tribeca doesn't eat a ton of RAM, it is always helpful to ensure there is plenty of it, because RAM is used not only by Tribeca itself but also by mongo and the system/network buffers. Having more RAM means that Tribeca will always have enough, mongo can keep more in RAM (which means faster reads/writes for hot data), and the system will have lots of memory for disk/network buffers and caches.

The T instance class has some types available with 8 GB of RAM; I would still recommend against them due to their generally high CPU steal percentage and their network performance, which I will get into next.

3) T instance types have network connectivity rated at low to moderate. You can think of these ratings as akin to 10 Mbps or 100 Mbps. However, network interfaces in many AWS instance types are also fractional and shared across multiple instances, and your rating dictates not only the max throughput/min latency of your network interface but also its priority compared to other instances on the hardware that are attempting to use it. That means that if your T-type instance is co-located with a noisy neighbor that has a higher priority than you (a better network rating), it can consume enough of the shared bandwidth or network interface "air-time" to affect the latency and throughput of your own instance.

A network rating of low or moderate is not what you want when trying to talk to an exchange as fast as possible while sending/receiving tens of thousands of packets per second. You will quickly find latency and throughput suffer, leading to bad connectivity and closed sockets. Additionally, because the network interfaces are potentially contended between neighbors on some instance types (like T), you can also run into situations where you are waiting on IO on the network interface. This compounds the CPU steal issue above, or eats up more RAM to buffer packets, which you might not have if you only have 4 GB of RAM total for the system.

4) Finally, disk should be considered as well. While Tribeca doesn't do a ton of writes to mongo, when mongo does flush to disk it can cause a pause in accepting more writes. Also, if you are running tight on RAM, mongo may have to fetch more from disk to serve a request. You want these fetches to be as fast as possible because Tribeca will be waiting for that data to come back.

EBS-backed instances, like the T type, can be spun up with provisioned IOPS for their EBS volumes, or with larger disks (1 TB) that provide more baseline IOPS. If you are going with an EBS volume, it would be wise to either use provisioned IOPS or a larger disk for more baseline IOPS.

However, you must also keep in mind that many cheaper instances have no dedicated bandwidth for EBS. That means even if you use provisioned IOPS on their EBS volume, there is no promise that you will get good throughput on the volume if there is a lot of data to be written to it. This doesn't affect Tribeca directly (since its writes are low), but it can affect system performance if swap is enabled, it is located on the same EBS volume, and RAM is limited. That causes thrashing on the EBS volume, which is limited in bandwidth (even if the IOPS are high), and once again this compounds the issues with CPU steal and the lack of RAM to buffer things.
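
As mentioned under point 1, here is a rough Node/TypeScript way to watch the steal percentage (Linux only; it just diffs /proc/stat the same way vmstat's st column does, assuming the standard layout where the 8th CPU field is steal time):

```typescript
// Rough steal-percentage check on Linux: sample /proc/stat twice and diff.
import * as fs from 'fs';

function cpuTimes(): number[] {
  // First line of /proc/stat: "cpu  user nice system idle iowait irq softirq steal ..."
  const line = fs.readFileSync('/proc/stat', 'utf8').split('\n')[0];
  return line.trim().split(/\s+/).slice(1).map(Number);
}

const before = cpuTimes();
setTimeout(() => {
  const after = cpuTimes();
  const delta = after.map((v, i) => v - before[i]);
  const total = delta.reduce((sum, v) => sum + v, 0);
  const steal = delta[7] || 0; // 8th field is steal time
  console.log('steal %:', (100 * steal / total).toFixed(2));
}, 1000);
```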

So with all that said here is what I run and fully recommend.

A t2.medium costs $1.13 per day (excluding EBS, network, and other charges). For $3.74 per day you can get yourself an i3.large instance type (excluding network, EBS, and other charges). This gives you lots of awesome things.

1) 2 vCPUs that are mapped to 3.5 ECUs. As opposed to the T-type instances, which are ECU-burstable only (you get the leftover CPU cycles), the i3 type has 3.5 dedicated ECUs which your instance fully controls. This improves the steal percentage significantly, and I have never seen the st % go above 0.03% even when the system is 100% utilized. In general I get two whole CPUs' worth of cycles to process things with, and the kernel rarely (if ever) is told "no". This is the direct result of having 3.5 ECUs for those 2 vCPUs: the ratio is high, resulting in more "bare metal"-like performance.

2) You have dedicated EBS bandwidth (50 MB/s). I run a 20 GB OS image on the EBS side that gives me 100 IOPS and 50 MB/s of throughput. It is never used because of point (3) below, but if the system ever got into a situation where it needed to thrash the OS disk for a bit, there is at least some headroom before hitting long IO waits on the EBS volume.

3) 15 GB of RAM. Plenty for all the Tribeca instances, the mongo RAM cache, system processes, and whatever buffers/caches you can throw at it. This also means the system rarely, if ever, runs out of RAM and never swaps (I actually have swap turned off on my instance). If everything fits into RAM you only touch disk when absolutely needed, which is made even more awesome by point (4) below.

4) A dedicated 475 GB NVMe SSD. This is SSD on steroids and it is just stupid, ridiculously fast. I format this SSD with XFS and mount it as /opt. I then have mongo store its data in /opt/mongodata and run Tribeca from /opt/tribeca. In doing so the system never waits on IO on the mongo side of things, because the disk is soooo fast.

The downside is that the NVMe is an instance-store volume. That means you can reboot the instance, but you can't snapshot it (for backups) or stop and start the instance. When the instance is stopped, it loses the volume, because it is hardware dedicated to your instance; when it is started again, your instance starts on different hardware, mapped to a different NVMe SSD. So if you care about backing up the mongo data or Tribeca, you must do so with a tarball and move it off the instance (possibly into S3), or at the very least onto the OS EBS volume.

5) Up to 10 Gbps of network connectivity and AWS ENA (Elastic Network Adapter) support. This gives you nearly the highest priority in the shared network traffic environment (only full 10 Gbps and 20 Gbps interfaces are higher priority). For example, I can push over 100K packets per second on my i3 instance without breaking a sweat. That is several orders of magnitude better than a T-type instance.

This allows you to burst a massive number of packets per second at the lowest latencies available. And since it's ENA-enabled, the OS (if configured for it) will avoid the virtual driver for the network interface and instead use SR-IOV to talk to the network interface directly.

This essentially allows the network adapter to appear as an actual hardware device to the OS, and in turn dedicates it to your use only, while offering the kernel the best view into network buffers, IO, and interface dynamics. In short, the kernel better understands the hardware, so it can make better scheduling/control decisions around using it.

You get all of that for $118 a month (excluding EBS, network, and other charges) with on-demand pricing. If you pay for a full year upfront, that drops to $72 a month. Compared to the $34-a-month cost of a t2.medium, it's a total steal given the vastly increased performance. And increased performance generally means Tribeca works better, which usually results in more profit, paying for the more expensive instance (at least in my case) while still being reasonably profitable overall.

OS selection is up to you, but I run CentOS because it is a solid server foundation, and includes other things that help increase performance/make the system more efficient as well.

CentOS comes with system performance tuning utilities that are highly useful for eking out the last little bit of performance left on the table. Red Hat Enterprise Linux (which is used in most if not all HFT trading shops) is what CentOS is based on, and CentOS benefits from the performance tuning capabilities that RHEL has available.

1) tuned -- allows setting kernel parameters for different use cases, one of which is low latency. Applying the network-latency profile with tuned on a CentOS installation adjusts things like buffers, tunes the network stack, and sets the kernel process scheduler to a pseudo "real-time" mode which enforces requirements on when processes must start/stop on the CPU. You can enable the profile by running tuned-adm profile network-latency . This persists through reboots, and I would recommend rebooting after enabling the profile for the first time.

2) numad and numactl work in combination to ensure processes run on the CPU cores closest to their respective memory (RAM) banks. You can install them on CentOS by doing yum install numad numactl

Once installed, do systemctl enable numad and then systemctl start numad

This will ensure the numad daemon manages processes and moves them to the CPU cores closest to the RAM banks that each process is using most.

You can view the NUMA stats by doing numastat . The output will look something like this:

                           node0
numa_hit               649381580
numa_miss                      0
numa_foreign                   0
interleave_hit            176265
local_node             649381580
other_node                     0

We want to see 0 for the numa_miss/numa_foreign values, as that means our processes are running on the CPU closest to the RAM they are using.

That is about as far as you can push Tribeca currently.

Running multiple instances of Tribeca technically increases the context switching that the CPUs have to do between tasks, which clears CPU caches and causes memory read delays. If you have enough CPU cores you could do a couple of things in addition to the above:

1) Turn off kernel scheduling for all CPU cores except CPU0. CPU0 will then be used for system processes, IO, IRQ balancing and the like. This is a kernel boot parameter (isolcpus).

2) Pin each instance of Tribeca to a dedicated, non-kernel-scheduled CPU core. This dedicates that CPU core to just that Tribeca instance, keeps context switching to a minimum, and reduces memory read latency. However, you need one CPU core per Tribeca instance to do this.

3) Pin the mongo process to a dedicated, non-kernel-scheduled CPU core, for the same reason as in point (2).

This is total overkill for Tribeca though, and you won't see much, if any, performance gain, because Tribeca is written in a garbage-collected, event-looped language (typescript/nodejs/js). There will always be GC pauses and event loop lag that introduce processing delays, which are unavoidable without major hackery.
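
If you want to see that event loop lag for yourself, a quick way (just an illustrative sketch, not part of Tribeca) is to schedule a timer at a fixed interval and log how late it actually fires:

```typescript
// Illustrative sketch: measure event loop lag by checking how late a
// fixed-interval timer actually fires (GC pauses and busy ticks show up here).
const intervalMs = 100;
let expected = Date.now() + intervalMs;
setInterval(() => {
  const lag = Date.now() - expected; // how late this callback fired
  if (lag > 10) console.log('event loop lag:', lag, 'ms');
  expected = Date.now() + intervalMs;
}, intervalMs);
```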

Additionally, the javascript world is so filled with "works for me" or "this can do all the things" library implementations that they are rarely, if ever, refactored for performance from a systems perspective. Instead most JS libraries out there focus on developer productivity, which is perfectly fine, just a different metric.

A non-garbage-collected language would benefit Tribeca greatly, I think, and I believe you would then see benefits from things like CPU pinning and low-level process management at the kernel level like the above.

Go would be interesting; even though it is GC'ed, it generally takes less hackiness to get it to a state where GC isn't a concern. Go's advantage is its nice threading system, but at the same time threading can actually cause more harm than good for low-latency systems due to context switches. In my opinion it would be better to build single-threaded components that can be pinned to dedicated CPUs and communicate over fast IPC or memory-mapped files, instead of attempting to thread operations across multiple CPU cores.

Java could work as well, and a lot of HFT is done with Java. Those systems nearly always run as single-threaded components on a pinned CPU core and don't GC during the trading day. Instead they have lots of RAM, let the garbage build up, and then reboot the systems after the trading day is over. That doesn't work so well for 24/7 environments like crypto markets though.

If I ever get more personal time I want to try my hand at rewriting Tribeca in Rust. It's been on my list of things to do for nearly a year, so it will probably never happen, but then again who knows.

Anyways, this was really, really long and is just my 2 sats of opinion. My day job has me doing this kind of systems tuning/profiling/performance management for huge distributed systems (think 1000s of instances) all the time, so it's nice to be able to apply some of it to something I do in my personal time as well.

Camille92 commented 7 years ago

Thank you very much @beegmon for this long and detailed message!

Before coming back to that, I'm happy to say that I have not experienced any disconnections from the market in the last few hours of running. I'll keep an eye on that.

I'll think about changing instances at some point, but as you're saying, I have to find the right instance size/type/price vs. performance gain.

I did your test and found an st value hovering between 0 and 0.3 (which is about the minimum you can expect, no?).

It's very interesting to understand how T instances are designed; I did not know about that.

I'll think about moving from Ubuntu to CentOS the next time I do a 'big' update. My focus now is on getting Tribeca working well on Bitfinex, finding the right settings for Stdev, and coming up with new ideas for trading.

Thank you for sharing :)

beegmon commented 7 years ago

That is a great steal percentage, and you are lucky if you are getting that kind of percentage on a T type, in my opinion. The newer T2 types might be a lot better and I haven't played with them much, but the old T types were always terrible for me, and I swore I would never use them for anything useful besides testing and SSH jump hosts.

All instance types will have some percentage of steal given they are running on shared hardware (unless you are running AWS dedicated hosts), but in general the higher the ratio of ECUs to vCPUs, the better things get (and the more expensive they are).

I would still be cognizant of the network ranking of the instance type you are using, though. For use cases like Tribeca I wouldn't go below High; you might be able to get away with Moderate, but Low is just asking for trouble. You may not see the trouble, since Tribeca doesn't appear to measure response-time latency internally, just compute latency, but it might be adding milliseconds or seconds to exchange calls when the network gets busy on a low-ranked instance.

You are also banking on whatever instance type you choose not being oversubscribed in general. I have seen m- and c-class instances do as badly as T-class instances in terms of steal and network performance. It all just depends on how noisy/utilized your neighbors are on the hardware you share.

Sometimes you luck out and your instance lands on hardware that is pretty idle, resulting in performance that is on par with what is promised. However, more often than not, in my experience at least, you land on hardware that is very busy or has a lot of noisy neighbors from a network perspective, and in turn your performance suffers by a good margin from what was promised.

ctubio commented 7 years ago

im not able to reproduce this issue of the wrong minTick in Bitfinex anymore, can you please let me know if you experience this again?

also it will help if you can grab and paste here the content of the bitfinex ticker just when you experience this; for example, for BTC/USD it is: https://api.bitfinex.com/v1/pubticker/btcusd and for LTC/USD it is simply: https://api.bitfinex.com/v1/pubticker/ltcusd

thanks'¡ (meanwhile i keep trying to reproduce it)

ctubio commented 7 years ago

im not able to find the bug here, so let me close this until it can be reproduced again.

the minTick calculation can be validated at https://jsfiddle.net/0dfgmsjp/1/ by using the value of last_price from the Bitfinex public ticker (see links above) as the price
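
For reference, the same kind of check can be run against a live market from Node/TypeScript; this is just an illustration of pulling last_price from the public ticker and deriving a tick from its decimals (presumably what the fiddle does, judging by the discussion below, but not its exact code):

```typescript
// Illustration only: fetch last_price from the v1 public ticker and derive a
// minTick from the number of decimals it shows.
import * as https from 'https';

function getJSON(url: string): Promise<any> {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      let body = '';
      res.on('data', chunk => body += chunk);
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

getJSON('https://api.bitfinex.com/v1/pubticker/ltcusd').then(ticker => {
  const decimals = (ticker.last_price.split('.')[1] || '').length;
  console.log('last_price:', ticker.last_price, '-> derived minTick:', Math.pow(10, -decimals));
});
```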

Camille92 commented 7 years ago

Hello Carles,

Sorry, I only had time to look at the issue today. It still happens to me on all the markets (except BTC/USD, because the liquidity there is high enough).

I played with the formula and found where the bug is:

The formula takes the last digit of the price into account to define the minTick. The issue is that Bitfinex doesn't put "0"s at the end of last_price when the price is round, so the formula is ineffective.

Look, here is a screenshot from DASH/USD. Last price = 98.4


The formula expects this number to be 98.400 to work well, not 98.4. So when that happens, Tribeca chooses a "bad" tick for the market :/

I think a good way to work around this problem is not to look at the last decimal, but to divide the price by 10,000: the decimal place of the first non-zero digit of the result defines the min tick.

(I don't know how to translate that into a formula, but I'm sure it's close to what we have already.)

here is a list of examples:

For price = 100 -> divide by 10,000 = 0.01 -> min tick = 0.01
For price = 26.1 -> divide by 10,000 = 0.00261 -> min tick = 0.001
For price = 0.0012345 -> divide by 10,000 = 0.00000012345 -> min tick = 0.0000001
For price = 9.99 -> divide by 10,000 = 0.000999 -> min tick = 0.0001
For price = 12340000 -> divide by 10,000 = 1234 -> min tick = 1,000
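
In code, the proposal would look something like this (my own rough translation, not Tribeca's current formula); it reproduces all five examples above:

```typescript
// Rough translation of the proposal: the min tick is the decimal place of the
// first non-zero digit of price / 10,000, i.e. four orders of magnitude below
// the price's own leading digit.
function proposedMinTick(price: number): number {
  return Math.pow(10, Math.floor(Math.log10(price)) - 4);
}

[100, 26.1, 0.0012345, 9.99, 12340000].forEach(p =>
  console.log(p, '->', proposedMinTick(p)));
// 100 -> 0.01, 26.1 -> 0.001, 0.0012345 -> 1e-7, 9.99 -> 0.0001, 12340000 -> 1000
```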

ctubio commented 7 years ago

ok i see, seems like last_price in BTC/USD is always nice, but in other markets the decimals are truncated when they are zeros, okok many thanks'¡

ctubio commented 7 years ago

finally used both: https://api.bitfinex.com/v1/pubticker/LTCUSD (or similar) to read the current price, and https://api.bitfinex.com/v1/symbols_details to enforce the expected precision in case it is truncated
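
A plausible sketch of how the two endpoints combine (not necessarily the exact gateway code; it assumes price_precision in symbols_details is a count of significant digits, so the tick also depends on the magnitude of the current price):

```typescript
// Sketch: derive the minTick from last_price (magnitude) plus the
// price_precision (significant digits) reported by symbols_details.
import * as https from 'https';

function getJSON(url: string): Promise<any> {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      let body = '';
      res.on('data', chunk => body += chunk);
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

async function minTick(pair: string): Promise<number> {
  const ticker = await getJSON('https://api.bitfinex.com/v1/pubticker/' + pair);
  const details = await getJSON('https://api.bitfinex.com/v1/symbols_details');
  const info = details.find((d: any) => d.pair === pair);
  const price = parseFloat(ticker.last_price);
  // place value of the last significant digit allowed by price_precision
  return Math.pow(10, Math.floor(Math.log10(price)) + 1 - info.price_precision);
}

minTick('ltcusd').then(t => console.log('LTC/USD minTick:', t));
```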

Camille92 commented 7 years ago

It works like a charm, thank you :)