Minor tweaks to ease data fetching
Closed: saurabhbansal123 closed this 7 years ago
This thread is in continuation to #691.
For a basic back-testing strategy, I am not sure a 1-min candlestick is a must-have. Given the high volatility in the data, and since the bot is not meant for high-frequency trading, I feel it's better to trade on more aggregated candles (like 5-min or 15-min); they capture the trend without producing spurious results. Another disadvantage of a small candle period is the latency. @askmike What are your thoughts on this?
Good point. With Gekko you wouldn't usually calculate on 1-minute candles: you can configure whatever candle size you want (the default is 60 minutes / 1 hour). But internally everything is based on minutely candles, mostly because the codebase does not care what exchange it is and I wouldn't like to change this based on the data a specific exchange provides. 1-minute candles allow aggregating into, say, 33-minute candles, but if your smallest candle is 5 minutes you can only aggregate into 30- or 35-minute candles.
That said, if you know you won't ever be doing any calculations over a candle size that is not a multiple of 5 minutes, you can easily convert 5-minute candles into fake 1-minute candles (you lose accuracy, which is only okay if you stick to candle sizes that are multiples of 5). Because this is tricky (you need to understand the granularity of all your data, it becomes harder to combine datasets, etc.) Gekko does not offer it, but doing this yourself is very easy.
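For concreteness, a minimal sketch of the aggregation direction (1-minute candles into larger ones); the candle field names here are an assumption, not taken from Gekko's source:

```js
// Aggregate an array of consecutive 1-minute candles into N-minute candles.
// The candle shape ({ start, open, high, low, close, volume }) is assumed here.
function aggregateCandles(oneMinCandles, sizeInMinutes) {
  const result = [];
  for (let i = 0; i < oneMinCandles.length; i += sizeInMinutes) {
    const batch = oneMinCandles.slice(i, i + sizeInMinutes);
    result.push({
      start: batch[0].start,
      open: batch[0].open,
      high: Math.max(...batch.map(c => c.high)),
      low: Math.min(...batch.map(c => c.low)),
      close: batch[batch.length - 1].close,
      volume: batch.reduce((sum, c) => sum + c.volume, 0)
    });
  }
  return result;
}

// Example: 1-minute candles can become 33-minute candles,
// something 5-minute source data could never give you.
// const candles33 = aggregateCandles(oneMinCandles, 33);
```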
If it is possible, an ideal implementation would be to give the user both options for the pull (trade-level data or 5-min candles) to make it easier to set up. Otherwise, can you share the code where the data pull from Poloniex is done? I can try tweaking it a little to add this.
Right now Gekko uses code that pulls in trades and transforms them into candles in this file: https://github.com/askmike/gekko/blob/stable/core/markets/importer.js in combination with this file: https://github.com/askmike/gekko/blob/stable/importers/exchanges/poloniex.js
In order for Gekko to work properly you do still need to spoof fake 1-minute candles.
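As a rough illustration of that spoofing (not code from Gekko), one could split each 5-minute candle into five identical fake 1-minute candles; the candle shape and timestamp handling below are assumptions:

```js
// Split one 5-minute exchange candle into five "fake" 1-minute candles so that a
// minutely pipeline accepts them. This loses all intra-candle detail: every spoofed
// candle simply repeats the 5-minute OHLC and carries a fifth of the volume.
// The field names and the unix-timestamp `start` are assumptions, not Gekko's exact shape.
function spoofOneMinCandles(fiveMinCandle) {
  const fakes = [];
  for (let i = 0; i < 5; i++) {
    fakes.push({
      start: fiveMinCandle.start + i * 60, // assumed: start is a unix timestamp in seconds
      open: fiveMinCandle.open,
      high: fiveMinCandle.high,
      low: fiveMinCandle.low,
      close: fiveMinCandle.close,
      volume: fiveMinCandle.volume / 5
    });
  }
  return fakes;
}
```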
My end objective is to fetch data for all trading pairs on Poloniex (at least the top 30 alts) to have enough samples for testing a machine learning model. Once a basic classifier is in place it can be further enhanced using more granular data. If I try to use the current implementation to fetch data, it will take at least a few months to get historical data going back to 2015.
Keep in mind that when you use the "importer" you are basically running a simple fetch script that inserts records into the database (of the configured adapter; the default is SQLite, but PostgreSQL might perform a lot better in your situation).
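For reference, a heavily hedged sketch of what switching the adapter to PostgreSQL might look like inside config.js; the option names below are assumptions, so verify them against the sample config that ships with Gekko:

```js
// Inside config.js (where the `config` object is already defined):
// switch the storage adapter from sqlite to postgresql.
// All option names here are assumptions, not verified against Gekko's sample config.
config.adapter = 'postgresql';

config.postgresql = {
  path: 'plugins/postgresql',
  connectionString: 'postgres://user:pass@localhost:5432', // hypothetical credentials
  database: null, // assumption: null lets Gekko derive a database name
  schema: 'public'
};
```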
I'm not sure if you want to build your machine learning model on top of Gekko output (Gekko exposes an API which you can use to backtest: you can do a POST call to start a backtest, provide everything you would fill in in the UI, and the response will be the result in JSON), but I would advise against it, since Gekko is set up to be reusable (it uses 90% of the same code for running live as for backtests as for importing data, which means none of them can be extremely optimized).
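To illustrate the API route mentioned here (and referred to later in this thread as /api/backtest), a hedged sketch of such a POST call; the payload field names mirror what the UI form asks for but are illustrative assumptions, not the documented schema:

```js
// Start a backtest through Gekko's REST API (assumes the UI server runs on localhost:3000).
// The payload fields below are assumptions that mirror the UI form, not a documented schema.
const payload = {
  watch: { exchange: 'poloniex', currency: 'USDT', asset: 'BTC' },
  tradingAdvisor: { enabled: true, method: 'MACD', candleSize: 60, historySize: 10 },
  MACD: { short: 10, long: 21, signal: 9, thresholds: { down: -0.025, up: 0.025, persistence: 1 } },
  backtest: { daterange: { from: '2017-01-01', to: '2017-06-01' } },
  performanceAnalyzer: { enabled: true, riskFreeReturn: 5 }
};

// Node 18+ ships a global fetch; on older Node use a HTTP client library instead.
fetch('http://localhost:3000/api/backtest', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(payload)
})
  .then(res => res.json())
  .then(result => console.log(result)); // the response is the backtest result as JSON
```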
My view is that 1m is a must-have; my most effective strats run on them.
> I'm not sure if you want to build your machine learning model on top of Gekko output (Gekko exposes an API which you can use to backtest: you can do a POST call to start a backtest, provide everything you would fill in in the UI, and the response will be the result in JSON), but I would advise against it, since Gekko is set up to be reusable (it uses 90% of the same code for running live as for backtests as for importing data, which means none of them can be extremely optimized).
I followed this suggestion to build a program that uses /api/backtest and it works great. The problem I'm having is the backtest performance. Is it possible to optimize the backtest process? The SQLite database has no keys; could that have an influence? Thank you very much.
This is fundamentally the problem with using Gekko in this way: it's too slow. I've tested it on a big AWS box, my own i7-7700K + SSD, and so on; it's just not going to be quick enough without making Gekko multi-CPU capable.
As mentioned in another request, Tulip is quicker than TA-Lib but still not going to hit the speeds we'd like.
> The problem I'm having is the backtest performance. Is it possible to optimize the backtest process?

> This is fundamentally the problem with using Gekko in this way: it's too slow. I've tested it on a big AWS box, my own i7-7700K + SSD, and so on; it's just not going to be quick enough without making Gekko multi-CPU capable.
Yes the backtesting in Gekko is anything but optimised for speed. This is what I posted in another issue:
The idea of Gekko is that it is a simple starter kit for building TA strats; simplicity is achieved by sacrificing more advanced (and complex) functionality that only a small portion of users would need but that makes everything more confusing for everyone else. Examples of this are:
- Simple flow based codebase: 90% of the code for importing is the same as for running backtests is the same as running against a live market.
- No possibilities for execution strategies: your strat advices long or short and based on that your portfolio will be reallocated ASAP.
- No data below 1 minute resolution
- etc.
TLDR: the backtesting code was written with readability in mind, and I am pretty sure that with some effort we can get the backtester to be at least 10 times faster. But having a fast backtester was never part of the design goals.
If you want the backtester to be faster, please first help me understand what you are trying to do: do you want a single backtest over a very large dataset? Do you want to run different backtests consecutively? Do you want to backtest multiple strats at the same time?
What I am trying to do, specifically, is to find the best possible MACD strategy through a genetic algorithm. For this I take a set of configurable strategy parameters and compare the backtest outputs to choose the fittest parameter sets. Returning to your question: what I need is to run the backtest many times over the same data set, but with different strategy parameters. I was trying to get into the backtest code but I still do not fully understand how it works. I'm still new to Node.js!
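As a rough sketch of that loop (not the code from the repo linked below): generate parameter sets, score each one with a backtest, keep the fittest, and mutate them into the next generation. `runBacktest` is a hypothetical stand-in for however you invoke a single backtest (for example the REST API):

```js
// Hypothetical GA skeleton for tuning MACD parameters against backtest results.
// runBacktest(params) is a stand-in for your own backtest call returning a profit score.
function randomParams() {
  return {
    short: 5 + Math.floor(Math.random() * 15),
    long: 20 + Math.floor(Math.random() * 30),
    signal: 5 + Math.floor(Math.random() * 10)
  };
}

function mutate(params) {
  const copy = { ...params };
  const keys = ['short', 'long', 'signal'];
  const key = keys[Math.floor(Math.random() * keys.length)];
  copy[key] = Math.max(1, copy[key] + (Math.random() < 0.5 ? -1 : 1));
  return copy;
}

async function evolve(runBacktest, generations = 20, populationSize = 100) {
  let population = Array.from({ length: populationSize }, randomParams);
  for (let gen = 0; gen < generations; gen++) {
    // Score every candidate; this is the expensive part that wants to run in parallel.
    const scored = await Promise.all(
      population.map(async params => ({ params, score: await runBacktest(params) }))
    );
    scored.sort((a, b) => b.score - a.score);
    const survivors = scored.slice(0, populationSize / 4).map(s => s.params);
    // Refill the population by mutating random survivors.
    population = survivors.concat(
      Array.from({ length: populationSize - survivors.length },
        () => mutate(survivors[Math.floor(Math.random() * survivors.length)]))
    );
  }
  // The first entry is the top scorer of the final generation's survivors.
  return population[0];
}
```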
I've put my ga here: https://github.com/thegamecat/gekko-trading-stuff
> find the best possible MACD strategy through a genetic algorithm
This sounds very interesting, it's a space I have looked at before but never got around to fully integrating into Gekko.
If you want to do something like this efficiently, this is how I would tackle it:
1. Load the candle data for the test period into memory once and reuse it for every backtest, instead of reading it from disk on each run.
2. Fix the candle size for the whole generation, so candles only need to be calculated once.
3. Run all strats of the same generation in parallel.
I think this way you can backtest one generation of 100 strats faster than you can now run a single backtest.
You mean outside Gekko? Sorry, I'm a bit confused. Point one seems fair even for the current backtest implementation; perhaps easily done with memcached. Point two: maybe, though candle size is a nice candidate to be a tested gene.
I will get into it. THANKS!
> I've put my ga here: https://github.com/thegamecat/gekko-trading-stuff
This is a great starting point, but it won't be fast because individual backtests are slow (within Gekko).
> You mean outside Gekko? Sorry, I'm a bit confused. Point one seems fair even for the current backtest implementation; perhaps easily done with memcached. Point two: maybe, though candle size is a nice candidate to be a tested gene.

> I will get into it. THANKS!
All these points were talking about a way to do what you want done fast. Gekko currently does not offer this, so you'd have to do all of this outside Gekko. That said: Gekko is very modular, so a lot of building blocks are already implemented in Gekko and can be reused to create a fast backtester:
> run all strats of the same generation in parallel.
This can easily be implemented with Gekko's plugin architecture (read more here). But you probably don't want to use the full pipeline architecture, as it's not optimized for high throughput.

TLDR: if you have data in a format Gekko understands, you can reuse a lot of Gekko code to create a backtester that runs fast by using the appropriate modules already found in Gekko's code.
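One way to act on the "run all strats of the same generation in parallel" point without touching Gekko's internals is to spread individual backtests over separate Node processes. A minimal sketch using Node's built-in child_process; `backtestWorker.js` is a hypothetical helper script that runs one backtest for the params it receives and prints the score as JSON:

```js
// Fan one generation of parameter sets out over child processes, one backtest each.
// 'backtestWorker.js' is a hypothetical script: it reads params from argv, runs a
// single backtest (e.g. via Gekko's REST API or a wrapped pipeline) and writes the
// resulting score to stdout as JSON.
const { execFile } = require('child_process');
const os = require('os');

function runOne(params) {
  return new Promise((resolve, reject) => {
    execFile('node', ['backtestWorker.js', JSON.stringify(params)], (err, stdout) => {
      if (err) return reject(err);
      resolve(JSON.parse(stdout));
    });
  });
}

async function runGeneration(paramSets) {
  const concurrency = os.cpus().length;
  const results = [];
  // Simple pool: keep at most `concurrency` backtests running at once.
  for (let i = 0; i < paramSets.length; i += concurrency) {
    const batch = paramSets.slice(i, i + concurrency);
    results.push(...await Promise.all(batch.map(runOne)));
  }
  return results;
}
```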
I am going to close this issue as @saurabhbansal123 has not replied since opening it (but feel free to keep replying).
I'm trying to do exactly this, but instead of posting to the API I attempted to:
For some reason the second run never happens and I get no output from the debugger. Any idea why?
@amhed Are you reusing the same pipeline? It's very hard to say without knowing exactly what you mean with:
> Re-instantiate the Market and the GekkoStream with the new config and run the pipeline again
But in all honesty, the big bottleneck that makes this slow is that for each single backtest all candle data has to be loaded from disk. There is not much overhead in using the REST API over wrapping the pipeline yourself.
Another option would be using gekko/core/workers/pipeline which sets up a child process (or you can even hack the child and not use child processes). This way you'll get some optimisations I've already done (silencing all logging, skipping some runtime checks, etc). Note that this is used by the REST API.