When running UNREAL example

joaosalvado10 commented 6 years ago

Hello,

When I run the UNREAL example I got the following output.

/home/jsalvado/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: compiletime version 3.5 of module 'tensorflow.python.framework.fast_tensor_util' does not match runtime version 3.6 return f(*args, **kwds) </home/jsalvado/tmp/test_gym_unreal> already exists. Override[y/n]? y WARNING:Launcher:Files in </home/jsalvado/tmp/test_gym_unreal> purged. 2017-11-27 16:52:54.666375: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA E1127 16:52:54.670453114 18319 ev_epoll1_linux.c:1051] grpc epoll fd: 7 2017-11-27 16:52:54.671044: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA E1127 16:52:54.671596938 18320 ev_epoll1_linux.c:1051] grpc epoll fd: 8 2017-11-27 16:52:54.676864: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> localhost:12230} 2017-11-27 16:52:54.676891: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233} 2017-11-27 16:52:54.677761: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230} 2017-11-27 16:52:54.677801: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:12231, 1 -> 127.0.0.1:12232, 2 -> 127.0.0.1:12233} 2017-11-27 16:52:54.677844: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12230 2017-11-27 16:52:54.679672: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12231 Press Ctrl-C or [Kernel]->[Interrupt] to stop training and close launcher. 2017-11-27 16:52:59.683070: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA E1127 16:52:59.683829609 18359 ev_epoll1_linux.c:1051] grpc epoll fd: 9 2017-11-27 16:52:59.686654: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA E1127 16:52:59.687214727 18360 ev_epoll1_linux.c:1051] grpc epoll fd: 10 2017-11-27 16:52:59.689904: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230} 2017-11-27 16:52:59.689941: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> localhost:12232, 2 -> 127.0.0.1:12233} 2017-11-27 16:52:59.690832: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12232 2017-11-27 16:52:59.693367: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:12230} 2017-11-27 16:52:59.693405: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:12231, 1 -> 127.0.0.1:12232, 2 -> localhost:12233} 2017-11-27 16:52:59.694368: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:12233 2017-11-27 16:53:04.660194: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session f67f491b8dcaa755 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:0/cpu:0" inter_op_parallelism_threads: 2 2017-11-27 16:53:09.148756: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session dd584888bb6349a4 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:1/cpu:0" inter_op_parallelism_threads: 2 2017-11-27 16:53:09.294430: I tensorflow/core/distributed_runtime/master_session.cc:1004] Start master session 785d00122230e0e1 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:2/cpu:0" inter_op_parallelism_threads: 2 WARNING:worker_1:worker_1: started training at step: 0 WARNING:worker_2:worker_2: started training at step: 0 WARNING:worker_0:worker_0: started training at step: 0 WARNING:Env:Data_master reset() called prior to reset_data() with [possibly inconsistent] defaults. WARNING:Env:Dataset not ready, waiting time left: 298 sec. WARNING:Env:Dataset not ready, waiting time left: 298 sec.

Do you know what can be done to make it work? Thank you very much.

João Salvado

Kismuz commented 6 years ago

@joaosalvado10, Short: these are not a errors but a bunch of warnings and logs, training is started and going on. Just start Tensorboard to track progress. Expanded:

..._bootstrap.py:219: RuntimeWarning: compiletime version 3.5... - TF runtime warning, it's ok to proceed but better to upgrade Tensorflow to latest;

...Your CPU supports instructions... - TF warns that it can do better if you use your CPU-cpecific TF compilation. Search StackOverflow and GutHub - a lot of info and precompiled libraries available;
` ...Started server with target: grpc://localhost....', etc. - distributed TF session logs
...Dataset not ready, waiting time left: 298 sec.... - BTgym specific, just ignore if time_left not dropping down to 0.

joaosalvado10 commented 6 years ago

yes you absolutely right, I would like to know how to use the model that is trained on a test data and what is the best way to see the results. Thank you very much

Kismuz commented 6 years ago

@joaosalvado10 Short:

you can just reuse saved model with new dataset.
do not expect good performance as these models do not generalise well [yet].

Expanded:

model is automatically saved (checkpointed) every 5 minutes of training in log directory along with tensorboard summaries. For now you can just relaunch notebook with new dataset and old log_dir and answer no to question if need to purge old data. Note that TB summaries will be appended. If you want summaries from the scratch - manually remove exsiting summaries from log_dir (keep model checkpoints!);
to prevent model to adapt to new data - set learning rate close or equal to zero;
designated TestWorker is not yet implemented;
generalisation is the hardest part of this problem as underlying process is generally non-stationary. I have no decent solution for this at the moment. As I mentioned, this is work in progress and there are some approaches am looking on: meta-learning, such as RL^2, MAML , knowledge transfer methods, source-target domain adaptation (DARLA), etc.

joaosalvado10 commented 6 years ago

Ok, I am going to run the trained model on test data to understand how good is generalizing the problem. I remember of seeing a paper really nice on meta learning maybe you found it interesting https://github.com/cbfinn/maml.

Hope to ear more from you. Great project!

Kismuz commented 6 years ago

Yup, it is an implementation of MAML I have mentioned above. It is in my roadmap.

joaosalvado10 commented 6 years ago

I still have a questions regarding this post, so what i want in this moment is to use the trained model on new data. So my question is given the final trained model (model.ckpt-something.data-something) how can i predict the best action given for example only the columns needed for the model([open,high,low etc]). I think I am a bit confused on RL loop. Thank you for the help.

João Salvado

joaosalvado10 commented 6 years ago

What i want to Do in fact is:

1.Feed the model with a given row of pandas dataframe that has information of m price open/high/low/close. 1.1 (Question) - I need to know what do I use to "transform" this new observation into a state to feed the model. 1.2(Question) - I need to know after having the state which method do i call to get an action that the agent wants to take, I need to know what is the method in the model to predict the next action.

In fact I want to adapt the trained agent in a new environment, this new environment would look like something similar with a live trading so in this case i do not want to use Cerebro. I would receive a new observation as the time goes instead of receiving the all data.
I know that is possible to test the developed model with data test by putting a new csv and chage the learning rate and the other things as you correctly said. My question is, is it possible to use the model to interact with the kind of environment that i described to you?

Thank you João Salvado

Kismuz commented 6 years ago

@joaosalvado10 , First, I have to mention that bt.Cerebro is heavily used in a whole estimation workflow and it will take time and efforts to exclude it; 1.1: You need to take last 30 OHL prices and transform it to numpy array via algorithm described in methods __init__() and get_state() of class DevStrat_4_6, see here: https://kismuz.github.io/btgym/_modules/btgym/research/strategy_4.html#DevStrat_4_6

1.2. Method you asking about is 'policy.act()` and is described here: https://kismuz.github.io/btgym/_modules/btgym/algorithms/policy.html#BaseAacPolicy.act

Note, that Model itself is an abstract term and is in fact conjunction of policy class, AAC framework class and environment_runner class. Shortly, policy_class holds learned parameters, AAC holds loss definition and training loop and runner is responsible for environment <-> policy_a <-> train_loop coordination.

joaosalvado10 commented 6 years ago

@Kismuz , So isn't there a way of creating a new waiting method event that always waits to receive new data and use this data in the pretrained model to return the action taken. This method would be always waiting for new data, in your opinion what is the best way to achieve this without turning the code "upside down".

Kismuz commented 6 years ago

Why not, First, you need to wrap you observations in gym-like episodic environment: reset() will return initial observation. each step() - next and so on; it is necessary as thread_runner expects observations via this kind of environment interface.

Second, you actually need to redefine thread_runner loop; Take a close look on what's going on in https://kismuz.github.io/btgym/btgym.algorithms.html#btgym.algorithms.runner.env_runner You need truncated version of it, something that just gets next_observation and next_action.

joaosalvado10 commented 6 years ago

Yes i think that I am understanding, instead of calling env_reset (in the first step) and step_next in the loop i would need to call some method that would contain the last STATE which is the last 30 OHL prices transformed to numpy array via algorithm described in methods init() and get_state() of class DevStrat_4_6. Also the runner,env_runner could not be a loop that takes all the values as in this moment, it would be necessary that it tooks only one State at the time and not a for loop like in this moment.

Also i have one question The runner is called the max_Env_steps but what the global_Step refers to ?

I have also one question: After performing some experiments i realized that the total reward increases with the time the total loss and entropy decreases which is good and indicate that the agent is learning. However i realized that the episode/final_value does not increase that much with the experience, i was expecting him to maximize the profit. I already changed the drowdown_call and target_Call to 80% and 100% respectively but still not achieving great results on profit (episode/final_value - initial_value(broker_set_cash)). Should be used other reward function? i cant understand really well the one being used.

Thank you, João Salvado

Kismuz commented 6 years ago

@joaosalvado10 , Global_step refers to shared number of environment steps made so far ( i.e. summarised over all environment instances). Used mainly for summaries and learn rate annealing estimation.

Yes, total reward received is usually bigger than final account value (we should see 'broker value' as we suppose all positions will be forcefully closed at the end of the episode). This is indeed the flaw in reward function a have to address. Simply, in sense of 'expected reward' they are same and as RL MDP task is formulated as 'maximising expected return' - that's why this function was chosen. But in fact, having a lot of good trades, agent sometimes 'spoils' the entire episode in last moment.

It can be viewed a gap between theoretical RL formulation (expected performance) and real-life application(one-shot performance).

One of solutions I think of is additional reward shaping functions forcing agent to close all trades or penalising big exposures near the end of the episode. Anyway it's an essential direction to work on.

joaosalvado10 commented 6 years ago

Hello Kismuz thank you for your effort and help.

I am a bit stuck at this moment in the problem. Do you have any like workflow/fluxogram so that I could understand better the way the modules are connected and so I could change the stuff on them ?

johndpope commented 6 years ago

I'd start with @joaosalvado10 - https://www.backtrader.com/

you need to be concerned with bt.Strategy class and it's internal methods. https://www.backtrader.com/docu/talib/talib.html#examples-and-comparisons get a simple vanilla SMA (simple moving average) chart up and running first alone with backtrader.

then progress to - BTgymBaseStrategy https://github.com/Kismuz/btgym/blob/72baca83a7353f541399a264f9156c4ed5d5d026/btgym/strategy/base.py

then extend this class with your bring your own strategy -> (presumably here ) the btgym then extends on this using predictions / inference based of models trained. https://github.com/Kismuz/btgym/blob/b09d2eb42b2a50a43d58ce23b835a5d812ead95b/btgym/research/strategy_4.py // AI stuff get_reward get_state

Kismuz commented 6 years ago

@joaosalvado10, actually the only workflow description I have is here: https://kismuz.github.io/btgym/intro.html#environment-engine-description and it only describes environment itself, not AAC RL framework; As @johndpope commented, it's better to start with backtrader OpenAI Gym operation basics and . Than you can play with https://github.com/openai/universe-starter-agent and examine source code; after that you can easily see that my implementation of A3C is just domain-tuned universe-starter-agent augmented with auxillary tasks, losses, replay memory and additional summaries. Hope someday I' have time to make extensive description but for now i's only https://kismuz.github.io/btgym/index.html and source code.

joaosalvado10 commented 6 years ago

@Kismuz Thank you for the help, I Think that you did a great job, still I think that i am going to have hard time to use this in the way that i want. I would like to ear from you if you think that is any implementation of this state of the art algorithms that can be used(with good generalization and results) in stock market.

@johndpope Thank you for the tips, Have you managed to do any similar work to the one that I have described? Actually what i want is kind of remove cerebro so i can have a routine always receiving new inputs

Kismuz commented 6 years ago

@joaosalvado10,

if you think that is any implementation of this state of the art algorithms that can be used(with good generalization and results) in stock market.

To the best of my efforts, I have not found any open published results proving successful application of deep RL methods to this domain, at the time present.

This does not mean such implementations doesn't exist: due to the nature of domain, there could be commercial/private/corporate owned results we do not know of.

As for generic open source codebases, if you want to use one, my hard-earned advice would be: stick to proven quality code sources such as OpenAI or DeepMind. Those algorithms have a lot of moving parts and are stochastic by the nature, so you can get a very hard time just understanding what's going wrong if your code inherit some subtile bug.

joaosalvado10 commented 6 years ago

@Kismuz I think that using RL learning algorithms proposed by Google like A3C, UNREAL and others is great. However I think that is important to take into account that those ones were made for image, for example they used 2D Conv layers in the network and a question that I still have to my self is: Does it make sense to use those kind of algorithms in a problem like this? Because in image there is spatial and temporal relationship however in the stock market and time series problems this relationship is not true and of course it is possible to replicate the algos as you did (and i think that i would do exactly the same way) you "recreate a image" from time series however i am bit concern that this assumption would lead to not so good results in time series problems comparing to image problems like atari. Off course the stock problem by itself is way more difficult than the atari games as well.

Kismuz commented 6 years ago

@joaosalvado10,

those ones were made for image

think of time series as of images with height 1, nothing wrong with that :)

Seriously, I would point that algorithms itself as absolutely data agnostic, it is parametrised policy estimator architecture that has been tuned for particular input type. And it CAN handle temporal relationships right from first DQN atari game time, even when simple convolutional feedforward architecture was used. In brief, intuition comes from dynamical systems theory, from Takens embedding theorem in particular. It states, roughly, that for any dynamical system S unfolding in discrete time, there exists a finite number N, such as at any time moment entire system dynamics is described by vector of N last states V[t] = [S[0], S[-1], ..., S[-N]], called time-embedding. Note that by above theorem dynamical system S' consisted of states V is always 'markovian', even if original system is not. That's why all feedforward RL estimators in atari domain use 'frame-staking' feature, usually 4 frames. This is, any atari game needs just time-embedding of 4 to became markov decision process, thus enabling correct application of Bellman equation which is in the heart of above mentioned RL algorithms. When you employ rnn estimators, it is exactly rnn hidden state from previous step that holds all the time embedding information in 'compressed' form. But It seems that in practice we need both time-embedding AND rnn context to learn good (=disentangled) spatio-temporal representations as recently noted: https://arxiv.org/pdf/1611.03673.pdf https://arxiv.org/pdf/1611.05763.pdf

johndpope commented 6 years ago

Some other noteworthy repos https://github.com/llens/CryptoCurrencyTrader https://github.com/philipperemy/deep-learning-bitcoin

could talk for hours on this. Basically the approaches to generalize prediction across data has failed to work. It's miserable. With high transaction fees, a break even is to be expected. Having said that - I'm looking into Elliot wave pattern predictions to identify lucrative trading opportunities. http://arno.uvt.nl/show.cgi?fid=131569

joaosalvado10 commented 6 years ago

@Kismuz, yes that is kind off a nice way of thinking on this problem. Referring to what I had said before: " i realized that the episode/final_value does not increase that much with the experience, i was expecting him to maximize the profit." In fact this happens after a lot of runs. Is the cash (2000) set to zero in each episode ? How much money is being allocated when the agent wants to Long or Short? each runner has one thread associated, if i want to use the final trained "model" how do I use the policy.act ?

I am thinking about just call a function inside https://kismuz.github.io/btgym/btgym.algorithms.html#btgym.algorithms.runner.env_runner and this function would return 30 rows [hig low close volume] then i would call a function to transform this into a state similar to the one on https://kismuz.github.io/btgym/_modules/btgym/research/strategy_4.html#DevStrat_4_6 and finally i would call the policy act and then env.step but here i need to change env.step right? Or maybe as i do not need the network to continue learning i can pass this step, what you think ?

Finally i would never do env.reset() i would wait for new 30 rows. What do you think of this approach to achieve what i want?

Kismuz commented 6 years ago

@joaosalvado10, You made me do this :) Here is link to a3c workflow diagram, hope it helps: https://kismuz.github.io/btgym/intro.html#a3c-framework

Is the cash (2000) set to zero in each episode ?

yes it reset to initial 2000 cash;

How much money is being allocated when the agent wants to Long or Short?

it's pure backtrader:
```
# Set leveraged account:
MyCerebro.broker.setcash(2000)
MyCerebro.broker.setcommission(commission=0.0001, leverage=10.0) # commisssion to imitate spread
MyCerebro.addsizer(bt.sizers.SizerFix, stake=5000,)  
```
... means we set single order equal to 5000 and leverage x10, so roughly it can allocate 3 - 4 stakes (each reserves is ~5000/10=500 from account, ~ is here because leverage is actually floating) Overall position exposure can be tacked on episode rendering.

joaosalvado10 commented 6 years ago

@johndpope I have already hear about Elliot waves , it is an old but nice approach. Did you got good results using that approach?

joaosalvado10 commented 6 years ago

@Kismuz thank you very much! That is really helpful :)

Kismuz commented 6 years ago

@joaosalvado10, one more: https://kismuz.github.io/btgym/intro.html#environment-engine-description

huminpurin commented 6 years ago

@joaosalvado10 @Kismuz I'm glad to see someone asked about how to reuse model for new data (live trading) that I've been trying to figure out. I think a lot of people like me are expecting some function like "model.fit(pastdata);model.predict(singlenewdata);" as commonly used supervised learning packages to reuse trained model. I was working on it and the information you gave above is really helpful to sort things out. That being said ,I understand implementing a profitable model should be the first priority. As for this your discussion:

if you think that is any implementation of this state of the art algorithms that can be used(with good generalization and results) in stock market.

To the best of my efforts, I have not found any open published results proving successful application of deep RL methods to this domain, at the time present.

I have read a paper with impressive results on RL algorithmic trading recently. It archived

at least 4-fold returns in 50 days

They implemented something they called Ensemble of Identical Independent Evaluators with deep deterministic policy gradient algorithm. Actually their framework is also on github: https://github.com/ZhengyaoJiang/PGPortfolio

Kismuz commented 6 years ago

@huminpurin, thanks for link to this paper. Really impressive at first glance.

As for implementing portable models. During planning of this project, I had to sort out priorities so my thought was it is mach easier implement reusing methods if you have good trained model at hand :) that's why I put all efforts to agent's design and attempts to get convergence at comparatively large real datasets. As for now, I have been able to get moderate training convergence at one year 1min currency dataset with agent's structure very similar to A3C-NAV stacked LSTM model from https://arxiv.org/pdf/1611.03673.pdf and it required some tricks with data shaping. I'm gonna publish it shortly. As for generalisation ability, it is still low. My attempts with RL^2 approach didn't gave any positive results. Now I'm gonna focus on MAML implementation.

From structural point, providing test or live data to algorithm should be done trough data piping in BTgymDataSet class. Now I have implemented basic testing ability in AAC framework + test data shaping, see here: https://kismuz.github.io/btgym/btgym.html#btgym.datafeed.BTgymSequentialTrial Will also publish example shortly.

huminpurin commented 6 years ago

@Kismuz Great! Cant wait to see those new implement and example :)

joaosalvado10 commented 6 years ago

@huminpurin hello, Actually i come across that paper 2 days ago I think that is really nice approach and I would like to explore it in depth.

Kismuz / btgym

When running UNREAL example #23