Kismuz / btgym

Scalable, event-driven, deep-learning-friendly backtesting library
https://kismuz.github.io/btgym/
GNU Lesser General Public License v3.0

Performance Profiling Discussion #93

Closed · JaCoderX closed this issue 5 years ago

JaCoderX commented 5 years ago

My recent experimentation with BTGym involves using a long 'time_dim' and long episodes. In those settings, both CPU usage and memory usage increase drastically.

So what I'm trying to understand is where the framework's bottlenecks are and how to measure their performance (especially runtime, but also memory).

Potential places of performance interest (regarding long time_dim and long episode periods):

This is what I've tried so far:

@Kismuz can you also guide me to the place in the code where the input first enters the graph? I can see 'x' being passed at the beginning of the graph in tensorboard, but I'm not sure where that is reflected in the code.

Kismuz commented 5 years ago

@JacobHanouna ,

strategy next also seems to be ok

just a note: there are a lot of things happening in the background before and after each next call; in my personal experience a lot of time is wasted exactly while iterating the strategy and preparing episode data. It can be optimised, but in a quite hard way: one can transfer all the preprocessing tasks inside the tf graph (i.e. moving-average estimation, standardisation, differencing etc.); though this should only be done after a robust data representation scheme is proven to be optimal (which is not the case now). The beauty of a backtrader strategy is that one can quickly experiment with data manipulation (at the expense of execution speed).
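To make the idea concrete, here is a minimal, hypothetical sketch (not btgym code; the shapes, window size and epsilon are assumptions) of how moving-average estimation, standardisation and differencing could live inside a TF 1.x graph instead of the strategy:

```python
import tensorflow as tf

# Raw market state placeholder: [batch, time_dim, features]; shape is assumed.
x = tf.placeholder(tf.float32, [None, 128, 4], name='raw_state')

# Standardisation over the time axis:
mean, var = tf.nn.moments(x, axes=[1], keep_dims=True)
x_std = (x - mean) / tf.sqrt(var + 1e-8)

# Simple moving average via 1D average pooling (uniform window of 10 steps):
x_ma = tf.nn.pool(x_std, window_shape=[10], pooling_type='AVG', padding='SAME')

# First-order differencing along the time axis:
x_diff = x_std[:, 1:, :] - x_std[:, :-1, :]
```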

some performance concerns about the LSTM loop

since a static rnn graph is used by default, I think it is as fast as it gets; the cost grows with recurrent layer size; it can possibly be optimised by trying GRU instead of LSTM cells;
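For illustration, a standalone sketch of that swap (TF 1.x API; sizes are arbitrary and this is not the policy's actual builder code):

```python
import tensorflow as tf

time_steps, features, units = 64, 32, 64

# [time, batch, features] input, unstacked into the per-step list static_rnn expects:
x = tf.placeholder(tf.float32, [time_steps, None, features], name='rnn_in')
inputs = tf.unstack(x, num=time_steps, axis=0)

# GRU has fewer gates (and parameters) than LSTM, so it may run faster:
cell = tf.nn.rnn_cell.GRUCell(num_units=units)
# cell = tf.nn.rnn_cell.LSTMCell(num_units=units)  # the LSTM alternative

outputs, final_state = tf.nn.static_rnn(cell, inputs, dtype=tf.float32)
```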

...place in the code where the input is first entering the graph

it's done via placeholders:

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/policy/stacked_lstm.py#L90 https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/aac.py#L532 https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/aac.py#L595

feed dictionary for train step (backward pass) is composed here: https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/aac.py#L1210 via: https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/aac.py#L1101

when using policy for evaluating action or value fn. (forward pass) feeders are composed inside corresponding policy methods: https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/policy/base.py#L294 https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/policy/base.py#L350
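The common pattern behind all of the links above is the usual TF 1.x placeholder/feed_dict mechanism; a toy, self-contained sketch (names and shapes are made up, not btgym's):

```python
import numpy as np
import tensorflow as tf

# Placeholder standing in for a state input; shape is illustrative.
state_in = tf.placeholder(tf.float32, [None, 30, 4], name='external_state_in')
logits = tf.layers.dense(tf.layers.flatten(state_in), units=4)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Forward pass: the feeder maps the placeholder to one observation batch.
    obs = np.random.randn(1, 30, 4).astype(np.float32)
    action_logits = sess.run(logits, feed_dict={state_in: obs})
```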

JaCoderX commented 5 years ago

one can transfer all the preprocessing tasks inside tf graph

Where would be the correct place to do it in the graph/code? Or do I need to do it for every input entry (on/off policy, replay)?

Kismuz commented 5 years ago

I suggest making a separate 'preprocessor' network like those contained in the nn.networks module and placing it in between the state input placeholders and the encoder input; yes, replicate it for on/off policy modalities via the reuse arg, as with everything else.
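A rough sketch of that idea (layer choices and names are assumptions, not code from nn.networks): a small sub-graph shared between the on-policy and off-policy branches via the reuse argument:

```python
import tensorflow as tf

def state_preprocessor(x, name='preprocessor', reuse=False):
    """Toy preprocessing placed between the state placeholder and the encoder."""
    with tf.variable_scope(name, reuse=reuse):
        mean, var = tf.nn.moments(x, axes=[1], keep_dims=True)
        out = (x - mean) / tf.sqrt(var + 1e-8)
        # Any trainable layers declared here are shared on the reuse=True call.
        out = tf.layers.conv1d(out, filters=16, kernel_size=3, padding='same')
    return out

on_policy_in = tf.placeholder(tf.float32, [None, 128, 4], name='on_policy_state_in')
off_policy_in = tf.placeholder(tf.float32, [None, 128, 4], name='off_policy_state_in')

on_policy_features = state_preprocessor(on_policy_in, reuse=False)   # creates variables
off_policy_features = state_preprocessor(off_policy_in, reuse=True)  # reuses them
```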

JaCoderX commented 5 years ago

Not sure if it is completely compatible with your advice on how to add a 'preprocessor' network, but to make it easy for me I made a small test 'preprocessor' network that internally calls an encoder, and passed the reference to the policy config (this way I didn't need to change any internal code). I basically wrote a custom encoder.

BTW, should encoders use reuse=True?

Beauty of backtrader strategy is that one can quickly experiment with data manipulation

I agree, it makes more sense for experimentation to use the structure of backtrader, especially if some of the preprocessing is done in set_datalines (or any other part of the init).
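For reference, that pattern looks roughly like the sketch below (the indicator choices and the import path are assumptions, not taken from this thread):

```python
import backtrader.indicators as btind
from btgym.strategy.base import BTgymBaseStrategy  # import path assumed


class PreprocStrategy(BTgymBaseStrategy):

    def set_datalines(self):
        # Indicators declared once here are computed by backtrader's line
        # machinery, so next() only reads values that are already prepared.
        self.data.sma_16 = btind.SimpleMovingAverage(self.data.open, period=16)
        self.data.sma_64 = btind.SimpleMovingAverage(self.data.open, period=64)
        self.data.std = btind.StdDev(self.data.open, period=32, safepow=True)
```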

JaCoderX commented 5 years ago

@Kismuz, I was retraining a simple model after fetching the latest code from the repo, kind of a sanity check, and one of the things I noticed was that cpu_time_sec grew about 4-5 times. The original model was based on gen 5 around the time it came out.

A lot of changes were made and I'm not 100% sure I used the exact same model, but I see a dramatic change in cpu_time_sec and I'm not sure why it has changed so much.

Can you confirm? Or were some changes made that could explain it?

Kismuz commented 5 years ago

@JacobHanouna, the only change made that can affect train speed is that the broker part of the state is now passed through a convolution encoder by default (previously this was true only for the external state). It shouldn't affect train speed much unless a very large broker state tensor is used. Anyway, a 4-5x slowdown seems strange. How has 'cpu_time_sec' been obtained - do you have a custom metric in tensorboard? Can you give absolute numbers? Have you looked at your CPU workload via, say, the htop utility? With a 4-core CPU and four workers, typical load for all processes should be near 100%, periodically spiking down to about 70%.

JaCoderX commented 5 years ago

How 'cpu_time_sec' has been obtained - do you have a custom metric at tensorboard?

I didn't change the default one.

Can you give absolute numbers?

I was retraining a few models because I wasn't sure which one matches the original, but on average the early models were stable around 13-17; now it shows an average around 60-65.

On each checkpoint save I used to have a cluster speed of around 30 steps/sec per worker; now it is around 7 per worker.

Have you looked at your CPU workload via, say, htop util?

Yes, I keep track of the CPU and memory workload; I'm not running at full load.

This is my quick impression of the issue; I will investigate a bit more in depth.