Kismuz / btgym

Scalable, event-driven, deep-learning-friendly backtesting library
https://kismuz.github.io/btgym/
GNU Lesser General Public License v3.0

Features Extraction Discussion #79

Closed JaCoderX closed 5 years ago

JaCoderX commented 6 years ago

Following your work on causal convolution and a few issues mentioning the subject, I started digging into it a bit, and it looks like a very promising research direction.

I gained my understanding of the topic from these two papers: WaveNet and Fast WaveNet.

If my intuition is correct, the number of layers (assuming 2^n dilation) has an impact on how far back the model can 'see' historical data and learn from it, so more layers mean a better ability to learn long-term patterns. Does it make sense?

Looking at your work in '../research/casual_conv/', how do I know how many layers the network uses and what the dilation parameter is? Is it possible to change them?

My goal is to use high-resolution data (1-min) while letting the network learn long-term features.

Also, could you explain in a nutshell what you are trying to achieve in your work on 'CasualConvStrategy'?

Kismuz commented 6 years ago

@JacobHanouna,

the number of layers (assuming 2^n dilation) has an impact on how far back the model can 'see' historical data and learn from it, so more layers mean a better ability to learn long-term patterns

exactly that.

how do I know how many layers the network uses

in this setup the 'look-back' parameter is given (by the time_dim parameter of the strategy), so one can simply infer the number of layers needed to form a full 'pyramid' ending with a layer holding just a single convolution kernel (assuming the dilation factor is 2 and time_dim is a power of 2, which is sensible); if time_dim = 2^n --> num_layers = n + 1; it looks like this:

       * -> o
      ** -> u
    **** -> t
******** ->
-8 <- in -> 0

here we have time_dim=8 (looking back 8 time-frames), so the number of layers is 4; in each layer the number of kernels is divided by 2; if time_dim=128, the encoder will build 8 layers.
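A quick sketch of that layer-count arithmetic (hypothetical helper, not part of btgym):

import math

def num_casual_conv_layers(time_dim):
    # full 'pyramid' with dilation factor 2: time_dim = 2^n --> num_layers = n + 1
    n = int(round(math.log2(time_dim)))
    assert 2 ** n == time_dim, 'time_dim is expected to be a power of 2'
    return n + 1

print(num_casual_conv_layers(8))    # 4
print(num_casual_conv_layers(128))  # 8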

upd: uncomment the last lines of 'conv_1d_casual_attention_encoder' to see tensor shapes

work on 'CasualConvStrategy'

this test class is deprecated;

JaCoderX commented 6 years ago

So does that mean if I have 1-min data and I want 4 months of history (60*24*120 = 172,800, roughly 2^18)... 19 layers? Will it make sense to work with so many layers?

Also, how does the input OHLC go into the network? Does each feature have its own network, or are they all in the same one?

Did you base your work on the Fast WaveNet paper?

JaCoderX commented 6 years ago

Searching the web for progress on this subject, I came across this paper - Conditional Time Series Forecasting with Convolutional Neural Networks - and an implementation of it on GitHub.

The paper is based on WaveNet, and it explores structural correlation between multivariate time series using conditioning, in the stock market setting.

The repo doesn't seem to be active, but the concept is very interesting.

Kismuz commented 6 years ago

@JacobHanouna,

... 19 layers? Will it make sense to work with so many layers?

No, I don't think so) Note that on top of the convolutional encoder an RNN is utilised, making it possible to capture correlations far beyond the time_embedding period. There are two common approaches to capturing time context: use convolutions along time, or use a context-aware structure (RNN); in my experience a combination of both works best. Some explanation here: #23, comment dated 4 Dec 2017;

Conditional Time Series Forecasting with Convolutional Neural Networks ...Fast WaveNet paper

yes, I used those papers; also these: https://arxiv.org/abs/1707.03141 https://arxiv.org/abs/1712.09763

JaCoderX commented 6 years ago

... 19 layers? Will it make sense to work with so many layers?

No, I don't think so) Note that on top of the convolutional encoder an RNN is utilised, making it possible to capture correlations far beyond the time_embedding period.

We extract time-dependent features and feed them to a policy that learns from experience - sounds pretty awesome!!! But we are also trying to focus the learning on 'human'-like features, stuff like technical indicators - which I am not sure are really necessary, as the network can learn useful representations by itself, but a small bias to steer the policy in the direction we want doesn't sound like a bad approach.

That being said, when I practice trading I switch back and forth a lot between chart timeframes to get a better understanding of macro vs. micro analysis. So my intuition is that the 'state' the network needs to 'see' would have to be a combination of macro and micro timeframes (for example, a window of 1-min data over a few months). This way the network has the opportunity to learn both macro and micro features by choosing what is relevant.

Again, this is just my intuition; I could be off. Let's say we have 1-min data: how can the network learn macro features if the window is only about 2 hours (time_dim=128)? And if we resample the data to 1-day bars, how would we learn micro features?

Kismuz commented 6 years ago

@JacobHanouna,

But we are also trying to focus the learning on 'human'-like features, stuff like technical indicators.

...actually, we don't. Think of SMA or any other indicator as a 'sufficient signal statistic'. You can feed a CWT/DWT or wFT signal decomposition with nearly the same result. The point is that the signal statistic should contain sufficient information, be bounded, and be weakly stationary.

but a small bias to steer the policy in the direction we want doesn't sound like a bad approach.

Indeed, but it should better be done with a higher-level approach like 'guided policy search', see: https://github.com/Kismuz/btgym/blob/master/examples/guided_a3c.ipynb

JaCoderX commented 6 years ago

Going over your replies again, the links to the related papers you mentioned, and the research code... I realized that I had misunderstood your implementation of the 'conv 1D causal encoder' (I assumed you used the dilated version suggested in WaveNet). But it gave me a better understanding of the subject, so thank you for your replies and the knowledge sharing :)

...So my intuition is that the 'state' the network needs to 'see' would have to be a combination of macro and micro timeframes... this way the network has the opportunity to learn both macro and micro features...

Now that the implementation is more or less clear to me, this last issue is an open question I have: how can we make sure the model is learning both macro and micro representations of the data?

In Atari games we usually need only 4 frames to understand the 'motion' of the objects we are interacting with. The assumption in those games is that frames further back are not very relevant to understanding the current state.

But regarding price movement, my intuition is that using only a couple of frames back gives us just a partial understanding of the 'motion'. It feels to me that to understand the whole motion we need to look a long way back in time.

OK, so I understand that working with a huge 'history window' to define each state may not be technically feasible. But maybe if we resample the data into different timeframes, extract features from each one separately, and combine them at the end, we can get a better representation of the motion, composed of multi-timeframe features.

Any thoughts on the subject would be appreciated...

Kismuz commented 6 years ago

@JacobHanouna, the key is a context-aware estimator, here an RNN. That means the 'tail of past observations' is encoded in the hidden state of the RNN (aka the context vector), which is passed from step to step. That alone can make the dynamics markovian. If we take an Atari game and employ an RNN estimator, we do not need to stack 4 frames to force the markov property: it trains ok with only one frame fed per step, due to the fact that previous frames are encoded in the RNN context. That's the case for the Atari example in btgym (sorry, haven't fixed it yet, there are some other problems that emerged apart from the action space). A plus of a context is that it has no strictly limited 'memory length'; one of the disadvantages is higher compression of the encoded information.

How can multi-time-scale feature learning be promoted? From my experience, one simple and efficient way is to feed a bunch of the same statistics averaged over different time windows - the idea of the 'scale' window in wavelet decomposition. The intuition is: a slowly-moving statistic (say, SMA 128) tracks prolonged movements, while a fast-moving one, say SMA 2, tracks fast movements. Quick example: if we set the signal as {SMA2, SMA16, SMA64} and the time window as, say, 128 -> we get a 3x128 array as a single observation. Note that to estimate a single one we need max{sma_periods} + time_embedding_period time steps of the original series, here: 192. So this single observation alone carries information from as far as 192 steps back; add the RNN's ability to remember past steps via context, and that's it.
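For illustration, a minimal pandas/numpy sketch of building such an observation (hypothetical helper, names and shapes assumed; not btgym code):

import numpy as np
import pandas as pd

def multi_scale_sma_observation(close, sma_periods=(2, 16, 64), time_dim=128):
    # Stack several SMAs of one price series into a single
    # [len(sma_periods) x time_dim] observation, as described above.
    close = pd.Series(close)
    need = max(sma_periods) + time_dim  # 64 + 128 = 192 steps of raw history
    assert len(close) >= need, 'not enough history for a single observation'
    rows = [close.rolling(p).mean().to_numpy()[-time_dim:] for p in sma_periods]
    return np.stack(rows)  # shape: (3, 128) with the default arguments

obs = multi_scale_sma_observation(np.random.rand(500))
print(obs.shape)  # (3, 128)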

JaCoderX commented 6 years ago

How can multi-time-scale feature learning be promoted?

I found a paper that addresses this issue using clustering (github): DEEP TEMPORAL CLUSTERING: FULLY UNSUPERVISED LEARNING OF TIME-DOMAIN FEATURES

...The proposed DTC algorithm was designed based on the observation that time series data have informative features on all time scales. To disentangle the data manifolds, i.e., to uncover the latent dimension(s) along which the temporal or spatio-temporal unlabeled data split into two or more classes, we propose the following three-level approach...

JaCoderX commented 6 years ago

Another paper I found interesting that addresses feature extraction from multi-scale data (in the video-detection field) - Dynamic Temporal Pyramid Network: A Closer Look at Multi-Scale Modeling for Activity Detection

The paper proposes a pyramid network that can extract feature representations over multiple time scales. The main idea is to extract short-to-long temporal features in a hierarchical structure, based on several time scales, using dynamic sampling.

...To fully exploit temporal relations at multiple scales and effectively construct a feature representation, we propose to use dynamic sampling to decode the video at varying frame rates and construct a pyramidal feature representation. Thus, we are able to parse an input video of arbitrary length into a fixed-size feature pyramid without losing short-range and long-range temporal structures. Nevertheless, our extraction method is very general and can be applied to any framework and compatible with a wide range of network architectures...

...We claim both local temporal context (i.e., moments immediately preceding and following an activity) and global temporal context (i.e., what happens during the whole video duration) are crucial. We propose to explicitly encode local and global temporal contexts by fusing features at appropriate scales in the feature hierarchy...

JaCoderX commented 6 years ago

@Kismuz is it possible to use BTgymMultiData with multiple versions of the same data but resampled for different time frames? So I would have the original data (1 min) and also 15 min, 30 min... The data itself is agnostic, but would multiple timeframes mess up the sampling or cause other technical issues?

I want to use a 'feature extraction network' on each timeframe and then combine them, in a way similar to the paper above.

If timeframes are not an issue, would it be possible to control each timeframe separately, so that each one has a different 'feature extraction network'?

From my experience, one simple and efficient way is to feed a bunch of the same statistics averaged over different time windows

This is a simple and clever idea. I will use it if I reach a dead end with my theory.

using multiple correlated assets to achieve better predictive performance on single-asset trading

Taken from a comment you wrote in #54, I assume that was one of the ideas behind developing BTgymMultiData. But how does the network learn from correlation? I think I'm missing some insight about how multi-asset data flows through the network.

Kismuz commented 6 years ago

@JacobHanouna,

is it possible to use BTgymMultiData with multiple versions of the same data but resampled for different time frames?

MultiData was designed to do exactly the opposite: ingest different data in the same timescale synchronously, to mimic the real-life situation where you subscribe to different instruments and receive quotes in parallel;

if you want different timeframes for a single instrument, there is no need to ingest more than one stream: you can always resample high-frequency data to lower-frequency timeframes and make those separate datafeeds. This is already done inside the strategy; the backtrader docs and community have a lot of discussion on data resampling; worth checking out.
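For reference, a minimal plain-backtrader sketch of that kind of resampling (filename assumed; this is outside the BTGym pipeline):

import backtrader as bt

# One 1-min stream ingested once, re-used at 15-min and 30-min scales
# via backtrader's built-in resampling.
cerebro = bt.Cerebro()
data = bt.feeds.GenericCSVData(
    dataname='eurusd_m1.csv',
    timeframe=bt.TimeFrame.Minutes,
    compression=1,
)
cerebro.adddata(data)  # datas[0]: original 1-min feed
cerebro.resampledata(data, timeframe=bt.TimeFrame.Minutes, compression=15)  # datas[1]
cerebro.resampledata(data, timeframe=bt.TimeFrame.Minutes, compression=30)  # datas[2]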

but how does the network learn from correlation?

an estimator can only learn by finding statistical correlations, be it in a single time-series or several of them; one sound and simple example is feeding two [potentially cointegrated] assets to find a mean-reverting pattern and build a trading strategy around it (pairs trading). Exploiting such a correlation cannot be achieved with a single instrument; another example is feeding two streams: the price of a single asset and, say, a somehow preprocessed stream of related sentiment; if price statistically follows sentiment, the correlation can be found and exploited;
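As an illustration of the pairs-trading idea, a minimal sketch of a mean-reversion signal (hypothetical helper, not btgym code):

import numpy as np

def pair_spread_zscore(price_a, price_b, window=64):
    # z-score of the log-spread between two potentially cointegrated assets;
    # a large |z| suggests the spread has diverged and may revert to its mean.
    spread = np.log(np.asarray(price_a)) - np.log(np.asarray(price_b))
    recent = spread[-window:]
    return (spread[-1] - recent.mean()) / (recent.std() + 1e-8)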

how multi asset data flow through the network

every asset goes through its own convolution encoder; then all features are concatenated and fed to the LSTM bank at a single time-step; by default all encoder weights are shared (assuming we ingest similar data streams like EUR/USD and EUR/CHF); this behaviour can be changed if the data streams you provide are of a different nature (e.g. price and sentiment) and you want to learn different sets of features.
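A rough Keras-style sketch of that data flow (illustration only; btgym's actual encoder differs, and all names here are assumed):

import tensorflow as tf

def multi_asset_features(asset_streams, share_weights=True):
    # One causal conv encoder per asset; features concatenated,
    # then fed to an LSTM producing a single context vector.
    encoders, features = {}, []
    for name, x in sorted(asset_streams.items()):  # x: [batch, time, channels]
        key = 'shared' if share_weights else name
        if key not in encoders:
            encoders[key] = tf.keras.layers.Conv1D(32, 2, padding='causal', name='enc_' + key)
        features.append(encoders[key](x))
    h = tf.keras.layers.Concatenate(axis=-1)(features)
    return tf.keras.layers.LSTM(64)(h)

streams = {'USD': tf.random.normal([4, 128, 5]), 'CHF': tf.random.normal([4, 128, 5])}
print(multi_asset_features(streams).shape)  # (4, 64)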

JaCoderX commented 6 years ago

@Kismuz, thank you for the explanation; multi-asset data flow and correlation are fascinating subjects.

you can always resample high-frequency data to lower-frequency timeframes and make those separate datafeeds. This is already done inside the strategy...

From what I know, in backtrader resampling is done at the engine level with cerebro.resampledata(...), and it can be standalone data or sit alongside data added directly via cerebro.adddata(...). Going over the BTGym code, the only place I found that actually adds the data is in BTgymServer (which makes sense, as we need to control sampling). So I don't understand what the pipeline is for adding resampled data.

From backtrader I know that the main data is passed to the strategy in the form of the datas object, and most of the strategies in this project use only datas[0] (the main data source), except for the new PairSpreadStrategy, which uses a second data line, datas[1]. But I assume BTgymMultiData is used for that.

MultiData was designed to do exactly the opposite: ingest different data in the same timescale synchronously, to mimic the real-life situation where you subscribe to different instruments and receive quotes in parallel;

Is the 'same timescale synchronously' requirement there because of a network limitation (something like the LSTM needing to consume data step by step)? Because I can think of a lot of other applications that could benefit from combining slowly-varying and fast-varying data. In backtrader, if I try to combine 2 different timescales in a strategy, I get the most recent data from each instrument.

JaCoderX commented 5 years ago

I don't understand what the pipeline is for adding resampled data

@Kismuz I went over the backtrader docs and forums but couldn't find a clue as to how to pass resampled data to BTGym. I even tried building an indicator that internally does the resampling operation (not a good idea).

Again, the problem is that resampling is done at the engine level, not the strategy level.

Kismuz commented 5 years ago

@JacobHanouna, one should separate 'data as information' from 'data as trading instrument'. When you feed price data into a backtesting system like backtrader, its usage is twofold:

When we talk about data as a trading quote, it is treated exactly as an OHLC price candle for the 'unit period' of the execution engine and the 'unit step' of the strategy: if you pass 1H data, the engine can make exactly one simulation step per hour, and strategy estimates can likewise be made once an hour;

If you pass 1-min data to backtrader, your 'unit time' is now 1 min: one next() strategy cycle stands for one minute, for the broker as well; so you can make decisions and place orders once per minute. If you want to 'resample' the trading quote, it can only be done at the 'broker' level.

Now look at data as an information flow: getting 1-min data doesn't mean a decision should be made every minute: one can implement logic that makes decisions hourly or daily; at any other time strategy.next() simply does not issue any orders. At this level one can 'resample' information arbitrarily.

By default the broker takes the datas[0] OHLC values to estimate portfolio value, order execution etc. At the same time one takes the same values as information and feeds them to indicators. Now one can treat that as an abstract signal: preprocess it, extract features etc.; one just needs to leave the broker alone with the original data stream to do a proper simulation. As an example: let's add hourly price data (of the trading asset of interest) and hourly weather data as a separate data stream. Now we want to accumulate daily weather information and make our trading decision based on whether the average temperature was higher or lower than the previous day's measurements. Here we resample the information on a daily basis via our strategy's programmed logic, but the broker has nothing to do with that: it doesn't even see this line of information. Technically, it sees all the original data lines passed in and can execute orders on any of them if requested, but that makes no sense: we should care which price line an order is issued for (by default it's datas[0]). We don't want to order 'sell' on the temperature line.

To sum up: passed-in data can be programmatically 'resampled' as decision-making information to any scale coarser than the original without affecting the broker's 'unit time'; passing resampled data in the first place changes the broker's 'unit time' (as well as the decision-making information scale).
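To illustrate, a minimal plain-backtrader sketch of strategy-level 'information resampling' on 1-min data (hypothetical strategy, not a btgym class):

import backtrader as bt

class HourlyDecisionStrategy(bt.Strategy):
    # The broker still steps once per 1-min bar, but orders are only
    # considered every 60 bars (once per hour on 1-min data).
    params = (('decision_period', 60),)

    def __init__(self):
        self.sma_fast = bt.indicators.SMA(self.data, period=16)
        self.sma_slow = bt.indicators.SMA(self.data, period=64)

    def next(self):
        if len(self) % self.p.decision_period:
            return  # any other minute: no orders issued
        if not self.position and self.sma_fast[0] > self.sma_slow[0]:
            self.buy()   # order still goes to the broker on the original 1-min line
        elif self.position and self.sma_fast[0] < self.sma_slow[0]:
            self.sell()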

JaCoderX commented 5 years ago

Thank you @Kismuz for an awesome explanation :)

I was mixing the two concepts together ('data as information' and 'data as trading instrument'). It makes much more sense now.

OK, so on the technical side, say I have 'data as information' and I want to use it. If I understand correctly, datas[0] should be reserved for 'data as trading instrument', as it serves as the clock of the system. But if I add 'data as information' to my cerebro instance (which I later pass to the env), wouldn't it end up as datas[0], since it would be the first data added?

I'm asking because when I look at BTgymServer I see that we first copy the cerebro that was passed in and only then add the feed.

I hope that my understanding is not completely wrong here :)

JaCoderX commented 5 years ago

@Kismuz can you please provide an example of how to add 'data as information'? I'm not sure I'm on the right track with it.

This is what I've tried: create a custom CSV feed for the new 'data as information' and import the data into backtrader with it:

class CustomFeed(bt.feeds.GenericCSVData):
    params = (
        ('nullvalue', 0.0),
        ('dtformat', 0),  # 1 gives you the option to work directly with timestamps

        ('datetime', 0),
        ('open', 1),
        ('high', 2),
        ('low', 3),
        ('close', 4),
        ('volume', 5),
        ('time', -1),
        ('openinterest', -1),
    )

ExternalData = CustomFeed(dataname='xxx.csv', timeframe=bt.TimeFrame.Minutes, compression=60)

and add it to the cerebro instance to be later passed via the BTGym env_config: cerebro.adddata(ExternalData)

JaCoderX commented 5 years ago

I came across this blog, which gives an overview of alternatives to RNNs. Several architectures are presented as state-of-the-art for seq2seq: Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction and Attention Is All You Need.

Results seem to be more or less similar, but the second one is interesting as it is based only on multi-head attention layers.

Another architecture presented in the blog is the 'hierarchical neural attention encoder' (paper):

In the hierarchical neural attention encoder multiple layers of attention can look at a small portion of recent past, say 100 vectors, while layers above can look at 100 of these attention modules, effectively integrating the information of 100 x 100 vectors. This extends the ability of the hierarchical neural attention encoder to 10,000 past vectors.

I'm planning to experiment with long sequences, and I'm curious whether the current conv_1d_casual_attention_encoder implementation has the same properties for working effectively with long sequences?

A few more papers worth looking into on the topic of attention: The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation - which combines both approaches from above - and Latent Alignment and Variational Attention.

Kismuz commented 5 years ago

@JacobHanouna, yes, this way is ok if you run your environment manually once; if you use the training framework you need your information sources to be in sync to allow episode sampling. The easiest way to do it is to use the MultiDiscreteEnv class setup. If you look closely at https://github.com/Kismuz/btgym/blob/master/examples/multi_discrete_setup_intro.ipynb

one can notice that two of the four data streams added are 'information streams' and cannot be traded:

engine.addstrategy(
    CasualConvStrategyMulti,
    ................
    asset_names={'USD', 'CHF'},  # that means we use JPY and GBP as information data lines only
    ................
    order_size={
        'CHF': 1000,  
        'USD': 1000,
    },
)
data_config = {
    'USD': {'filename': './data/DAT_ASCII_EURUSD_M1_2017.csv'},
    'GBP': {'filename': './data/DAT_ASCII_EURGBP_M1_2017.csv'},
    'JPY': {'filename': './data/DAT_ASCII_EURJPY_M1_2017.csv'},
    'CHF': {'filename': './data/DAT_ASCII_EURCHF_M1_2017.csv'},
}

dataset = BTgymMultiData( ...........)

env_config = dict(
    class_ref=MultiDiscreteEnv, 
............
)

Only data in the asset_names list can be traded, and only with the order sizes defined for them respectively. If, say, we want one instrument to be traded and three streams to carry decision-making information, it can be easily declared. Internally, BTgymMultiData performs time-date index consistency checks and allows only properly aligned data.

Note that you should configure your strategy to work with the data names given.

JaCoderX commented 5 years ago

The easiest way to do it is to use the MultiDiscreteEnv class setup.

Ok thanks I will start exploring it :)

@Kismuz I'm experimenting with conv_1d_casual_attention_encoder and conv_1d_casual_encoder, and apart from the attention part there is one more part that differs: conv_1d_casual_encoder also has a sliced_layers part.

I couldn't decide whether it is missing from conv_1d_casual_attention_encoder or whether it is a different algorithm?

JaCoderX commented 5 years ago

OK, I think I figured out the difference between the algorithms: the 'sliced' version takes only the last part of each layer, while attention needs to work on the whole layer.

JaCoderX commented 5 years ago

Since I already commented on conv_1d_casual_encoder, I think I can offer a small contribution here. The encoder has a param called conv_1d_overlap=1, and it is only used to determine the depth of the sliced layer.

If I understand correctly that this param is intended as the overlapping section of the conv filter, then it would make sense to set the stride in conv1d as follows:

  y = conv1d(
      ...
      stride=conv_1d_filter_size - conv_1d_overlap,
      ...
  )

(while checking/forcing stride > 0)
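The guard could be made explicit like this (hypothetical helper, assuming the parameter names above):

def casual_conv_stride(conv_1d_filter_size, conv_1d_overlap):
    # stride so that consecutive filter applications overlap by conv_1d_overlap steps
    stride = conv_1d_filter_size - conv_1d_overlap
    assert stride > 0, 'conv_1d_overlap must be smaller than conv_1d_filter_size'
    return stride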

Also @Kismuz, from your experience, would you say that using conv_1d_gated shows improvements?

Kismuz commented 5 years ago

@JacobHanouna,

it would make sense to set the stride in conv1d as follows...

! yes, indeed it is correct; would you make a pull request?

would you say that using conv_1d_gated shows improvements?

no, but that is possibly due to the fact that I have barely scratched the surface of the architecture choices;

JaCoderX commented 5 years ago

! yes, indeed it is correct; would you make a pull request?

sure :)

JaCoderX commented 5 years ago

I have a couple more technical questions regarding how the conv1d encoder works.

Kismuz commented 5 years ago

@JacobHanouna ,

when applying conv1d on multiple feature channels, does each channel get analyzed separately?

no; it can be done via separable convolution layers, see: https://www.tensorflow.org/api_docs/python/tf/nn/separable_conv2d ...but in practice it is painfully slow and does not bring substantial improvements (at least for the data representations being used);
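For reference, a minimal sketch of tf.nn.separable_conv2d on a [batch, 1, time, channels] tensor (shapes assumed): the depthwise step filters each input channel independently, the 1x1 pointwise step then mixes channels.

import tensorflow as tf

x = tf.random.normal([4, 1, 128, 6])         # [batch, 1, time, channels]
depthwise = tf.random.normal([1, 2, 6, 1])   # one 1x2 filter per input channel
pointwise = tf.random.normal([1, 1, 6, 32])  # 1x1 mixing into 32 features
y = tf.nn.separable_conv2d(x, depthwise, pointwise,
                           strides=[1, 1, 1, 1], padding='SAME')
print(y.shape)  # (4, 1, 128, 32)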

after the conv1d operation a second reshape is applied with a modified time dimension. Can you please explain a bit what is achieved this way?

it is done to be able to choose to work via either a static or a dynamic RNN graph; some explanation here:

https://kismuz.github.io/btgym/btgym.algorithms.html#module-btgym.algorithms.aac

...scroll down to the Note on the time_flat arg to read on.

JaCoderX commented 5 years ago

@Kismuz I'm reopening this discussion based on an article I read recently, Implementing Temporal Convolutional Networks.

In the comment section below the article I came across a comment that might have implications for conv_1d_casual_encoder and its kind, because of the use of layer norm instead of weight norm.

the commenter stated:

I think that replacing weight normalization with layer normalization makes the TemporalBlock not causal, since it normalizes using all the time steps, not only the past ones

and the reply:

Good catch! The layer normalization in the original implementation does leak distributional information from the future into the past. Since it does not directly leak the data into the past, I'd guess the impact is limited.