jesse

Preprocessing (mostly implemented already, but needs error checking, PEP8 compliance, and improvements to the file/class structure) --> something is currently wrong with normalization and batch processing and needs thorough checking. Please check the open issues on GitHub.

load all of the .csv files included as data (there will be more in the future, so load all files in the specified folder, either daily or hourly)
the columns should be normalized per file_name (pair/exchange); the same normalization can then be applied to the test set.
currently a filename token is given as input to the model; this can probably be removed or made optional.
do a 20-10-70% test/validation/training split: the last 20% of the rows of each file is for test.
provide a production option whereby all data is used for training
custom data loader with stratified sampling
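
A minimal sketch of the split and the per-file normalization described above, using only the standard library so it runs anywhere. The function names (chronological_split, fit_scaler, apply_scaler) are hypothetical, and fitting the mean/std on the training rows only is an assumption; the real pipeline would presumably do the same per column with pandas.

```python
from statistics import mean, pstdev


def chronological_split(n_rows, test_frac=0.2, val_frac=0.1, production=False):
    """Return (train, val, test) index ranges for one file, oldest rows first."""
    if production:
        # Production mode: every row is used for training.
        return range(0, n_rows), range(0, 0), range(0, 0)
    n_test = int(n_rows * test_frac)
    n_val = int(n_rows * val_frac)
    train_end = n_rows - n_test - n_val
    return (range(0, train_end),
            range(train_end, train_end + n_val),
            range(train_end + n_val, n_rows))


def fit_scaler(values):
    """Compute mean and standard deviation from training values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against constant columns
    return mu, sigma


def apply_scaler(values, scaler):
    """Apply a previously fitted (mean, std) pair to any split of the same file."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Toy "file" with a single column of 10 rows (assumed data, for illustration).
close = [float(v) for v in range(1, 11)]
train_idx, val_idx, test_idx = chronological_split(len(close))
scaler = fit_scaler([close[i] for i in train_idx])
test_scaled = apply_scaler([close[i] for i in test_idx], scaler)
```

One scaler would be kept per file_name, so that new rows from the same pair/exchange can be normalized with the statistics of its own training rows.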

Please check and address all the issues on GitHub.

Model

allow user to specify the y column to predict. (e.g. Sell_p40_a4)
allow a list of columns to be removed from the model input (i.e., Top_p15_a4,Btm_p15_a4,Buy_p15_a4,Sell_p15_a4,Top_p40_a1,Btm_p40_a1,Buy_p40_a1,Sell_p40_a1,Top,Btm,last_pivot)
the model input has dimensions all_features x n, where n is the sequence length (n can be set and could be for instance 14 days)
Implement the following model architectures as classes:
- Transformer
- 2-layer LSTM
- 2-layer LSTM with self-attention
- Simple 2-layer FC
- WaveNet (will need a bigger input window, please allow easy changing of all model parameters when calling the class)
Add a training/evaluation function with loss/accuracy plots for the training, validation, and test sets
Output the confusion matrix for the test set
function to save / load model and predict based on small input dataframe (m rows)
batchnorm for training optimization
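
A minimal sketch of how the y column, the removed columns, and the sequence length n could turn rows into model inputs. It is an illustration under assumptions, not the existing preprocessing code: make_windows is a hypothetical name, "Close" is a made-up feature column, and the label is assumed to be the y value of the row right after the window.

```python
def make_windows(rows, target, drop=(), n=14):
    """Build n-to-1 samples from a list of row dicts (oldest row first).

    Each sample is (window, label): window holds n consecutive rows of
    feature values, label is the target value of the row right after it.
    """
    # The target column and the user-supplied columns are kept out of the input.
    excluded = set(drop) | {target}
    features = [c for c in rows[0] if c not in excluded]
    samples = []
    for end in range(n, len(rows)):
        window = [[rows[t][c] for c in features] for t in range(end - n, end)]
        samples.append((window, rows[end][target]))
    return features, samples


# Toy data: 6 rows, one real feature, one column to drop, one target column.
rows = [{"Close": float(t), "Top": 0, "Sell_p40_a4": t % 2} for t in range(6)]
features, samples = make_windows(rows, target="Sell_p40_a4", drop=["Top"], n=3)
first_window, first_label = samples[0]
```

Each window here has shape n x all_features; a framework tensor would typically add a batch dimension in front.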

Predict
- create a predict function that loads the final production model (trained on training + test set) and feeds in a csv from the data folder: 'production_data_for_new_prediction'
- output dataframe with predictions
- output confusion matrix for this data
- allow me to control the prediction cutoff threshold (default 0.5) so that I can increase the precision if needed.
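
A minimal sketch of the threshold control and the confusion matrix, assuming a binary target and a model that outputs one probability per row. The function names and the toy numbers are made up for illustration.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Count tn/fp/fn/tp for binary labels and predicted probabilities."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and truth == 1:
            counts["tp"] += 1
        elif pred == 1 and truth == 0:
            counts["fp"] += 1
        elif pred == 0 and truth == 1:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def precision(cm):
    """Fraction of predicted positives that are true positives."""
    predicted_pos = cm["tp"] + cm["fp"]
    return cm["tp"] / predicted_pos if predicted_pos else 0.0


# Toy labels and probabilities (assumed values).
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.2, 0.6, 0.7, 0.9, 0.4, 0.55]

cm_default = confusion_matrix(y_true, y_prob)                 # threshold 0.5
cm_strict = confusion_matrix(y_true, y_prob, threshold=0.65)  # stricter cutoff
```

On this toy data, raising the threshold from 0.5 to 0.65 removes both false positives, so precision goes from 0.5 to 1.0 while the number of missed positives stays at one.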

A. Variant: n-to-1

predicts the next 1 element of the specified column.

B. Variant: n-to-m

predicts the last m elements of column y.
it can use the elements of column y from t=0 until t=t-m as input (this is not the case for Variant n-to-1)
it can use the other x columns as input until t=t as usual.
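
A minimal sketch of what one n-to-m sample could look like, to make the indexing explicit. make_n_to_m_sample is a hypothetical name, and limiting the y history to the same n-row window as the x columns is an assumption; the text above only says that y is visible until t-m and the x columns until t.

```python
def make_n_to_m_sample(x_rows, y, t, n, m):
    """Build one n-to-m sample whose window ends at index t (inclusive).

    x_rows: list of feature vectors; y: list of target values (same length).
    """
    start = t - n + 1
    if start < 0 or m > n or t >= len(y):
        raise ValueError("window does not fit in the series")
    x_window = x_rows[start:t + 1]   # other columns: visible up to t
    y_history = y[start:t - m + 1]   # column y: visible only up to t - m
    y_target = y[t - m + 1:t + 1]    # last m values of y: to be predicted
    return x_window, y_history, y_target


# Toy series with 10 time steps and 2 feature columns (assumed data).
x_rows = [[i, i * 10] for i in range(10)]
y = list(range(100, 110))
x_window, y_history, y_target = make_n_to_m_sample(x_rows, y, t=9, n=5, m=2)
```

Here the window covers rows 5 to 9, the model sees y for rows 5 to 7, and it has to predict y for rows 8 and 9.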

Notes:

the model can be trained either on the hourly folder or on the daily folder. The handling of the datetime column should be flexible enough to accommodate both.
Please document the code very well and use the PEP8 standard; it's ok to create many files/classes etc.

Please push regularly to the repository.

We need easy functions to create/train/predict on new data (from other sources), which we can call from Colab, e.g.:

  my_model = Model_n_to_1(n=14, layers=3, ...)
  results = my_model.train(epochs=6, device=gpu_1, data='datafolder', production=False, save='filename')
  my_model.load('filename')
  predictions = my_model.predict(test_data=my_dataframe)

or a slightly better syntax if you can suggest it.
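
One possible variation, sketched under assumptions: the training arguments from the calls above could be grouped in a small dataclass, so that the same settings can be stored next to the saved model and reused. TrainConfig and its defaults are hypothetical; only the argument names come from the example above.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class TrainConfig:
    """Hypothetical container for the arguments of my_model.train(...)."""
    data: str                    # folder with the .csv files (daily or hourly)
    epochs: int = 6
    device: str = "cpu"          # assumed default; a GPU name would go here
    production: bool = False     # True: use all data for training
    save: Optional[str] = None   # file name for the trained model


config = TrainConfig(data="datafolder", epochs=6, save="filename")
train_kwargs = asdict(config)
# The dictionary can then be unpacked into the train call:
# results = my_model.train(**train_kwargs)
```

This keeps the Colab calls short and makes a run reproducible from its config, at the cost of one extra class.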

dorienh / jesse

readme

jesse