enzoampil / fastquant

fastquant — Backtest and optimize your ML trading strategies with only 3 lines of code!
MIT License

How to pass Highs/Lows from DF to a Strategy? #175

Closed windowshopr closed 4 years ago

windowshopr commented 4 years ago

Hey!

Love the tool, just started using it. I see a lot of examples of how to apply some indicators to the closing price in the strategies.py file, but I'm curious because I want to build a strategy using the PSAR, and this requires the High and Low from the passed in DF as well. In the strategies.py file, there's a couple of format mapping examples of "c" and "cv" for I'm assuming the "close" price, or whatever price is in the DF being passed in, but how can one also include the highs and lows, or basically the entire df?

I'm assuming we'd have to create our own format mapping for "ohlcv" in the strategies.py file, and then also add to the:

    self.dataclose = self.datas[0].close
    self.dataopen = self.datas[0].open

line to include highs and lows and volumes? But how would we go about doing this?

Thanks!

enzoampil commented 4 years ago

Hi @windowshopr, thanks for using fastquant 😄

First, you are correct that our format mappings are stored in DATA_FORMAT_MAPPING. Currently we don't support highs and lows, or any of the ohlcv metrics other than close (c) and volume (v). We do have an existing issue (#83), though, to automatically infer the data format from the columns so that any permutation of ohlcv is supported.

As a quicker fix though, we can also just add a new data format to DATA_FORMAT_MAPPING for the whole ohlcv (assuming your dataframe has all 5 metrics, with the same column names).

    "ohlcv": {
        "datetime": 0,
        "open": 1,
        "high": 2,
        "low": 3,
        "close": 4,
        "volume": 5,
        "openinterest": None,
    }
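
Before adding a mapping like this, it can help to confirm that the dataframe's columns actually sit at the positions the mapping declares. The sketch below is plain Python; `check_format` is a hypothetical helper written for illustration (not part of fastquant), and the column names and order are assumptions.

```python
# Hypothetical helper (not part of fastquant): check that a list of column
# names lines up with the positions declared in a format mapping.
OHLCV_MAPPING = {
    "datetime": 0,
    "open": 1,
    "high": 2,
    "low": 3,
    "close": 4,
    "volume": 5,
    "openinterest": None,  # treated here as "column not present"
}


def check_format(columns, mapping):
    """Return the fields whose declared position does not match `columns`."""
    mismatched = []
    for field, position in mapping.items():
        if position is None:
            continue  # field is declared as absent, nothing to check
        if position >= len(columns) or columns[position] != field:
            mismatched.append(field)
    return mismatched


# Column order assumed for illustration: date first, then o/h/l/c/v.
full_frame = ["datetime", "open", "high", "low", "close", "volume"]
close_only = ["datetime", "close"]

print(check_format(full_frame, OHLCV_MAPPING))  # no mismatches expected
print(check_format(close_only, OHLCV_MAPPING))  # price/volume fields are reported
```

A close-only frame fails the check for every price and volume field, which is the situation where an indicator that needs highs and lows has nothing to work with.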

After this, you're correct that we just add the lines below to the constructor of BaseStrategy:

    self.datahigh = self.datas[0].high
    self.datalow = self.datas[0].low
    self.datavolume = self.datas[0].volume

As for implementing PSAR, this thread can guide you on how to do it in backtrader from scratch, but we can also "register" this strategy on fastquant. Essentially, we'll just have to add a new strategy class, say PSARStrategy (which inherits from BaseStrategy). It shouldn't be more than 25 lines of (uncommented) code 😄

If interested, please feel free to implement the above and send over a PR, will be glad to guide you through it as well! If not, we can leave this as a feature request and we can have this added within the next few weeks 😄

windowshopr commented 4 years ago

Incredible answer, thank you so much.

Let me play with creating the strategy locally and refer to the resources you sent me first so I can ensure I get it working first. I've already played with creating my own strategies based on the moving averages that use the "c" price passed to it already, so I want to try and create a basic strategy like the PSAR for some other columns as well. I will close the "issue" for now, but will request a PR (I've never done one before :D ) when I get it working. Would love to see others contribute to that strategy catalogue! Will advise when I get some time to work on it. Thanks!

enzoampil commented 4 years ago

Thanks @windowshopr and best of luck!

windowshopr commented 4 years ago

@enzoampil Alright, I had some time tonight, and I'm stumped on where I'm going wrong, hoping to get some insight.

So in strategies.py:

  1. I added the following under DATA_FORMAT_MAPPING = {:
    "ohlcv": {
        "datetime": 0,
        "open": 1,
        "high": 2,
        "low": 3,
        "close": 4,
        "volume": 5,
        "openinterest": None,
    }
  2. I added the following data variables:
        self.datahigh = self.datas[0].high
        self.datalow = self.datas[0].low
        self.datavolume = self.datas[0].volume
        self.dataopenint = self.datas[0].openinterest
  3. My custom PSARStrategy class looks like this (and I think this is where something might be incorrect):
class PSARStrategy(BaseStrategy):
    """
    Parabolic Stop and Reversal Strategy

    Parameters
    ----------
    period : int
        The period used for the PSAR indicator
    af : float
        Acceleration factor used by the indicator (default is 0.02)
    afmax : float
        Maximum acceleration factor allowed by the indicator (default is 0.2)
    """

    params = (
        ("period", 2), # Period used for the PSAR indicator
        ("af", 0.02),  # Step-wise acceleration 
        ("afmax", 0.2) # Max acceleration factor
    )

    def __init__(self):
        # Initialize global variables
        super().__init__()
        # Strategy level variables
        self.period = self.params.period
        self.af = self.params.af
        self.afmax = self.params.afmax

        print("===Strategy level arguments===")
        print("period :", self.period)
        print("af :", self.af)
        print("afmax :", self.afmax)

        psar = bt.ind.PSAR(period=self.period, af=self.af, afmax=self.afmax)

        self.crossover = bt.ind.CrossOver(
            self.dataclose, psar
        )  # crossover signal

    def buy_signal(self):
        return self.crossover > 0

    def sell_signal(self):
        return self.crossover < 0
  4. Added "psar": PSARStrategy, to the STRATEGY_MAPPING = {

I created a simple backtest script that downloads AAPL's stock data for the last 5 years or so and runs backtest(). The script runs without error; however, 1) the SAR isn't plotted (there's a legend for it, but the dots are nowhere to be seen), and 2) the strategy doesn't seem to be working, in that no trades are placed.

I think it has to do with how I'm calling the PSAR indicator. I also tried passing it the full df by adding self.datas[0] to the beginning of the indicator function, but same thing. What am I missing!? :D

Thanks!

enzoampil commented 4 years ago

Can you post your code used and the full logs, and the picture if possible? 😁

enzoampil commented 4 years ago

@windowshopr Ah I think it might be the format parameter.

When you call the backtest function, can you set data_format="ohlcv".

This has to be explicitly set for now, but we're working on automating it in a coming release 😁

windowshopr commented 4 years ago

@enzoampil THAT'S EXACTLY WHAT IT WAS! THANK YOU THANK YOU THANK YOU!

Worked like a charm! No need for a pull request, feel free to just add that strategy to the next round of improvements! I should mention that I used the line psar = bt.ind.PSAR(self.datas[0], period=self.period, af=self.af, afmax=self.afmax) in the strategies.py file. Ah can't wait to play with it now haha thanks a lot for your help! Great work!

enzoampil commented 4 years ago

Welcome @windowshopr, glad to have helped! Sure you don't want to send a quick PR? 😄 I can add you as a collaborator to make it simpler as well.

Just have to save your code in a separate branch and commit it (reference here) :)

But no worries as well if no! Thanks again for stopping by 😄

windowshopr commented 4 years ago

@enzoampil haha well, perhaps, if you can be a hero one more time and help me debug this last issue.

I've posted a minimal example of the code I'm testing below so that anyone can run it, so long as the PSARStrategy has been made as described above in the strategies.py file.

Basically, I download a dataset, split it into train and test sets, and then attempt to run the backtest() function on both. The train run works fine, but the test run returns the following traceback, and I'm stumped as to why because from what I can see, both functions are the exact same, just using different datasets??:

train_results :
   init_cash  buy_prop  sell_prop execution_type  period  ...     rnorm   rnorm100  sharperatio     pnl  final_value
0       1000         1          1          close       3  ... -0.141852 -14.185182    -8.936320 -258.20   741.802466
1       1000         1          1          close       2  ... -0.141852 -14.185182    -8.936320 -258.20   741.802466
2       1000         1          1          close       3  ... -0.148218 -14.821766    -2.345724 -268.90   731.096908
3       1000         1          1          close       3  ... -0.148218 -14.821766    -2.345724 -268.90   731.096908
4       1000         1          1          close       2  ... -0.148218 -14.821766    -2.345724 -268.90   731.096908
5       1000         1          1          close       2  ... -0.148218 -14.821766    -2.345724 -268.90   731.096908
6       1000         1          1          close       3  ... -0.152320 -15.231992   -11.877703 -275.76   724.238285
7       1000         1          1          close       2  ... -0.152320 -15.231992   -11.877703 -275.76   724.238285

[8 rows x 14 columns]
Traceback (most recent call last):
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'rnorm'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test.py", line 147, in <module>
    period=psar_period, af=af, afmax=afmax, verbose=False, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\fastquant\strategies.py", line 853, in backtest
    optim_idxs = np.argsort(metrics_df[sort_by].values)[::-1]
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 2648, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 1619, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 1627, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'rnorm'

I did some digging by turning verbose on for both training and testing, and it seems like testing is not even attempting to grid search over the ranges I have defined, but it does for train beforehand. Turning verbose on, one will see that print("Number of strat runs:", len(stratruns)) returns the 8 combinations, but just before this error occurs during the test part, it returns a 0, so no table is actually being generated during the test run, hence it can't sort by the rnorm.
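
The count of 8 is what one would expect if the grid search takes the Cartesian product of the parameter ranges (an assumption about the internals, though it matches the 8 rows in the table above). The sketch below uses `itertools.product` with two values per parameter, written as plain lists, and also shows that a single empty range collapses the whole grid to zero runs.

```python
from itertools import product

# Two values per parameter, written out as lists for clarity.
psar_period = [2, 3]
af = [0.02, 0.03]
afmax = [0.2, 0.3]

# Every combination of (period, af, afmax): 2 * 2 * 2 of them.
grid = list(product(psar_period, af, afmax))
print(len(grid))

# If any one range yields nothing, there are no combinations at all.
empty_grid = list(product(psar_period, [], afmax))
print(len(empty_grid))
```

So a "Number of strat runs" of 0 suggests that at least one of the parameter ranges produced no values on that call.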

Two things to note here are 1) it'll run fine if defining only 1 int/float value (not range) for the PSAR params in the beginning, and 2) I use a function to allow a range using floats. I'm not convinced that's what's causing the issue as it runs fine for the training run. And this issue is only happening as part of this new strategy build, so I'd like to get to the bottom as to why.

Rescue me one more time if you have the time? :D haha. Here is my code. Thank you! :

from fastquant import get_stock_data, backtest
import matplotlib.pyplot as pl
import pandas as pd
import numpy as np
pl.style.use("default")

###############################
# User Variables
###############################

# List of symbols to backtest
symbols_list = ['AAPL']

# Price to backtest, open or close, or all?
price_to_download = 'ohlcv' # 'o', 'c'

# For historical dataset download
start_date = '2015-01-01'  # yyyy-mm-dd
end_date = '2020-08-04'    # yyyy-mm-dd

# How big to make your 'training' backtest dataset (percentage).
# Test set will be what's left.
# NOTE: Make sure you have enough rows/days in the test
# dataset chunk for the indicators to calculate from, otherwise
# you'll get an Index out of range error.
train_perc = 0.35

# Trading commission, as a fraction of each trade's value (0.0075 = 0.75%)
commission_fee = 0.0075

# Starting account balance
init_cash = 1000

# Want to see the training, testing and valid plots for each stock?
show_plots = True

# What percentage of account balance to buy with
# Example, if 10 stocks listed above, and 10,000 init_cash, this 
# variable would be 0.1, or use 10% of total account to buy a position per stock
#buy_prop_perc = float(init_cash/len(symbols_list)/init_cash)
# or
buy_prop_perc = 1

# What percentage of position to sell
sell_prop_perc = 1

# The strategy to backtest
backtest_this_strategy = "psar"

# Make a function to allow ranges with float values
def range_with_floats(start, stop, step):
    while stop > start:
        yield round(start, 4)
        start += step
# Define some ranges to search across while training strategy picked above
# PSAR
psar_period = range(2, 4, 1) # Period used for the PSAR indicator
af = range_with_floats(0.02, 0.04, 0.01) # Step-wise acceleration 
afmax = range_with_floats(0.2, 0.4, 0.1) # Max acceleration factor

###############################
# Main Function
###############################

# Backtest the strategy on the Train dataset (oldest part of dataset), then
# test on Test dataset

for symbol in symbols_list:

    print(str(symbol))

    # Download the datasets and run some checks to make sure they're big enough
    try:
        df = get_stock_data(symbol, 
                        start_date=start_date,
                        end_date=end_date,
                        format=price_to_download, # "Open" by default
                       )
        print(df.head())
    except Exception:
        print('Dataset for ' + str(symbol) + ' could not be downloaded. Skipping it...')
        continue

    # Make sure it starts at least in the same year as you want it to
    if str(df.index[0])[0:4] != start_date[0:4]:
        print('Dataset for ' + str(symbol) + ' isnt big enough (1). Skipping it...')
        continue
    # Make sure the first values in the dataset isn't a NaN!
    if df.iloc[0:2,:].isnull().values.any():
        print('Dataset for ' + str(symbol) + ' isnt big enough (2). Skipping it...')
        continue
    # Make sure the last values in the dataset aren't NaNs
    if df.iloc[-2:, :].isnull().values.any():
        print('Dataset for ' + str(symbol) + ' is cut off. Skipping it...')
        continue

    # Train/Test/Validate Split
    train_df = df.iloc[ : int(len(df)*train_perc), :]
    test_df = df.iloc[int(len(df)*train_perc) : , :]
    print(train_df)
    print(test_df)

    ###############################
    # TRAIN
    ###############################
    if backtest_this_strategy == 'psar':
        train_results = backtest(backtest_this_strategy, train_df, data_format=price_to_download, commission=commission_fee, init_cash=init_cash, 
                                 period=psar_period, af=af, afmax=afmax, verbose=False, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)

    print('train_results :')
    print(train_results)

    # If the top rows have nans from param crossovers, use the next available row with real data
    while True:
        if train_results['pnl'].iloc[0] == np.nan or train_results['rnorm'].iloc[0] == np.nan:
            train_results = train_results.iloc[1:,:]
        else:
            break

    ###############################
    # TEST
    ###############################
    if backtest_this_strategy == 'psar':
        test_results = backtest(backtest_this_strategy, test_df, data_format=price_to_download, commission=commission_fee, init_cash=init_cash, 
                                period=psar_period, af=af, afmax=afmax, verbose=False, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)

    print('test_results :')
    print(test_results)

    # If the top rows have nans from param crossovers, use the next available row with real data
    while True:
        if test_results['pnl'].iloc[0] == np.nan or test_results['rnorm'].iloc[0] == np.nan:
            test_results = test_results.iloc[1:,:]
        else:
            break

enzoampil commented 4 years ago

Isn't the error coming from here? Why not just drop the NaNs with train_results.dropna()?

    # If the top rows have nans from param crossovers, use the next available row with real data
    while True:
        if train_results['pnl'].iloc[0] == np.nan or train_results['rnorm'].iloc[0] == np.nan:
            train_results = train_results.iloc[1:,:]
        else:
            break
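
One reason to prefer dropping NaNs over that loop: a comparison written as `value == np.nan` is always False, because NaN never compares equal to anything, so the loop breaks on its first pass without skipping any rows. The sketch below shows this with a plain Python float NaN (`np.nan` is the same kind of value) and a list of dicts standing in for the results table; the sample numbers are only illustrative.

```python
import math

nan = float("nan")  # np.nan is the same kind of IEEE 754 float value

# NaN is not equal to anything, including itself.
print(nan == nan)       # False
print(math.isnan(nan))  # True

# Stand-in for the results table: one row with NaNs, one with real numbers.
rows = [
    {"pnl": nan, "rnorm": nan},
    {"pnl": -258.20, "rnorm": -0.141852},
]

# Keep only rows where both metrics are real numbers.
clean = [r for r in rows if not (math.isnan(r["pnl"]) or math.isnan(r["rnorm"]))]
print(len(clean))  # 1
```

On a pandas DataFrame, `dropna(subset=[...])` performs the same filtering in one call.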

windowshopr commented 4 years ago

I FIGURED IT OUT!

It's because of the function I was using for creating ranges with floats, the test run didn't like it for some reason. So I replaced:

psar_period = range(2, 4, 1) # Period used for the PSAR indicator
af = range_with_floats(0.02, 0.04, 0.01) # Step-wise acceleration 
afmax = range_with_floats(0.2, 0.4, 0.1) # Max acceleration factor

...with...

psar_period = range(2, 4, 1) # Period used for the PSAR indicator
af = np.arange(0.02, 0.04, 0.01) # Step-wise acceleration 
afmax = np.arange(0.2, 0.4, 0.1) # Max acceleration factor

and got rid of that range_with_floats function and it worked! I'm thinking maybe it worked as needed during the first run, but when that function was called again, it was trying to "yield" more floats from the end of its last run or something, so it wasn't returning anything when called on during the test run? Either way, that np.arange works for float ranges! Works like a charm now!
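
That guess is consistent with how Python generators behave: a generator object can be iterated only once, so reusing the same object for a second run yields nothing. A minimal sketch, using 0.0, 1.0 and 0.5 as inputs so the float arithmetic is exact:

```python
def range_with_floats(start, stop, step):
    # Same generator function as in the earlier script.
    while stop > start:
        yield round(start, 4)
        start += step


values = range_with_floats(0.0, 1.0, 0.5)
first_pass = list(values)   # consumes the generator
second_pass = list(values)  # nothing left to yield

print(first_pass)   # [0.0, 0.5]
print(second_pass)  # []

# Materialising the values once gives a sequence that can be reused freely.
reusable = list(range_with_floats(0.0, 1.0, 0.5))
print(list(reusable) == list(reusable))  # True
```

An array from `np.arange`, like the list above, holds its values in memory, so it can be passed to the train run and then again to the test run.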

I'll upload a full script here later tonight of how I'm using this tool in a backtest strategy using the PSAR :) Thanks!! Talk to you in a few hours!

windowshopr commented 4 years ago

Ok here is my contribution, a bit of an example script. I put a great deal of effort into comments so it's easy to follow for a newbie. I also have some notes near the top comment section about some of the little bugs currently present in it, but as is, it should run for anyone. If one wanted to use the 'psar' strategy, they would need to follow the steps in our above work, but I have it set now to run on any strategy except the sentiment and the bbands as there's a couple of bugs to work out, but anyway!

Basically it's a script that will allow a user to find the best strategy params from a training and testing run, and use those params on the validation set to see how well it "could have" profited. Take a read when you have some time and give it a whirl. Also, check how the sentiment and bbands (np.arange) issue could be resolved?

Thanks a lot! Maybe this could be included in the examples folder once all bugs are worked out? :P Anyway, just my way of giving back. Thanks! Here's the full script:

# Authored by windowshopr. Enjoy!

from fastquant import get_stock_data, backtest, get_bt_news_sentiment
import matplotlib.pyplot as pl
import pandas as pd
import numpy as np
pl.style.use("default")

# BACKTEST A TECHNICAL TRADING STRATEGY BY USING A TRAIN, TEST AND VALIDATION DATASET.

# WARNING: One must always be mindful about overfitting while running a backtest.
#          Just because it performs well in the past, doesn't mean it'll do as well
#          in the future. The point of using both a train and test dataset is to see
#          how well the strategy holds up over time, then use it on a validation dataset
#          to see how well it would have actually performed over that course of time. 
#          Again, care must be taken to not cherry pick great stocks today, if they weren't
#          great during your train/test period because now you've created a look ahead
#          bias. A way to combat this is to keep the test/validation datasets small enough
#          so that it would be more of an accurate representation of how the strategy would
#          have performed under recent market conditions.

# This script grid searches across different strategy parameters (for example, different
# moving average periods) to find the most profitable/best normalized returned combination
# using a training dataset. It then performs the same operation on a test dataset, and then
# takes the average "rnorm" (and 'PnL') of each parameter combination for both runs. 

# Now armed with the best strategy parameters from both runs, use those strategy parameters
# on a validation set (i.e. the most recent X amount of time in your dataset) to see how well
# the strategy "would have" performed.

# This script is meant to be an entry level boilerplate for anyone looking to use FastQuant's
# library to perform meaningful backtests of technical indicator based trading strategies. The
# code is hardly streamlined/efficient, but it encompasses all the default strategies, plus my 
# locally added PSAR strategy (see here: https://github.com/enzoampil/fastquant/issues/175 
# to create on your machine), and it gives the user an easy to read flow of code. There are 
# improvements that can be made, but leaving as is for now. Also, review that post for an intro
# on how to create your own strategies in the 'strategies.py' file of FastQuant. Enjoy!

# Another note, SENTIMENT strategy needs more testing to get it working with this script.
# It'll run the training session, but not the subsequent testing/valid runs. Leaving for now.

# Also, BBANDS strategy doesn't accept np.arange's for its devfactor variable. Need to pass
# it one value at a time for testing. Maybe FastQuant can adjust code to account for np.arange
# ranges for float values?

###############################
# User Variables
###############################

# List of symbols to backtest
symbols_list = ['AAPL',  'ABBV', 'ABT',  'ACN',  'ADBE', 'AIG',  'ALL',  'AMGN', 'AMT',  'AMZN', 'AXP',  'BA', 
                'BAC',  'BIIB', 'BK',   'BKNG', 'BLK',  'BMY', 'C',  'CAT',  'CHTR', 'CL',   'CMCSA',    'COF',  
                'COP',  'COST', 'CRM',  'CSCO', 'CVS',  'CVX',  'DD',   'DHR',  'DIS',  'DOW',  'DUK',  'EMR',  
                'EXC',  'F',    'FB',   'FDX',  'GD',   'GE',   'GILD', 'GM',   'GOOG', 'GOOGL',    'GS',   'HD',   
                'HON',  'IBM',  'INTC', 'JNJ',  'JPM',  'KHC',  'KMI',  'KO',   'LLY',  'LMT',  'LOW',  'MA',   
                'MCD',  'MDLZ', 'MDT',  'MET',  'MMM',  'MO',   'MRK',  'MS',   'MSFT', 'NEE',  'NFLX', 'NKE',  
                'NVDA', 'ORCL', 'OXY',  'PEP',  'PFE',  'PG',   'PM',   'PYPL', 'QCOM', 'RTX',  'SBUX', 'SLB',  
                'SO',   'SPG',  'T',    'TGT',  'TMO',  'TXN',  'UNH',  'UNP',  'UPS',  'USB',  'V',    'VZ',  
                'WBA',  'WFC',  'WMT',  'XOM'] # S&P100 as of August 2020

# Price to download and pass to Cerebro, open only, close only, or ohlcv?
price_to_download = 'ohlcv' # 'o', 'c', 'ohlcv'

# For historical dataset download
start_date = '2015-01-01'  # yyyy-mm-dd
end_date = '2020-08-04'    # yyyy-mm-dd

# How big to make your 'training' backtest dataset (percentage).
# The test and validation sets will be half each of what's left.
# NOTE: Make sure you have enough rows/days in the test/valid
# dataset chunks for the indicator param ranges to calculate from, 
# otherwise you'll get an "Index out of Range" type error.
train_perc = 0.7

# Trading commission, as a fraction of each trade's value (0.0075 = 0.75%)
commission_fee = 0.0075

# Starting account balance
init_cash = 1000 # default is 10000

# Want to see the training, testing and valid plots for each stock?
show_plots = True

# What average metric do you want to optimize for between the train and test runs?
best_metric = 'rnorm' # 'rnorm', 'pnl', 'sharperatio'

# Show each transaction and other data during each run?
# Warning: lots of data will be printed if True
verbose = False

# What percentage of account balance to buy with
# Example, if 10 stocks listed above, and 10,000 init_cash, this 
# variable would be 0.1, or use 10% of total account to buy a position per stock
#buy_prop_perc = float(init_cash/len(symbols_list)/init_cash)
# or, use the below
buy_prop_perc = 1 # 0.1 for 10% of current account balance

# What percentage of position to sell
sell_prop_perc = 1

# The strategy to backtest ('psar' was added locally, won't work unless created in the 'strategies.py' file)
backtest_this_strategy = "smac" # 'smac', 'emac', 'rsi', 'macd', 'bbands', 'buynhold', 'sentiment', 'multi'

# Define some ranges to search across while training strategy picked above. These are by no means 
# 'optimal' values, and remember, range() excludes its stop value, so the last value used is
# the stop (middle number) minus the step (right number). See the note below.
# MA Cross strategies (and MACD)
fast_period = range(5, 20, 5) # 5, 10, and 15 will be used in this range.
slow_period = range(20, 60, 10)
# MACD
signal_period = range(2, 22, 2)
sma_period = range(10, 210, 10)
dir_period = range(3, 36, 3)
# RSI
rsi_period = range(2, 32, 2)
rsi_upper = range(55, 100, 5)
rsi_lower = range(45, 0, -5)
# BBands
period = range(10, 55, 5)
devfactor = np.arange(2.0, 4.0, 1.0)
# PSAR
psar_period = range(2, 4, 1) # Period used for the PSAR indicator
af = np.arange(0.02, 0.04, 0.01) # Step-wise acceleration 
afmax = np.arange(0.2, 0.4, 0.1) # Max acceleration factor
# Sentiment (haven't tested it)
#keyword = [] # Using the stocks ticker by default
page_nums = 3 # Default
senti = 0.2

# Only used if backtest_this_strategy = 'multi'. Basically, whenever ONE of
# the strategies in this dictionary is triggered, track it.
if backtest_this_strategy == 'multi':
    strats_opt = { 
        "smac": {"fast_period": fast_period, "slow_period": slow_period, "buy_prop": buy_prop_perc, "sell_prop": sell_prop_perc, "init_cash":init_cash, "commission":commission_fee},
        "emac": {"fast_period": fast_period, "slow_period": slow_period, "buy_prop": buy_prop_perc, "sell_prop": sell_prop_perc, "init_cash":init_cash, "commission":commission_fee}
    }

###############################
# Main Function
###############################

# Backtest the strategy on the Train dataset (oldest part of dataset), then
# test on Test dataset, get the average 'best_metric' for all strategy combinations between
# the two runs, and use the best param combo on the Validation DF to see how much 'could have 
# been' profited.

# Store all validation profits in one list to sum at the end
valid_profits = []

for symbol in symbols_list:

    print('***************************' + str(symbol) + '***************************')

    # Download the datasets
    try:
        df = get_stock_data(symbol, 
                        start_date=start_date,
                        end_date=end_date,
                        format=price_to_download, # "Open" by default
                       )
        print(df.head())
    except Exception:
        print('Dataset for ' + str(symbol) + ' could not be downloaded. Skipping it...')
        continue

    # Make sure it starts at least in the same year as you asked it to
    if str(df.index[0])[0:4] != start_date[0:4]:
        print('Dataset for ' + str(symbol) + ' isnt big enough (1). Skipping it...')
        continue
    # Make sure the first values in the dataset aren't NaNs
    if df.iloc[0:2,:].isnull().values.any():
        print('Dataset for ' + str(symbol) + ' isnt big enough (2). Skipping it...')
        continue
    # Make sure the last values in the dataset aren't NaNs
    if df.iloc[-2:, :].isnull().values.any():
        print('Dataset for ' + str(symbol) + ' is cut off. Skipping it...')
        continue

    # Now, if sentiment strategy is used, download the sentiments dataset
    if backtest_this_strategy == 'sentiment':
        print('Collecting sentiment data...')
        sentiments = get_bt_news_sentiment(keyword=str(symbol), page_nums=page_nums)

    # Train/Test/Validate Split
    # The train dataset will be 'train_perc' of the entire dataset, the test set will be half
    # of what's left, and the validation set will be the other half, validation being the most
    # recent section.
    train_df = df.iloc[ : int(len(df)*train_perc), :]
    test_df = df.iloc[int(len(df)*train_perc) : -int((int(len(df))-int(len(df)*train_perc))/2), :]
    valid_df = df.iloc[-int((int(len(df))-int(len(df)*train_perc))/2) : , :]
    print(train_df)
    print(test_df)
    print(valid_df)

    ###############################
    # TRAIN
    ###############################
    # Start the training session. Depending on which strategy is called for, use the appropriate ranges defined in the
    # beginning to grid search over.
    if backtest_this_strategy == 'smac' or backtest_this_strategy == 'emac':
        train_results = backtest(backtest_this_strategy, train_df, commission=commission_fee, init_cash=init_cash, fast_period=fast_period, slow_period=slow_period, 
                                    verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'rsi':
        train_results = backtest(backtest_this_strategy, train_df, commission=commission_fee, init_cash=init_cash, rsi_period=rsi_period, rsi_upper=rsi_upper, rsi_lower=rsi_lower, 
                                    verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'macd':
        train_results = backtest(backtest_this_strategy, train_df, commission=commission_fee, init_cash=init_cash, fast_period=fast_period, slow_period=slow_period, 
                                    signal_period=signal_period, sma_period=sma_period, dir_period=dir_period, verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, 
                                    sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'bbands':
        train_results = backtest(backtest_this_strategy, train_df, commission=commission_fee, init_cash=init_cash, period=period, devfactor=devfactor, verbose=verbose, 
                                    plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'buynhold':
        train_results = backtest(backtest_this_strategy, train_df, commission=commission_fee, init_cash=init_cash, verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, 
                                    sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'sentiment': # (needs more testing)
        train_results = backtest(backtest_this_strategy, train_df, commission=commission_fee, init_cash=init_cash, senti=senti, sentiments=sentiments, verbose=verbose, plot=show_plots, 
                                    buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)
    elif backtest_this_strategy == 'multi':
        train_results = backtest(backtest_this_strategy, train_df, commission=commission_fee, init_cash=init_cash, strats=strats_opt, verbose=verbose, plot=show_plots, 
                                    buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)
    elif backtest_this_strategy == 'psar':
        train_results = backtest(backtest_this_strategy, train_df, data_format=price_to_download, commission=commission_fee, init_cash=init_cash, period=psar_period, af=af, 
                                    afmax=afmax, verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)

    # If rows of the training results table have NaNs due to param crossovers, drop them
    # so the top row holds real numbers. (Comparing with `== np.nan` is always False.)
    train_results = train_results.dropna(subset=['pnl', 'rnorm'])

    print('train_results :')
    print(train_results)

    ###############################
    # TEST
    ###############################
    # Start the testing session. Depending on which strategy is called for, use the appropriate parameter ranges defined at the
    # beginning of the script to grid search over.
    if backtest_this_strategy == 'smac' or backtest_this_strategy == 'emac':
        test_results = backtest(backtest_this_strategy, test_df, commission=commission_fee, init_cash=init_cash, fast_period=fast_period, slow_period=slow_period, 
                                verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'rsi':
        test_results = backtest(backtest_this_strategy, test_df, commission=commission_fee, init_cash=init_cash, rsi_period=rsi_period, rsi_upper=rsi_upper, rsi_lower=rsi_lower, 
                                verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'macd':
        test_results = backtest(backtest_this_strategy, test_df, commission=commission_fee, init_cash=init_cash, fast_period=fast_period, slow_period=slow_period, 
                                signal_period=signal_period, sma_period=sma_period, dir_period=dir_period, verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'bbands':
        test_results = backtest(backtest_this_strategy, test_df, commission=commission_fee, init_cash=init_cash, period=period, devfactor=devfactor, verbose=verbose, plot=show_plots, 
                                buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'buynhold':
        test_results = backtest(backtest_this_strategy, test_df, commission=commission_fee, init_cash=init_cash, verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, 
                                sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'sentiment': # (needs more testing)
        test_results = backtest(backtest_this_strategy, test_df, commission=commission_fee, init_cash=init_cash, senti=senti, sentiments=sentiments, verbose=verbose, plot=show_plots, 
                                buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) # keyword=, page_nums=,
    elif backtest_this_strategy == 'multi':
        test_results = backtest(backtest_this_strategy, test_df, commission=commission_fee, init_cash=init_cash, strats=strats_opt, verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, 
                                sell_prop=sell_prop_perc)
    elif backtest_this_strategy == 'psar':
        test_results = backtest(backtest_this_strategy, test_df, data_format=price_to_download, commission=commission_fee, init_cash=init_cash, period=psar_period, af=af, afmax=afmax, 
                                verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)

    print('test_results :')
    print(test_results)

    ##############################################
    # COMBINE TRAIN AND TEST
    ##############################################
    # Here is where we get the average pnl and rnorm between the train and test runs for each
    # parameter combination. This isn't an optimal way of doing it, but it does in a sense
    # return the best averaged param combinations between the two runs. Feel free to adjust
    # this however you want.
    # First, let's create a combined dataset with some typical metrics used during training run.
    collective_train_test_results = pd.DataFrame()
    collective_train_test_results['buy_prop'] = train_results['buy_prop']
    collective_train_test_results['sell_prop'] = train_results['sell_prop']
    collective_train_test_results['rnorm'] = train_results['rnorm']
    collective_train_test_results['pnl'] = train_results['pnl']

    # Find the best params by averaging the test profits and train profits together for each combination of strategy params.
    # Again, not the most efficient way of doing it, but basically, search the test dataset for matching params, then average
    # the 'pnl' and 'rnorm' columns together.

    # MA CROSSOVERS
    if backtest_this_strategy == 'smac' or backtest_this_strategy == 'emac':
        # Strategy params go here
        collective_train_test_results['fast_period'] = train_results['fast_period']
        collective_train_test_results['slow_period'] = train_results['slow_period']
        for j in range(len(test_results)):
            for k in range(len(collective_train_test_results)):
                # ...and here
                if (test_results['fast_period'].iloc[j] == collective_train_test_results['fast_period'].iloc[k]) and (test_results['slow_period'].iloc[j] == collective_train_test_results['slow_period'].iloc[k]):
                    # Get average PnL and RNorm between training and testing runs for matching strat params
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('rnorm')] = (float(test_results['rnorm'].iloc[j]) + float(collective_train_test_results['rnorm'].iloc[k])) / 2
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('pnl')] = (float(test_results['pnl'].iloc[j]) + float(collective_train_test_results['pnl'].iloc[k])) / 2
                    break
    # MACD
    elif backtest_this_strategy == 'macd':
        # Strategy params go here
        collective_train_test_results['fast_period'] = train_results['fast_period']
        collective_train_test_results['slow_period'] = train_results['slow_period']
        collective_train_test_results['signal_period'] = train_results['signal_period']
        collective_train_test_results['sma_period'] = train_results['sma_period']
        collective_train_test_results['dir_period'] = train_results['dir_period']
        for j in range(len(test_results)):
            for k in range(len(collective_train_test_results)):
                # ...and here
                if (test_results['fast_period'].iloc[j] == collective_train_test_results['fast_period'].iloc[k]) and (test_results['slow_period'].iloc[j] == collective_train_test_results['slow_period'].iloc[k]) and (test_results['signal_period'].iloc[j] == collective_train_test_results['signal_period'].iloc[k]) and (test_results['sma_period'].iloc[j] == collective_train_test_results['sma_period'].iloc[k]) and (test_results['dir_period'].iloc[j] == collective_train_test_results['dir_period'].iloc[k]):
                    # Get average PnL and RNorm between training and testing runs for matching strat params
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('rnorm')] = (float(test_results['rnorm'].iloc[j]) + float(collective_train_test_results['rnorm'].iloc[k])) / 2
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('pnl')] = (float(test_results['pnl'].iloc[j]) + float(collective_train_test_results['pnl'].iloc[k])) / 2
                    break
    # RSI
    elif backtest_this_strategy == 'rsi':
        # Strategy params go here
        collective_train_test_results['rsi_period'] = train_results['rsi_period']
        collective_train_test_results['rsi_upper'] = train_results['rsi_upper']
        collective_train_test_results['rsi_lower'] = train_results['rsi_lower']
        for j in range(len(test_results)):
            for k in range(len(collective_train_test_results)):
                # ...and here
                if (test_results['rsi_period'].iloc[j] == collective_train_test_results['rsi_period'].iloc[k]) and (test_results['rsi_upper'].iloc[j] == collective_train_test_results['rsi_upper'].iloc[k]) and (test_results['rsi_lower'].iloc[j] == collective_train_test_results['rsi_lower'].iloc[k]) :
                    # Get average PnL and RNorm between training and testing runs for matching strat params
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('rnorm')] = (float(test_results['rnorm'].iloc[j]) + float(collective_train_test_results['rnorm'].iloc[k])) / 2
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('pnl')] = (float(test_results['pnl'].iloc[j]) + float(collective_train_test_results['pnl'].iloc[k])) / 2
                    break
    # BBANDS
    elif backtest_this_strategy == 'bbands':
        # Strategy params go here
        collective_train_test_results['period'] = train_results['period']
        collective_train_test_results['devfactor'] = train_results['devfactor']
        for j in range(len(test_results)):
            for k in range(len(collective_train_test_results)):
                # ...and here
                if (test_results['period'].iloc[j] == collective_train_test_results['period'].iloc[k]) and (test_results['devfactor'].iloc[j] == collective_train_test_results['devfactor'].iloc[k]) :
                    # Get average PnL and RNorm between training and testing runs for matching strat params
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('rnorm')] = (float(test_results['rnorm'].iloc[j]) + float(collective_train_test_results['rnorm'].iloc[k])) / 2
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('pnl')] = (float(test_results['pnl'].iloc[j]) + float(collective_train_test_results['pnl'].iloc[k])) / 2
                    break

    # BUY AND HOLD (Doesn't have any params to combine)

    # SENTIMENT (needs more testing)
    elif backtest_this_strategy == 'sentiment':
        # Strategy params go here
        collective_train_test_results['senti'] = train_results['senti']
        # collective_train_test_results['keyword'] = train_results['keyword']
        # collective_train_test_results['page_nums'] = train_results['page_nums']
        for j in range(len(test_results)):
            for k in range(len(collective_train_test_results)):
                # ...and here
                if (test_results['senti'].iloc[j] == collective_train_test_results['senti'].iloc[k]) : # and (test_results['keyword'].iloc[j] == collective_train_test_results['keyword'].iloc[k]) and (test_results['page_nums'].iloc[j] == collective_train_test_results['page_nums'].iloc[k]) :
                    # Get average PnL and RNorm between training and testing runs for matching strat params
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('rnorm')] = (float(test_results['rnorm'].iloc[j]) + float(collective_train_test_results['rnorm'].iloc[k])) / 2
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('pnl')] = (float(test_results['pnl'].iloc[j]) + float(collective_train_test_results['pnl'].iloc[k])) / 2
                    break

    # MULTI
    elif backtest_this_strategy == 'multi':
        # Strategy params go here
        collective_train_test_results['strats'] = train_results['strats']
        for j in range(len(test_results)):
            for k in range(len(collective_train_test_results)):
                # ...and here
                if (test_results['strats'].iloc[j] == collective_train_test_results['strats'].iloc[k]) :
                    # Get average PnL and RNorm between training and testing runs for matching strat params
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('rnorm')] = (float(test_results['rnorm'].iloc[j]) + float(collective_train_test_results['rnorm'].iloc[k])) / 2
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('pnl')] = (float(test_results['pnl'].iloc[j]) + float(collective_train_test_results['pnl'].iloc[k])) / 2
                    break

    # PSAR (custom made locally, again, see here: https://github.com/enzoampil/fastquant/issues/175)
    elif backtest_this_strategy == 'psar':
        # Strategy params go here
        collective_train_test_results['period'] = train_results['period']
        collective_train_test_results['af'] = train_results['af']
        collective_train_test_results['afmax'] = train_results['afmax']
        for j in range(len(test_results)):
            for k in range(len(collective_train_test_results)):
                # ...and here
                if (test_results['period'].iloc[j] == collective_train_test_results['period'].iloc[k]) and (test_results['af'].iloc[j] == collective_train_test_results['af'].iloc[k]) and (test_results['afmax'].iloc[j] == collective_train_test_results['afmax'].iloc[k]) :
                    # Get average PnL and RNorm between training and testing runs for matching strat params
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('rnorm')] = (float(test_results['rnorm'].iloc[j]) + float(collective_train_test_results['rnorm'].iloc[k])) / 2
                    collective_train_test_results.iloc[k, collective_train_test_results.columns.get_loc('pnl')] = (float(test_results['pnl'].iloc[j]) + float(collective_train_test_results['pnl'].iloc[k])) / 2
                    break

    collective_train_test_results = collective_train_test_results.sort_values(by=[best_metric], ascending=False)
    # or if you want to sort by PnL instead:
    #collective_train_test_results = collective_train_test_results.sort_values(by=['pnl'], ascending=False)
    print('collective_train_test_results :')
    print(collective_train_test_results)

    ###############################
    # VALIDATE - (by using the best params found above (based on average rnorm by default) 
    #             on a forward/validation set to see how much 'would have been' profited)
    ###############################
    if backtest_this_strategy == 'smac' or backtest_this_strategy == 'emac':
        valid_results = backtest(backtest_this_strategy, valid_df, commission=commission_fee, init_cash=init_cash, fast_period=collective_train_test_results['fast_period'].iloc[0], 
                                slow_period=collective_train_test_results['slow_period'].iloc[0], verbose=verbose, plot=show_plots, buy_prop=collective_train_test_results['buy_prop'].iloc[0], 
                                sell_prop=collective_train_test_results['sell_prop'].iloc[0])
    elif backtest_this_strategy == 'macd':
        valid_results = backtest(backtest_this_strategy, valid_df, commission=commission_fee, init_cash=init_cash, fast_period=collective_train_test_results['fast_period'].iloc[0], 
                                slow_period=collective_train_test_results['slow_period'].iloc[0], signal_period=collective_train_test_results['signal_period'].iloc[0], 
                                sma_period=collective_train_test_results['sma_period'].iloc[0], dir_period=collective_train_test_results['dir_period'].iloc[0], verbose=verbose, plot=show_plots, 
                                buy_prop=collective_train_test_results['buy_prop'].iloc[0], sell_prop=collective_train_test_results['sell_prop'].iloc[0])
    elif backtest_this_strategy == 'rsi':
        valid_results = backtest(backtest_this_strategy, valid_df, commission=commission_fee, init_cash=init_cash, rsi_period=collective_train_test_results['rsi_period'].iloc[0], 
                                rsi_upper=collective_train_test_results['rsi_upper'].iloc[0], rsi_lower=collective_train_test_results['rsi_lower'].iloc[0], verbose=verbose, 
                                plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'bbands':
        valid_results = backtest(backtest_this_strategy, valid_df, commission=commission_fee, init_cash=init_cash, period=collective_train_test_results['period'].iloc[0], 
                                devfactor=collective_train_test_results['devfactor'].iloc[0], verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'buynhold':
        valid_results = backtest(backtest_this_strategy, valid_df, commission=commission_fee, init_cash=init_cash, verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, 
                                sell_prop=sell_prop_perc) 
    elif backtest_this_strategy == 'sentiment':
        valid_results = backtest(backtest_this_strategy, valid_df, commission=commission_fee, init_cash=init_cash, senti=collective_train_test_results['senti'].iloc[0], # keyword=collective_train_test_results['keyword'].iloc[0], page_nums=collective_train_test_results['page_nums'].iloc[0],
                                verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)
    elif backtest_this_strategy == 'multi':
        valid_results = backtest(backtest_this_strategy, valid_df, commission=commission_fee, init_cash=init_cash, strats=collective_train_test_results['strats'].iloc[0], 
                                verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)
    elif backtest_this_strategy == 'psar':
        valid_results = backtest(backtest_this_strategy, valid_df, data_format=price_to_download, commission=commission_fee, init_cash=init_cash, 
                                period=collective_train_test_results['period'].iloc[0], af=collective_train_test_results['af'].iloc[0], afmax=collective_train_test_results['afmax'].iloc[0], 
                                verbose=verbose, plot=show_plots, buy_prop=buy_prop_perc, sell_prop=sell_prop_perc)

    print('valid_results :')
    print(valid_results)

    # Append our validation profits for this stock to the list for later summing
    valid_profits.append(float(valid_results['pnl'].iloc[0]))

    print('Sum of valid_profits (SO FAR) is :')
    print(sum(valid_profits))

print('==================================================================================')
print('valid_profits list is...')
print(valid_profits)
print('==================================================================================')
print('Validation period was from ' + str(valid_df.index[0]) + ' to ' + str(valid_df.index[-1]))
print('==================================================================================')
print('Final validation period profit (of the ' + str(len(valid_profits)) + ' stocks successfully downloaded datasets for) was :')
print(sum(valid_profits))
print('==================================================================================')
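
As a side note, the per-strategy nested loops in the "COMBINE TRAIN AND TEST" step could probably be collapsed into a single pandas merge. The following is only a rough sketch, not part of the script above: `average_train_test` and its `param_cols` argument are hypothetical names, and it assumes each parameter combination appears at most once in each results table. Rows that only exist in the training results keep their training values, as in the loop version.

```python
import pandas as pd


def average_train_test(train_results, test_results, param_cols, metrics=("rnorm", "pnl")):
    """Average each metric across train and test rows that share the same params.

    Hypothetical helper: assumes every parameter combination is unique
    within each results table.
    """
    metrics = list(metrics)
    # Left-join on the parameter columns; the overlapping metric columns
    # come back as <metric>_train and <metric>_test.
    merged = train_results.merge(
        test_results[param_cols + metrics],
        on=param_cols,
        how="left",
        suffixes=("_train", "_test"),
    )
    for m in metrics:
        train_vals = merged[f"{m}_train"].astype(float)
        test_vals = merged[f"{m}_test"].astype(float)
        # Average where a test value exists, otherwise keep the train value.
        merged[m] = ((train_vals + test_vals) / 2).where(test_vals.notna(), train_vals)
        merged = merged.drop(columns=[f"{m}_train", f"{m}_test"])
    return merged


# Toy results tables standing in for the output of backtest()
train = pd.DataFrame({"fast_period": [5, 10], "slow_period": [20, 30],
                      "rnorm": [0.25, 0.5], "pnl": [100.0, 200.0]})
test = pd.DataFrame({"fast_period": [10, 5], "slow_period": [30, 20],
                     "rnorm": [0.75, 0.25], "pnl": [0.0, 300.0]})

combined = average_train_test(train, test, ["fast_period", "slow_period"])
print(combined.sort_values(by="rnorm", ascending=False))
```

With something like this, each strategy branch would only need to supply its own list of parameter columns (for example `['period', 'af', 'afmax']` for PSAR) instead of repeating the loop.
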
enzoampil commented 4 years ago

Thanks heaps @windowshopr ! I can add these to the example scripts and will make sure to reference you and this issue in the notebook :smile:

BTW this also seems to be a great example of how to do a train test split with backtesting on fastquant. Awesome, I see a lot of people getting value from this :grin:
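
For anyone who wants to try the same pattern on their own data, here is a minimal sketch of a chronological train/test/validation split. It is not taken from the script above: the `chronological_split` name and the 60/20/20 fractions are assumptions for illustration, and it expects a DataFrame indexed by date.

```python
import pandas as pd


def chronological_split(df, train_frac=0.6, test_frac=0.2):
    """Split a date-indexed DataFrame into train, test and validation slices.

    Hypothetical helper: the oldest rows go to train, the next block to
    test, and whatever remains (the most recent rows) to validation.
    """
    df = df.sort_index()
    train_end = int(len(df) * train_frac)
    test_end = train_end + int(len(df) * test_frac)
    return df.iloc[:train_end], df.iloc[train_end:test_end], df.iloc[test_end:]


# Toy price series standing in for a downloaded stock DataFrame
prices = pd.DataFrame(
    {"close": range(100, 110)},
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)

train_df, test_df, valid_df = chronological_split(prices)
print(len(train_df), len(test_df), len(valid_df))
```

Keeping the slices in time order means the validation period always comes after the data used to pick the parameters, which is what makes the "would have been" profit figure meaningful.
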

Hope to see you again soon!