How much historical data to use?

tommedema commented 1 year ago

Great library, I love the pseudocode and python implementation.

I read the paper at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3892335 but cannot figure out what the recommended number of historical days to use is. Is 3 enough, a month, a year? Is it possible to use too much data if I just want to estimate today's spread?

eguidotti commented 1 year ago

Hi @tommedema, thanks for asking!

The estimator estimates the average spread in the estimation period. So if you use the last 3 days, you get the average spread in the last 3 days. If you use one month or one year, you get the average spread in the month or in the year. Of course, the less data you use, the less precise the estimate will be, as the spread would be computed from fewer observations.

Put simply: each OHLC (open, high, low, close) candle is one observation. And you need at least a few observations to get a sensible estimate. So if you are using daily prices, you need at least a few days. If you are using minute prices, you need at least a few minutes.

Assuming you are using daily prices, 3 days would give you estimates closer to the spread in the last days but with large estimation uncertainty. Using the last year would give you more precise estimates, but for the average spread in the last year (which may not be your goal). I would say that using one month of daily data is pretty common in the literature, but ultimately there is no one-size-fits-all solution.

If you have intraday prices, best option would be using the estimator with all prices in a day to get the average spread in that day.

Hope this helps!

tommedema commented 1 year ago

@eguidotti great answer, thanks for explaining it so well.

DominusMaximus commented 1 year ago

Hi Mr. Guidotti. I found myself with the same question as tommedema. I would love to see a table showing the mean errors of daily estimates when calculated from a single day's OHLC (the previous day's), up to a month. Or if you could at least tell us which period happens to result in the least amount of error. It would be a lot more satisfying than using the recommendation of what's "pretty common in the literature", considering the results of your excellent estimator are not common in this literature!

eguidotti commented 1 year ago

Hi @DominusMaximus, thanks for reaching out. I'm working on a new version of the paper and will try to include a table like the one you suggest. I'm re-opening this issue and will post here when the update is available. Thanks for your idea!

DominusMaximus commented 1 year ago

Awesome.

impredicative commented 9 months ago

Currently the package returns a single value. As a user, the interface we need is where the input is an OHLC matrix (four columns), and the output is a Bid Ask matrix (two columns). Essentially, the Bid and Ask are to sequentially be estimated for each row, also without ever looking at any subsequent row.

As per a prior comment:

The estimator estimates the average spread in the estimation period. So if you use the last 3 days, you get the average spread in the last 3 days.

As a user, ideally I don't want the average spread across the estimation period. Instead I want the estimated rolling value(s) for each row of the period.

eguidotti commented 9 months ago

@impredicative As a user, you are free to apply the estimator using a rolling window or any subsample that you like. The estimator does need at least 3 rows to output a spread estimate. And the output is the average (more precisely, the root-mean-squared) spread within the estimation period. A value of 0.01 corresponds to a spread of 1%.

If you need bid and ask prices for each single row in the period, than you need quote data and not this package.

impredicative commented 9 months ago

I got this working with Pandas with a rolling window:

from typing import Optional

import bidask
import pandas as pd

_WINDOW_DEFAULT = 5  # Note: Lower value of 3 resulted in NaN estimates for SQQQ.

def _estimate_spread(ser: pd.Series, df: pd.DataFrame) -> float:
    df_roll = df.loc[ser.index]
    return bidask.edge(df_roll['open'], df_roll['high'], df_roll['low'], df_roll['close'])

def add_estimated_bid_and_ask(df: pd.DataFrame, /, *, source: str = 'close', min_spread: Optional[float] = 0.01, **kwargs) -> None:
    if 'window' not in kwargs:
        kwargs['window'] = _WINDOW_DEFAULT
    assert kwargs['window'] >= 3  # bidask requires it.
    est_spread_fraction = df['close'].rolling(**kwargs).apply(_estimate_spread, args=(df,))  # Depending on use case, can specify `center=True` for rolling.
    df[f'est_{source}_spread'] = df[source] * est_spread_fraction
    if min_spread is not None:
        df[f'est_{source}_spread'].clip(lower=min_spread, inplace=True)
    est_close_spread_half = df[f'est_{source}_spread'] / 2
    df[f'est_{source}_bid'] = df[source] - est_close_spread_half  # Sell price
    df[f'est_{source}_bid'].clip(lower=0, inplace=True)
    df[f'est_{source}_ask'] = df[source] + est_close_spread_half  # Buy price

add_estimated_bid_and_ask(df)

eguidotti commented 9 months ago

Sure, you can apply edge like that to compute rolling estimates in python. However, note that the way you compute bids and asks is most likely to be incorrect because the closing price is typically either a bid or an ask and not the midpoint. At the minimum, you should use the average of the high and low prices as the reference price to round by half the spread (i.e., your source parameter). That should give an idea of the bid and ask prices at some point in time. However, this is only a very heuristic approach and I would avoid doing that whenever possible. Indeed, edge is designed to estimate the effective spread and not bid and ask prices. See also https://github.com/eguidotti/bidask/issues/5

impredicative commented 9 months ago

At the minimum, you should use the average of the high and low prices as the reference price to round by half the spread (i.e., your source parameter).

To me this is unclear and ambiguous, leaving much room for interpretation. Code will help.

edge is designed to estimate the effective spread and not bid and ask prices.

The package name is bidask. The package should live up to its name. When backtesting, all three matter: bid, ask, spread.

eguidotti commented 9 months ago

That doesn’t seem related to any issue with the package itself. Feel free to open another issue and clarify your question and I’d be happy to help

eguidotti commented 6 months ago

@DominusMaximus The new version of the paper is now available at SSRN!

Regarding your question:

I would love to see a table showing the mean errors of daily estimates when calculated from a single day's OHLC (the previous day's), up to a month. Or if you could at least tell us which period happens to result in the least amount of error. It would be a lot more satisfying than using the recommendation of what's "pretty common in the literature", considering the results of your excellent estimator are not common in this literature!

One option would be to proceed as follows. First, apply the estimator on the full time series to get a first estimate. Then, compute the standard deviation of the returns to get an estimate of volatility. You should also have an idea of how many times the asset is traded per day. At this point you can simulate a price process with these parameters to find a good estimation window to use.

Below, you can find a code snippet that simulates a price process with spread 1% and volatility 3% for an asset that is traded once per minute (390 times per day). The code then estimates the spread with several estimation windows and plots the bias, variance, and root mean squared error (rmse) of the estimates.

One month of trading corresponds to 21 days and the figures show that it is a reasonable default. Indeed, the estimation window must be increased a lot to get any substancial reduction in the estimation error. On the other side, the error explodes when the estimation window is too short.

Ultimately, the optimal estimation window depends on the spread, volatility, and trading frequency, and it is specific to the use case. I hope the code below helps to play with these parameters and better understand how much data to use for estimation:

library(bidask)
set.seed(123)

spr <- 0.01  # 1% bid-ask spread
vol <- 0.03  # 3% daily volatility
trd <- 390   # 390 trades per day

# Simulate 10000 open/high/low/close daily prices
x <- sim(n = 10000, trades = trd, spread = spr, volatility = vol)

# Estimate the spread with several estimation windows
metrics <- sapply(3:252, function(width){
  s <- spread(x, width)
  c(
    'width' = width,
    'bias' = mean(s - spr),
    'variance' = var(s - spr),
    'rmse' = sqrt(mean((s - spr)^2))
  )  
})

# Plot the results
plot(x = metrics['width',], y = metrics['bias',], log = "x", main = "Bias", xlab = "Estimation window (days)", ylab = "Bias")
plot(x = metrics['width',], y = metrics['variance',], log = "x", main = "Variance", xlab = "Estimation window (days)", ylab = "Variance")
plot(x = metrics['width',], y = metrics['rmse',], log = "x", main = "Root Mean Squared Error", xlab = "Estimation window (days)", ylab = "RMSE")

Rplot02 Rplot01 Rplot

eguidotti / bidask

How much historical data to use? #2