Bidask is often estimated at zero?

tommedema commented 2 years ago

Take this example Python code:

import numpy as np
from bidask import edge

o = np.array([29.926592, 28.242779, 27.170715, 27.953915, 26.39432, 26.888105,
       26.966768, 27.62503, 27.7497, 28.776056, 29.558401, 29.762136,
       30.09086, 29.762465, 29.762665, 30.051258, 29.926998, 29.847376,
       27.624899, 27.624887, 30.505402, 30.419634, 30.709461, 30.670303,
       31.242111, 32.847298, 35.681507, 41.43737 , 49.987385, 52.289425],
      dtype=np.float32)
h = np.array([32.886364, 28.282244, 28.94, 28.69716, 27.584791, 27.295898,
       27.913893, 28.243303, 28.571865, 29.59823, 30.051699, 30.54483 ,
       30.255291, 30.012405, 30.216503, 30.215687, 30.420301, 29.847376,
       27.828798, 31.5713, 30.992119, 30.87346, 30.709461, 31.282,
       32.886433, 34.98494, 40.45, 52.00061 , 52.618298, 52.578827],
      dtype=np.float32)
l = np.array([26.309092, 26.184095, 26.348557, 26.474, 26.308815, 26.802599,
       26.67737, 26.927826, 27.545805, 28.407722, 29.144032, 29.722672,
       29.14374, 29.473063, 29.762665, 29.433, 29.190334, 27.663742,
       26.967163, 27.624887, 29.887136, 30.340708, 30.255629, 30.459827,
       31.202648, 32.597363, 35.47761, 41.397907, 48.836357, 50.151806],
      dtype=np.float32)
c = np.array([28.94, 26.559, 28.94, 26.474, 27.006, 27.092, 27.71, 27.75 ,
       28.243, 29.026, 29.269, 30.091, 29.762, 29.723, 30.131, 29.433,
       29.927, 28.282, 27.421, 31.078, 30.216, 30.834, 30.67, 31.282,
       32.722, 34.735, 40.45, 47.357, 52.125, 50.849], dtype=np.float32)

spread = edge(o, h, l, c)

print(spread)

assert spread != 0

The spread here is estimated at zero, which seems unlikely. This is real data: 30 days of prices for a stock I picked at random (split-adjusted prices for ticker A, with the last entry on 1999-12-31).

tommedema commented 2 years ago

When applying a rolling window of 30 days to this stock, you can see that zero estimates are very common.
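A rolling computation of this kind could be sketched roughly as follows. The helper names `rolling_estimates` and `zero_share` are hypothetical and not part of the bidask package; the estimator is passed in as a callable so that `edge` (called as `edge(o, h, l, c)`, as in the example above) can be plugged in.

```python
from typing import Callable, List, Optional, Sequence


def rolling_estimates(
    o: Sequence[float],
    h: Sequence[float],
    l: Sequence[float],
    c: Sequence[float],
    estimator: Callable[..., float],
    window: int = 30,
) -> List[Optional[float]]:
    """Apply `estimator` to each trailing window of `window` observations.

    Entries are None until a full window is available.
    """
    n = len(c)
    out: List[Optional[float]] = [None] * n
    for end in range(window, n + 1):
        start = end - window
        out[end - 1] = estimator(
            o[start:end], h[start:end], l[start:end], c[start:end]
        )
    return out


def zero_share(estimates: Sequence[Optional[float]]) -> float:
    """Fraction of the available (non-None) estimates that are exactly zero."""
    valid = [e for e in estimates if e is not None]
    if not valid:
        return float("nan")
    return sum(e == 0 for e in valid) / len(valid)
```

With the arrays from the first comment, `zero_share(rolling_estimates(o, h, l, c, edge, window=30))` would give the share of windows whose estimate is zero.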

eguidotti commented 2 years ago

Yes, the issue is that the estimator is technically an estimator of the squared spread, which may become negative in finite samples. As we cannot take the square root of a negative value, we reset negative estimates to zero. This is a common issue in the literature. The good news is that edge produces fewer zero estimates than other methods (although, as you correctly observed, zero estimates can still be very common in some cases).
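As a minimal sketch of the reset described here (the function name `spread_from_squared` and the input values are illustrative assumptions, not taken from the package):

```python
import math


def spread_from_squared(s2: float) -> float:
    """Square root of a squared-spread estimate, with negatives reset to zero."""
    # A negative estimate of the squared spread has no real square root,
    # so it is truncated at zero before the root is taken.
    return math.sqrt(max(0.0, s2))


# Illustrative values only: one positive and one negative squared-spread estimate.
positive_case = spread_from_squared(0.25)   # ordinary square root
negative_case = spread_from_squared(-0.25)  # truncated to zero
print(positive_case, negative_case)
```

Any sample in which noise pushes the squared-spread estimate below zero therefore shows up as a spread of exactly zero.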

Please have a look at the paper:

Section 1.2.1
Tables 3 & 4

tommedema commented 2 years ago

I see, thank you. For now I am dynamically adjusting the date range until a non-zero value exists. If that's not possible, I forward fill from prior estimates.

eguidotti commented 2 years ago

Mmm that's a bit dangerous because, on average, it would create an upward bias in the estimates. Depending on the use case, I would recommend the following:

1) If you are averaging the spreads somehow (e.g., average spread in a portfolio, or regression analyses), I would keep the zero estimates. Although they make little economic sense, they are more correct statistically. Indeed, this option reduces the upward bias that you would have by imposing a positive spread estimate. So the final results of the use case should be more correct.

2) If you are interested in point estimates (e.g., best guess of the spread of a stock in a month, conditional on a positive estimate), then I would take the absolute value of the (negative) spread estimate instead of resetting it to zero. I found this option to work quite well in some preliminary studies on the US stock market, although it is too early to release it officially. To do that in Python, do not use the bidask package on PyPI. Instead, copy/paste this function in your code and replace the final line:

return float(max(0, s2) ** 0.5)

with:

return abs(s2) ** 0.5
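To make the two options concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the helper `to_spread`, the true spread of 1%, and the Gaussian noise around the squared spread are made up and do not come from the paper or the package. It only shows the ordering of the averages: discarding zero estimates can never lower the average relative to keeping them, and neither can the absolute-value variant.

```python
import math
import random
from statistics import fmean


def to_spread(s2: float, negative: str = "zero") -> float:
    """Convert a squared-spread estimate into a spread.

    negative="zero": reset negative estimates to zero (the original final line).
    negative="abs":  take sqrt(|s2|) instead (the replacement line in option 2).
    """
    if negative == "abs":
        return math.sqrt(abs(s2))
    return math.sqrt(max(0.0, s2))


random.seed(0)
true_spread = 0.01  # hypothetical true spread of 1%
# Hypothetical noisy estimates of the squared spread, centred on the true value.
s2_hats = [random.gauss(true_spread ** 2, 3e-4) for _ in range(10_000)]

keep_zeros = [to_spread(s2) for s2 in s2_hats]        # option 1: keep the zeros
drop_zeros = [s for s in keep_zeros if s > 0]         # skip windows that give zero
abs_value = [to_spread(s2, "abs") for s2 in s2_hats]  # option 2: absolute value

print(fmean(keep_zeros), fmean(drop_zeros), fmean(abs_value))
```

Skipping or replacing zero estimates (for example by changing the date range until a positive value appears) behaves like `drop_zeros` here, which is where the upward bias in averages comes from.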

Hope this helps!

tommedema commented 2 years ago

That makes sense, I'll go with option 2. Much appreciated!

eguidotti / bidask

Bidask is often estimated at zero? #3