X-DataInitiative / tick

Module for statistical learning, with a particular emphasis on time-dependent modelling
https://x-datainitiative.github.io/tick/
BSD 3-Clause "New" or "Revised" License

Gaps in Kernels based on HawkesConditionalLaw #193

Closed Jeanieiszo1 closed 6 years ago

Jeanieiszo1 commented 6 years ago

Hi,

For a study project, I am fitting high-frequency order book data to a Hawkes model in the same manner as Bacry (2015) and encountering the following problem: using the non-parametric estimation based on the conditional law gives discontinuous functions (as in the example 'Fit Hawkes power law kernels').

For my project, I need to compute intensities, and this requires evaluating the kernels at certain points, which is currently impossible. So my question is: where do these NaNs come from? Is it because the conditional laws already contain NaNs? Or is it caused by a quirk in the quadrature method? I tried to find the cause in the source code, but got a bit lost. My last question would be: can it easily be solved by some kind of interpolation?

Any help would be greatly appreciated! :) Thanks, Jean-Claude

mrambaldi commented 6 years ago

Hi Jean-Claude,

thank you for your interest in the tick library!

You're probably seeing “discontinuities” in the log-log plot. The reason is that your kernels take on negative values, so their log is NaN and matplotlib will not plot these points on a log-log scale. Indeed, the non-parametric estimation procedure that uses the conditional law can find negative-valued kernels. This is actually a powerful feature of the algorithm, as it allows modelling (moderate) inhibition effects.
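As a quick illustration (plain NumPy, not tick code; the toy kernel below is made up), negative values simply disappear from a log-log plot:

```python
import numpy as np

# Toy kernel estimate that dips below zero, as the conditional-law
# method can legitimately produce when there is inhibition.
t = np.logspace(-3, 1, 200)
phi = 0.5 * np.exp(-t) - 0.05          # negative for large t

with np.errstate(invalid="ignore"):
    log_phi = np.log10(phi)            # log of a negative value -> NaN

print(np.isnan(log_phi).sum(), "points would be missing from a log-log plot")
```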

The mathematical reason we can find negative kernels is as follows. For simplicity, let's stick to the 1D case. The first step of the procedure is to empirically estimate the conditional law

g(t) dt = E[ dN_t | dN_0 = 1 ] - Lambda dt

where N is the counting process, dN_0 = 1 means that an event occurred at time 0, and Lambda is the average intensity of the process, i.e. Lambda dt = E[ dN_t ].

Now, intuitively, if some inhibition effect is present for some t, we will have

E[ dN_t | dN_0 = 1 ] < Lambda dt

i.e. you expect fewer events than average, and thus g(t) < 0 for some t. It has been proved in [1] that this also implies that the kernel will take negative values for some t. See also [3] for all the details of the estimation method.

In practice with real data, one can find negative values of g(t) for 2 main reasons:

  1. There are genuine inhibition effects in the data. This is for example the case for the negative relation found in [2] between P+ and P+ events: since the mid-price is mean reverting, another P+ event is less likely just after a P+ event has taken place.

  2. Too fine a grid in the estimation of g(t). Real data come with a finite time resolution and, furthermore, we don't have an infinite amount of data. Estimating g(t) is basically like estimating a conditional probability density: you divide your domain into discrete bins and count how many events you have in each bin. If you take very small bins compared to the data resolution/amount, you run the risk that many of them will be empty or scarcely populated, so E[dN_t|dN_0] < Lambda and you get negative kernels.

To give you a concrete example, suppose your data come with 1s time resolution. This means that in your data no two timestamps will have a time distance below 1s. Now, if we estimate g(t) on a grid with millisecond bins, all the bins between 0 and 1s will be empty and thus g(t) will be negative on this domain.
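A toy illustration of this (plain NumPy, not tick code; the timestamps are made up) with 1 s resolution data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy event times with 1-second resolution: no two events are closer than 1 s.
timestamps = np.sort(rng.choice(np.arange(3600), size=900, replace=False)).astype(float)
Lambda = len(timestamps) / (timestamps[-1] - timestamps[0])   # average intensity, events/s

# All millisecond bins between 0 and 1 s after an event are necessarily empty:
dt = np.diff(timestamps)
print("smallest inter-event time:", dt.min())                 # >= 1.0 s
# so the empirical E[dN_t | dN_0] is 0 on (0, 1 s) and g(t) = 0 - Lambda < 0 there.
```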

This doesn’t mean that the actual kernels have gaps; it’s just that they take negative values. Here the cause is most likely point 2 above.

Some practical suggestions for use on high-frequency data

I would suggest the following:

  1. Take a look at the inter-event time distribution of your data (i.e. the differences t_i - t_{i-1}) to see what the typical time scales are. This will help you in setting the min_lag parameter. For financial high-frequency data you’ll probably have something like this (the x scale is the log of the inter-event times):

[figure: histogram of inter-event times, log scale on the x axis]

As you see there is a lot of structure up to about 1e-3 s, so we might want the linear grid to go up to there (min_lag = 1e-3). For this data I then took delta_lag between 1/50 and 1/10, depending on how many components the Hawkes process to fit has: the lower the dimension, the finer the grid (smaller delta_lag) can be. I also set max_lag to 250 - 1000.
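A quick way to look at this distribution (plain NumPy/matplotlib; `timestamps` here stands for the 1D array of event times of one component):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_inter_event_times(timestamps, bins=100):
    """Histogram of log10 of the inter-event times t_i - t_{i-1}."""
    dt = np.diff(np.sort(np.asarray(timestamps)))
    dt = dt[dt > 0]                       # drop exact duplicates
    plt.hist(np.log10(dt), bins=bins)
    plt.xlabel("log10 of inter-event time [s]")
    plt.ylabel("count")
    plt.show()
```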

Remember that the grid used for estimation of g(t) is given by:

0, min_lag * delta_lag, min_lag * 2 * delta_lag, …, min_lag, min_lag * exp(delta_lag), min_lag * exp(2 * delta_lag), …, max_lag

Thus a first rule of thumb would be to check that min_lag * delta_lag (the width of the first bins) is large enough so that you have a sufficient number of inter-event times in each bin; see the sketch below. In future versions of tick we are going to allow the user to pass the grid directly, so that it can be better adapted to their particular data.
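For instance, a small helper along these lines (my own sketch, not part of the tick API) builds this grid and checks the rule of thumb:

```python
import numpy as np

def claw_grid(min_lag, delta_lag, max_lag):
    """Linear spacing of step min_lag * delta_lag up to min_lag,
    then geometric spacing of ratio exp(delta_lag) up to max_lag."""
    linear_part = np.arange(0.0, min_lag, min_lag * delta_lag)
    n_log = int(np.floor(np.log(max_lag / min_lag) / delta_lag)) + 1
    log_part = min_lag * np.exp(delta_lag * np.arange(n_log))
    return np.concatenate([linear_part, log_part, [max_lag]])

def inter_event_times_below(timestamps, lag):
    """Rule-of-thumb check: how many inter-event times are shorter than `lag`."""
    dt = np.diff(np.sort(np.asarray(timestamps)))
    return int((dt < lag).sum())

# e.g. inter_event_times_below(timestamps, 1e-3 * 1/50) should not be tiny
```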

  2. Try first with a 1D process, as it will be much faster and you’ll have less noise; then, as you increase the dimension, adjust the parameters.

  3. Regarding the compute parameters, i.e. n_quad, min_support, and max_support: these determine the points at which the kernel values will be computed by discretising the Wiener-Hopf system. When using the 'log' scheme the grid is given by

0, min_support * eta, min_support * 2 * eta, …, min_support, min_support * exp(eta), min_support * exp(2 * eta), …, max_support

with

eta = ( log(max_support) - log(min_support) + 1) / n_quad

For the above data I used n_quad=120, max_support=1, min_support=5e-5. Note in particular that if you set a very large value for max_support (like hundreds of seconds for high-frequency financial data) you will invariably have a lot of noise in the tail of the kernels, which can become a problem and will lead to unstable values for the norm. I suggest starting with smaller values, like 1 second or less for high-frequency data. Also try lowering the number of quadrature points when you increase the dimension (like 80-100 for 8D).
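Putting the parameters above together, the learner setup would look roughly like this (the values are only the starting point quoted above for high-frequency data; adjust them to your own case):

```python
from tick.hawkes import HawkesConditionalLaw

hawkes_learner = HawkesConditionalLaw(
    claw_method="log",   # log-spaced grid beyond min_lag for the conditional law
    delta_lag=1 / 50,    # between 1/50 and 1/10 depending on the dimension
    min_lag=1e-3,        # end of the linear part of the g(t) grid
    max_lag=500,         # 250 - 1000 in the example above
    quad_method="log",   # log-spaced quadrature for the Wiener-Hopf system
    n_quad=120,          # lower it (80 - 100) for higher dimensions, e.g. 8D
    min_support=5e-5,
    max_support=1,       # keep this small to avoid noisy kernel tails
    n_threads=-1,        # use all available cores
)
```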

  4. If you have multiple realisations (e.g. multiple days) don’t forget to fit using
for real in realizations:
    hawkes_learner.incremental_fit(real, compute=False)
hawkes_learner.compute()

so that the code will correctly estimate g(t) by averaging over all realisations and compute the kernels just once at the end.

  5. Finally, it can sometimes help to plot both the conditional law and the kernels directly with matplotlib, using a semilogx plot.

To plot kernel (i, j):

plt.plot(hawkes_learner.kernels[i][j][0], hawkes_learner.kernels[i][j][1])
plt.semilogx()
plt.xlabel('t')
plt.ylabel('Kernel value')

To plot conditional law (i, j):

plt.plot(hawkes_learner._claw_X, hawkes_learner._claw1[i][j])
plt.semilogx()
plt.xlabel('t')
plt.ylabel('Claw value')
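If you want both side by side, a small helper using the same attributes as above (my own sketch):

```python
import matplotlib.pyplot as plt

def plot_kernel_and_claw(hawkes_learner, i, j):
    """Semilogx plots of kernel (i, j) and conditional law (i, j) side by side."""
    fig, (ax_k, ax_g) = plt.subplots(1, 2, figsize=(10, 4))
    ax_k.plot(hawkes_learner.kernels[i][j][0], hawkes_learner.kernels[i][j][1])
    ax_k.set_xscale("log")
    ax_k.set_xlabel("t")
    ax_k.set_ylabel("Kernel value")
    ax_g.plot(hawkes_learner._claw_X, hawkes_learner._claw1[i][j])
    ax_g.set_xscale("log")
    ax_g.set_xlabel("t")
    ax_g.set_ylabel("Claw value")
    fig.tight_layout()
    plt.show()
```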

References

[1] Rambaldi M, Bacry E, Lillo F. The role of volume in order book dynamics: a multivariate Hawkes process analysis. Quantitative Finance. 2017; 17(7):999-1020.

[2] Bacry E, Jaisson T, Muzy JF. Estimation of slowly decreasing Hawkes kernels: application to high-frequency order book dynamics. Quantitative Finance. 2016; 16(8):1179-1201.

[3] Bacry E, Muzy JF. First- and second-order statistics characterization of Hawkes processes and non-parametric estimation. IEEE Transactions on Information Theory. 2016; 62(4):2184-2202.