braverock / blotter

blotter provides transaction infrastructure for defining transactions, portfolios and accounts for trading systems and simulation. Provides portfolio support for multi-asset class and multi-currency portfolios. Actively maintained and developed.

114 stars 50 forks source link

Add Kissell-Malamut I-Star model #97

Closed vi-to closed 5 years ago

vi-to commented 5 years ago

The I-Star model framework, variables, its equations, model estimations, interpretations and other details are briefly discussed in the docs. Some theoretical implications are also touched on in #54, but they are not included in the documentation at the moment as are wider in scope than model/function usage. With sufficient data it would be interesting to carry experiments on those empirical findings and this could have a place in the vignette.

Irrespective of different plausible functional forms, the model version setting implemented in iStartPostTrade() is entirely based on historical "tic data". Several implicit assumptions on data are required to guarantee consistency of results. Such assumptions are often concerned with potential issues that professional users are likely not to encounter at all (e.g. availability of spreads data, access to reliable databases queries, etc.), however this code is meant for a larger public and thus aspects above or related ones are highlighted in the discussion. A description of the ideal dataset to input is in @details.

As usual, a crucial aspect is about dates, especially keeping in mind the large amount of data into play. Many checks and adjustments are already in place, to prevent or provide hints on potential irregularities that would result in inconsistent or biased quantities computations and analyses. Your close review on how dates are managed throughout all the dimensions involved is sought after. Furthermore, I would like to propose the timeDate package as a valuable dependency. It is not strictly necessary, but is conceived to work with financial time series and includes a wealth of useful functions that in this context I would use to infere business days within a given period and put in place further precise checks on input data to guarantee reliability.

A general comment I shall provide on code written is that vectorisation and looping lives together, sometimes sharing the work. Precedence was given to readability so far. If you can see simplifications that would help us having a good balance, please point it out and I will be happy to work on them.

In addition to review remarks, next commits will mainly address two necessary steps to have a minimum complete model implementation:

parameters estimation methods, first of all extending the nonlinear least squares code already in place and then evaluating if another couple of methods can be a suitable alternative;
use extended methods above to consequently extend Instantaneous impact and Market impact estimates in light of the data grouping carried;
enrich the docs and complete TODOs left there.

Also, there will be various revisions in order to best organize our output in view of allowing further analyses extensions (sensitivity analysis, error analysis, etc.) and their visualization.

Disclaimer: in accordance to transparency and diligence principles it is important to point out that the model has a MATLAB implementation (https://www.mathworks.com/help/trading/krg.istar.html), supported by Kissell - one of the authors of the model itself - and his research group. It is worth mentioning that this last implementation is not "free software" and therefore comparisons with respect to our GPLv3+ licensed distribution cannot be done. In other words, their source code seems to fall under the definition of "proprietary (nonfree) software", it lacks essential free software freedoms and cannot be studied as that source code is not publicly available (see https://www.gnu.org/philosophy/free-sw.html for reference). The present implementation has been realized starting from the cited references of the mentioned authors.

vi-to commented 5 years ago

Thank you @jaymon0703. Below some comments based on yours, they are only roughly ordered as often many aspects are involved at the same time.

On removing outliers there is a perceived difficulty with respect to what suggested by Kissell. It seems a bit vague part, but inclusion/exclusion criteria are provided and they are based on data-points rolling variables. Are we supposed to recompute the corresponding rolling variables, after having excluded these outliers data-points? It seems that they may well become time-inconsistent (for example, how do we get the first 30-days ADV if we excluded, say, days 23 and 27?).

Nonlinear models estimation is not easy on its own, bigger datasets than the one used for testing purposes can be helpful to reach parameters stability. Especially in light of the grouping, where small samples would likely conduce to nls() convergence failures. I would be happy to thoughtfully consider the intraday based data-points splitting with you and to work on it; however, in spite of the fact the author mentions it with respect to his dataset, variables are almost always referred to on a day-by-day basis. Even considering intraday periods, generally it is stated "three periods per day" (see "Number of data points"), whereas when it comes to apply them on the variables we suddenly face "morning" and "afternoon" intraday periods only (see for example the imbalance).

Table 5.6 seems to be underlying figure 5.7, basically the latter is a representation of the former. In this view, I suspect the y-axis "cost" is the market impact (estimated) cost.

jaymon0703 commented 5 years ago

Outliers: we will have to apply a new process for identifying and removing outliers before building the vectors of the relevant variables. This will mean there will be differing lengths of variables vectors between stocks. We should be able to handle that anyway, as the input data may include recently listed stocks, stocks that have delisted or been suspended or halted trading for one or more days, etc. I would suggest building ADV for each element in the input data list, then comparing total volume against 3xADV (and we could make outlier threshold a parameter, in our case for ADV the threshold is n=3) and storing the index of each item above the threshold. Once complete, remove the observations (an entire time period, like day or morning or afternoon etc) matching that index.

Splitting data: this is a "free option" for increasing the size of the input dataset so should be quite useful for finding more stable parameters. We will have to split the data before outliers are removed, as there may be outliers in Morning or Afternoon time periods that do not present as outliers in the Daily period. Once outliers are removed, we can build index of endpoints based on time (currently we do this for 'days' in line 99). The time can either be inferred from the data, computing duration of a full day in our case 8 hours and adding 4 hours to earliest hour to make 9-1 and adding 4hours again to get time period 1-5. It may be easier to allow user to specify the time splits, on which we base our subsetting for intraday Morning and Afternoon time periods.

Once we have the correct indexes for intraday time periods, we can compute secMktDataMorning and secMktDataAfternoon (or something equivalent) to capture total volume for the relevant time period as well as the close price in the time period. Arrival Price will need some time-based subsetting as well, using all the secMktData as we currently do for Arrival Price at the moment.

My hope is that by removing outliers and adding more data we can get more stable parameters. It may be worth adding more months of data to our current dataset of 5 months. Perhaps adding 9 months will give us at least 12 months of parameter training data. We will run into the data problem of stocks in the dataset not existing 12 months ago (due to new listings and ticker symbol changes). I can add this data next week. With the data we currently have, I have run the function for varying values of horizon, from 5 to 45 in increments of 5 and the parameter estimates and their corresponding p-values move around quite a bit. Once we remove outliers, add more data and time periods it will be interesting to see if these values vary by less. Of course, parameter sensitivity analysis will aid this process in helping us set appropriate parameter constraints, although measuring against non-linear R-squared seems less than desirable (https://www.researchgate.net/post/How_to_assess_goodness_of_fit_for_a_non-linear_model). Since the R-squared metric is not outputted with nls(), we will either have to compute it or use a different metric to assess goodness of fit. I may be completely wrong here, but using RMSE or MAE may be good for this purpose?

braverock commented 5 years ago

On time splitting, I agree this should be a user parameter.

Further, I think that the splits should be either an integer 2, 4', etc., in which case the function could compute the split, or they should be a vector of ISO Time subsets:

c('T09:00/T12:59','T13:00/T17:00')

Note that we will need to either handle in the docs, or write code for, the case where the user desires a subset that crosses midnight. The ISO standard does not define a time-only subset crossing two or more days, so this is not implemented in xts. The usual solution to this is to create a subest, e.g. ['T17:00/23:59'] and another subset ['T00:00/T07:59'] and then combine them via rbind. This will still cause issues for your calculations of ADV, of course, as sorting out what block to use can get quite tricky when you can't split on days.

vi-to commented 5 years ago

Outlier analysis

On the outliers analysis, I shall first of all notice a potential source of confusion: Kissell ("Solution technique - Research the Problem", ch. 5) reports "Third, we did not discard data points. All data points should be included and observed to fully understand the underlying data set and system at hand - including outliers.", which seems to be in contrast with "To avoid potential issues resulting from outliers we filtered our data points..." stated a few pages later. This said, even if we are to carry it, it feels like more precise specifications are needed. Should we compare the daily volume with an overall ADV, or should we compare each daily volume against the corresponding rolling ADV value? The latter does not seem feasible, as different ADVs could contrast on the outlier nature of the same data-point. Also, the former appear to have issues on its own: further analyses we implemented until now do not directly require data-points themselves, but rather the corresponding (rolling) variables, excluding some data-points (daily observations) means excluding their corresponding (rolling) variables value. Now, if we previously computed ADVs and want to keep them, their values are time-consistent but somewhat biased in that current outliers were included; on the other hand, if we recompute ADVs then they can (often) result to be time-inconsistent, because some daily observations are outliers being excluded. Besides, with respect to each security this will produce unpredictable lengths of (rolling) variables vectors and furthermore unpredictable samples compositions (or full data set composition), that is a security may be over/under represented in data used for the regression. In all above circumstances, I am unsure if working with variables produced this way is meaningful and I would like to understand if this is what we want to achieve in the first place. These concerns hold similarly for the other outlier criterion.

Data splitting (augmentation)

On splitting the data set on an intraday level, I would follow your indications. In this view, it looks like all the rolling and non-rolling variables involved needs to be rescaled. In this context, does close-to-close mean "from 'morning close' to 'afternoon close'"? In other words, are we meant to port everything on half-days? Midnight crossing times. With respect to cases @braverock posed, I was wondering if there are further explanations on the context a user would need to do it. Specifically, if this has something to do with market hours. I am imagining a first level splitting, after which endpoints() and its derived expressions would allow to obtain the variables by the new reference timeframe (e.g., morning and afternoon).

braverock commented 5 years ago

Outlier Filtering

This is a tricky problem.

In most Robust Statistics, the outliers are not removed, but rather shrunken towards the center of the distribution. See for example shrinkage methods, or ridge regression, which performs this type of shrinkage automatically.

When dealing with returns data, we have proposed a method which retains the structure of the data under evaluation in a manner which does not create look ahead or data snooping biases. See ?clean.boudt for a full explanation.

I would suggest two things. First, let's try to sort out what Kissell says they are doing, and specify and code that. Second, because discarding data does have some pretty serious problems, we should come up with a more robust method. I would suggest we adapt the methodology used in clean.boudt, and make that available.

Midnight Crossing Times

It all depends on your frame of reference. There is no time zone in the world where at least one major global market's hours of operation will not cross midnight, at least some of the time. GMT+2 comes closest, but even that will have some US markets opening at 23:00 (11pm) during daylight savings time or summer time mismatch hours a few weeks a year. Even in the local time zone, the CME market day is 17:00 to 16:00, with the second day being the settlement day and statement day. So 'Monday' markets pre-open at 16:00 Sunday, and Open at 17:00 Sunday. Another example is US markets from the viewpoint of a European observer. Primary US stock market hours open just before midnight in many European time zones. Most market data vendors will provide data in UTC/GMT, and most market data users with a global viewpoint will convert any data which does not arrive in GMT/UTC to a common reference time zone before storing it in any case. xts is very good at converting time zones, but it is also tricky to keep everything straight, especially if some calculation requires you to leave xts.

jaymon0703 commented 5 years ago

Regarding the outlier confusion you mention, important differentiation is the step in the Scientific Method and the data to which Kissell refers. When researching the problem, including all data, the data Kissell refers to is actual customer order data. In the hypothesis testing step in the scientific method, Kissell refers to excluding outliers in the data used for the hypothesis which is all publicly available tick data for the S&P1500.

As for implementation, you mention 2 concerns: 1. the actual implementation and 2. the inevitable differing lengths of observations per security.

1. Implementation

I think we will need to do your latter implementation. Having applied my mind to it briefly, i think it could work as a completely separate process. 1. Compute rolling ADV, compare with total volume on the same (or following day to prevent look-ahead bias) "timestamp". Store index of outliers, where index is the datetime timestamp in which outlier was observed. 2. Remove outliers. 3. Repeat above steps for Volatility. 4. Recompute ADV, Volatility and compute the remaining rolling variables.

2. Differing Lengths

This should not be a problem for the sake of the model, since we include variables as vectors in the nls() formula specification. We should also be able to handle differing numbers of observations per stock anyway given that some stocks will have less data (new listing, delisting, halted/suspended days etc).

On splitting data, close-to-close for Morning should be the prior Morning last observed price to current Morning last observed price. Same for Afternoon...last observed from prior day to current day last observed price in the Afternoon. This overnight element is important, as securities adjust to overnight events which is reflected in price and volume movements at the next day open, which will be captured by imbalance and VWAP, Arrival Cost etc.

The securities we are interested in for this project are most commonly used for agency execution algorithms and therefore stocks which trade on securities exchanges. No stock exchange is open 24 hours so i do not see this as a concern. Some exchanges such as the Stock Exchange of Hong Kong do have lunch breaks though (12pm-1pm), so allowing users to split data by ISO time would be useful to allow them to specify end of Morning session and start of Afternoon session.

EDIT: From @braverock and related to the comment above - "BATS, NYSE ARCA, and Nasdaq OMS are only closed from 17:00-18:00 NYC time the opening and closing auctions still exist, and retail orders are typically only allowed during the old floor hours to protect retail customers from overnight volatility and illiquidity, but the markets are open and trading"

I believe the data we are most interested in using for the modeling will be the core trading session times which should be most liquid for these venues, and is likely why Kissell uses 9am-4pm for the S&P1500. The data is from 2010, so its also possible that these were the only available opening time, and hence Kissell does not mention it? Nevertheless, for venue open times this reference may be useful - https://www.worldtimezone.com/markets24.php

vi-to commented 5 years ago

Outlier analysis

As usual, overfitting the model is another serious issue to keep into account in the trade-off. I fundamentally agree with @braverock 's view, an explicative example is from clean.boudt docs: "It is also important to note that the robust method proposed here does not remove data from the series, but only decreases the magnitude of the extreme events.". Also, although both the "implementation" and "differing lengths" are technically feasible (the latter is already allowed by current function structure), what I was really asking about is their validity and use throughout. As said, outlier classification criteria are only vaguely provided and as previously reported even if we interpret them in a way rather than another they seem to conduce to some sort of inconsistency (within the same variable or when it comes to build a regression data set). Because of the above, if we agree I would refrain from naive exclusions of data-points, as appears to be in contrast with the standard robust methods approach; if the outlier filtering has to be carried in the end, I would follow what Brian suggested and seek those kind of solutions.

Data splitting (augmentation)

Having the goal of enriching the data-set, why close-to-close should be from "morning" to "next-morning" and from "afternoon" to "next-afternoon"? I was expecting at least a half-days re-scaling, which would double the original data set in terms of data-points. Midnight Crossing Times. First of all, thank you for the further details on the plethora of markets one may be interested in estimating the model about. As crosses refer to markets business hours, it is understood the trickiness when comparisons involve variables registered/computed with respect to different solar days. A solution I can propose here is to introduce a convention and convert timestamps on a 24-hours scale to guarantee consistent comparisons, then re-index results based on original timestamps. Lastly, series aspects such as the timezone, are handled by users and I should clearly not interfere with them.

On how to proceed, I would postpone the outlier analysis in favor of data splitting to first of all check out parameters estimation trends under an augmented data set.

Corner cases

Although the author does not explicitly account for them, there could be many corner cases.

One example is security suspended trading and therefore market data absence, see comment on line 227. In cases this is temporary on the intraday level we should not experience any issue (intraday timestamps are deliberately left unconstrained), whereas in cases the security was suspended for many days in the period considered - or was delisted - it feels like it should be excluded from the analysis in the first place.
You also mentioned opening and closing auctions. About them I tend to @jaymon0703 's suggestion to follow Kissell and consider core trading sessions only. To motivate this I would like to add that auctions are reportedly made by Exchanges in order to account for demand-supply imbalances and have accurate pricing, they should be as a consequence already reflected in the actual opening quotes. Anyways, to my knowledge data provided for testing does not contain auctions.

jaymon0703 commented 5 years ago

Outlier Analysis

If we look at the reason Kissell specifies for excluding outliers, it is because it will skew the impact we are trying to model. From Kissell: "Filtering is commonly done on market impact data sets to avoid the effect of high price movement due to a force or market event that is not due to the buying or selling pressure of investors." In this sense, it makes sense to replicate Kissell's approach. Examples of volume outliers i can think of include "block trades" and off-book trades which are typically for significant trade sizes, and all of which are excluded from our test dataset but could contribute to excess market volume days. Examples of volatility outliers could include many things like a company being cited for accounting irregularities and subsequently losing most of its value in a matter of days or weeks. We saw this on Steinhoff in SA in Dec 2017. Of course the price movement would be a function of excessive selling pressure but is hardly a normal observation and one in which you likely want to exclude from your training dataset.

Related to the above and to the 1st point above in Corner Cases, other reasons for missing data can include any number of corporate actions (new listings, inward secondary listings, delistings, unbundlings, name changes, mergers etc) which should not preclude the analyst from incorporating the data into their model for measuring market impact since the data related to companies affected by these corporate actions is still equally valid data.

Giving the user the option to exclude outliers is a minimum requirement. If the functionality is provided, then the user can train the parameters with and without the outliers and come to their own conclusions as to the benefit of including/excluding them. Likely with a comparison of the error analysis on their OOS datasets, and against customer order data.

Data Splitting

On data splitting i was referring specifically to the computation for volatility when i mentioned measuring with prices from Morning Close to Morning Close and Afternoon Close to Afternoon Close. All other variables will be based on the data from the start of the Morning to the end of the Morning and likewise for Afternoon. So Total Volume, ADV, Imbalance are all "re-scaled".

Regarding midnight crossings, i think we are in agreement this can be left as a TODO considering it digresses from the approach in Kissell and introduces an element of scope creep into the project. Ideas you have for solving that problem can be included in the docs or as a comment for when work is carried out on that solution.

Corner Cases

Regarding your second point in Corner Cases, Kissell does not touch on how he handles auctions (nor block trades or off-book trades for that matter). For this reason i think it should be documented how we would treat them, and excluding them from market impact modeling makes sense to me as auctions are discrete events which can have significant intraday effects on prices and volume, often exacerbated by rebalances (MSCI, headline etc). Large block trades and off-book trades might get highlighted and excluded as outliers, but they also might not. Again, we should document how we would treat them. Ultimately and of course, what data the user ends up using for training their parameters is up to them.

vi-to commented 5 years ago

On the data-splitting just pushed, I should remark that Kissell only briefly mentions it in his "Data definitions". In brief, this is the above-mentioned re-scaling, mostly an adaptation of what the function computed previously on full trading days (which still represents the default behavior).

In the reference, such intraday "sessions" are not further detailed with respect to the rolling variables as these last ones are always considered in terms of day-by-day periods. The daily close-to-close seems to have its natural extension in the close-to-close by sessions. However, I shall remark two differences with respect to the volatility computed this way:

there seems to be a difference with what suggested by @jaymon0703 (morning close to morning close and likewise for the afternoon) and this may be due to his questions as an analyst. I would be happy to further work on it, but in the meanwhile I would like to point out that this variable is a factor of the instantaneous impact equation and should be comparable with the imbalance size;
the "volatility" was annualized since everything was on a daily basis, should the scale now take into account the number of sessions each trading day is split into ?

jaymon0703 commented 5 years ago

Comments related to the most recent commit https://github.com/braverock/blotter/pull/97/commits/eb89b390ff9b2479154dde68b802144a10ff07e5:

few observations:

In order to get 3 sessions i specified


sessions = c("T07:00:00/T10:59:59",
                 "T11:00:00/T15:00:00",
                 "T07:00:00/T15:00:00")


2. The function took 6mins to run for me. I used:

paramsBounds <- matrix(NA, nrow=5, ncol = 2) paramsBounds[1:5, 1] <- c(100, 0.1, 0.1, 0.1, 0.7) # 0 <= b_1 <= 1, 0.7 is an empirical value paramsBounds[1:5, 2] <- c(1000, 1, 1, 1, 1)

t1 <- Sys.time() test_istar <- iStarPostTrade(MktData, sessions = c("T07:00:00/T10:59:59", "T11:00:00/T15:00:00", "T07:00:00/T15:00:00"), paramsBounds = paramsBounds, horizon = 30)

test_istar <- iStarPostTrade(MktData, paramsBounds = paramsBounds, horizon = 30)

t2 <- Sys.time() t2-t1


3. We should warn users to set appropriate timezone for their R session. My input data is GMT+2 but changes to GMT during the function call as my Sys.timezone() defaults to "UTC". This matters for time subsetting and is therefore worth emphasising.

4. Getting a few NAs when computing MktValue...looks like the as.numeric() went missing in this commit.

5. I get an arrCostSamples error again...i get around it by commenting out the data grouping code from lines 398-418.

After applying my fixes for 4 and 5 above, the code runs to completion and i get the below output...which shows extra data giving us increased significance on the parameter estimates as expected, albeit no parameters inside the bounds.

coef(test_istar$nls.impact.fits$nls.fit.instImpact) a_1 a_2 a_3 100.0 0.1 1.0 summary(test_istar$nls.impact.fits$nls.fit.instImpact)

Formula: arrCost ~ a_1 (imbSize)^(a_2) (annualVol)^(a_3)

Parameters: Estimate Std. Error t value Pr(>|t|)
a_1 100.00000 4.77834 20.928 < 0.0000000000000002 a_2 0.10000 0.01798 5.562 0.000000027 a_3 1.00000 0.04573 21.866 < 0.0000000000000002 ***

Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 133.1 on 24917 degrees of freedom

Algorithm "port", convergence message: relative convergence (4)

coef(test_istar$nls.impact.fits$nls.fit.mktImpact) a_4 b_1 0.1 0.7 summary(test_istar$nls.impact.fits$nls.fit.mktImpact)

Formula: arrCost ~ b_1 instImpact (POV)^(a_4) + (1L - b_1) * instImpact

Parameters: Estimate Std. Error t value Pr(>|t|) a_4 0.1000 0.1727 0.579 0.563 b_1 0.7000 1.2782 0.548 0.584

Residual standard error: 133.5 on 24918 degrees of freedom

Algorithm "port", convergence message: both X-convergence and relative convergence (5)



Before we get into data grouping and removing outliers, it may be worth using the function to predict market impact estimates on OOS data. I think this is what the sample rolling variables was for.

Lastly, i think next i am going to try running the model with wider parameter constraints and with no constraints at all. As it is getting late i will save that for this Sunday. That, and reviewing the rolling variable calcs specifically.

All in all good to see progress with data splitting. Thanks @vi-to!

vi-to commented 5 years ago

Thank you @jaymon0703. Below are some further observations and notes.

It is unclear to me why the third session is specified so to overlap with the other two. In this way some data-points are excluded and it would be fine, but some others are duplicated. Sessions are intended to be used to:
- Split a trading day. Considering JSE opening hours with respect to original testing data I have and excluding auctions (i.e. from 9:00 to 16:50, although market is open from 7:00 and the opening auction is 8:35-9:00 https://www.jse.co.za/grow-my-wealth/jse-auction-process), an example of sessions splitting is:
```
sessions = c('T09:00:00/T10:59:59', 'T11:00:00/T12:59:59', 'T13:00:00/T14:59:59', 'T15:00:00/T17:00:00')
```
- Exclude periods within a trading day. For example if your market data contains auctions or events you wish to exclude from the analysis, as for the 11-months dataset.
Agree, the function is slowing down as there are cumbersome additional pieces of work. I am about to share an ~~more efficient~~ *apply version of the data-splitting introduced with eb89b39.

Splitting efficiency: I do care about having efficient constructs. I have not carried formal benchmarking yet, but simple comparisons showed how the two constructs are pretty similar in terms of execution speed. They also seem comparable to me with respect to readability, although perhaps another loop looks deceptive there. Moreover, both relies on good arguments when it comes to consider ideas listed below. I would like to keep the latter (d2acb14) at the moment, so to move on and keep up with the deadlines.

We should also keep into account both the input data size (GB in RStudio) and its variety with respect to several dimensions, which in turn dictates the need for recursive operations all across. So far I have followed the "no additional dependencies" policy and the only significant speed improvement I see with the current structure is with the foreach package. I will leave here other approaches that may significantly speed-up the function, for my own future reference or to share it as a new open issue:
- With the current dependencies and structure, revising the for loops using specific vectorized functions or through *apply. However, there are time-based mechanisms that may lack clarity or become hard/unfeasible to unbundle under this paradigm. Furthermore, often overwriting is intended so that to avoid heavy objects.
- Add new dependencies so to potentially parallelize some bottlenecks. Again, this is not always feasible and needs scrutiny.

You are encouraged to suggest ideas and I will be keen on working on them or on what I may have overlooked.

Agree, I can recall myself going under the same timezone sudden mix months ago. Sys.setenv(TZ) did it. Timezones and their consistency should be left as users responsibility. Have you noticed functions or constructs that do change them?
'MktPrice' and 'MktQty' will shortly be guaranteed to be converted/reconverted as numeric type. However I recommend to refrain from taking non-numeric columns other than 'Reason' into MktData.
Yes, data grouping is a WIP. As its use into the model will be clarified start to end, it will follow an extensive revision of the corresponding code (this is another part perhaps the author leaves a bit in the shadow, will be among the very next steps to address). Commenting out that code chunk for now avoids stopping errors circumstances, but for sake of doing things the right way I will add a convenience param that may be nonetheless useful later.

braverock commented 5 years ago

on the lag and scaling: Kissell uses sqrt(250) everywhere for scaling volatility. This indicates that he is using daily close to close volatility, which means that his closes are at the same time each day.

250 isn't quite right for most markets, as most markets have slightly more or slightly fewer business days per year than 250. It is also true that scaling by the square root of time is not the greatest scaling mechanism in the world, but it is ubiquitous, and widely understood to be the 'correct' way of doing things. So deviating from the square root of time scaling would both be against what Kissell clearly does and against standard practice, even where better models exist.

No, on to close to close volatility. Close to Close volatility assumes the same closing time each day, e.g. noon, or 4pm, and takes the volatility of that approximately 24-hr gap. The weekend longer gap is presumed to be no different than the overnight/daily 24 hr gap gap for the purposes of simplicity (this is another simplifying assumption that more sophisticated models do not make, as weekend volatility is sometimes different from standard overnight volatility). If you have mismatched close times, the sqrt(250) scaling approximation above would also need to be adjusted to better account for the time not captured, or captured twice.

So my read of Kissell is pretty standard close to close volatility, even for his multiple sessions per day, the close to close volatility would be from close of session to close of the same session the: next day.

e.g. daily close to daily close morning close to morning close

There are obviously lots of different volatility models. I've given Jasen implementations of some of the standard ones using OHLC data, so it is possible some of that may be able to be included if time allows.

vi-to commented 5 years ago

As briefly mentioned, Kissell's "Data Definitions" discussion is based on day-by-day periods (T = 10, 22, 30, or 66 days). Here it is evident how the number of business days in a given year (and in a certain market) is taken as a scaling factor and this was implemented using sd.annualized, before the data-splitting mechanism addition (note that this is currently the fallback when one specifies a single session). When the data-splitting actually takes place, data-points increase and this appear to make the daily volatility somewhat insufficient.

The main concern I have on using a volatility from session-close to next-day-session-close is the look-ahead bias. Consider the instantaneous impact equation, we would be evaluating an imbalance size computed on sessions close-to-close and a volatility on session close to next day session close. Perhaps I am missing something, but it seems to me that this way the latter includes information not reflected in the per-session imbalances.

Furthermore, when volatility is computed as the standard daily close-to-close there is of course no guarantee that a security traded each and every day within the same timestamps (and at the same pace), i.e. that an exact 24-hrs period has passed. I understand there could be overnight (or sessions closing, so to speak) or even other effects an analyst may be interested in researching, but how would a "session close-to-close volatility" conceptually differ from the standard daily one with respect to pure timestamps arguments?

I agree with you, there is and will be still a fair amount of simplifications. But I believe the scaling factor to be the most important aspect to address in our modeling context so far.

braverock commented 5 years ago

you shouldn't have look ahead bias in either calculation @vi-to.

you are never using t+1 observations, always time t, and time t-1 (days in this case)

so the 'daily' sqrt(250) scaling still holds in that case.

braverock commented 5 years ago

see ?volatility in TTR.

It seems best to me to not reinvent the wheel here, since TTR is already a dependency of blotter, we should probably use the standard code instead of trying to rehash research that has been written about for decades.

vi-to commented 5 years ago

@braverock thank you for your remarks and for having pointed out a great function I was not aware of.

On the ongoing discussion, the proposed scale = yrBizdays * length(sessions) is an interpretation of what discussed in your sd.annualized() docs: "[...] To normalize standard deviation across multiple periods, we multiply by the square root of the number of periods we wish to calculate over. To annualize standard deviation, we multiply by the square root of the number of periods per year." When length(sessions) = 1, that is we extract only one session from each trading day, the number of periods considered in a year is equal to yrBizdays (the usual daily scaling factor, also used by Kissell in is his daily setting discussion). My doubt rather concerns the scaling factor when the splitting actually takes place, that is when we extract more than a single period from the same trading day and as a consequence we have a total of yrBizdays * length(sessions) periods in a given year. In this case, log-returns are not daily close-to-close and the volatility to be annualized is not from a daily basis. What I have in mind is:

sigma_annualized = sigma_session * sqrt(yrBizdays * length(sessions))

On using volatility() I would be interested, for example, in the Garman-Klass estimator or its Yang-Zhang modification. However, although there could be solid reasons to do so, this will result in taking a different approach than the authors' original model.

Apart from this, I shall at least colloquially express a couple of considerations which may be worth emphasizing in the vignette discussion to provide an idea. They are both based on how the time aggregation is thought to influence volatility fluctuations measurements. On one extreme, I am not concerned with potential longer horizons scaling issues (https://www.sas.upenn.edu/~fdiebold/papers/paper18/dsi.pdf). Indeed, this can be seen as out of scope with respect to the model (as we work with daily data at most), but is pointed out to give a sense of the influence the scaling factor can have. On the other extreme, the frequency sessions-based splitting will specify should be examined. Reportedly, at a high-frequency intraday level - to simplify, say minute by minute up until ~30mins periods - noise is likely to be present and when not accounted for it would produce biased volatility estimations. On the other hand, when the splitting consists of wider intraday periods, say hourly ones, we would likely not face such issues as the aggregation should suffice to attenuate noise.

In the authors' modeling context and illustrations thereof, these concepts looks fairly unexplored during my study of their model so far. The data-splitting is not discussed and the entire model discussion relies on daily horizons. Of course, motivations I provided above lacks mathematical formalism, proper quantifications and references I would certainly add thereafter. Doing so precisely in the next few days is perceived to be out of our project timeline, as you said there are decades of research on it. Surely enough, I am willing to further elaborate and work on them in the near future.

vi-to commented 5 years ago

Last commits added the data grouping step Kissell proposes and claims to use before getting into the nonlinear estimation. As argued in the docs update, I regard this step as having advantages as well as disadvantages; a great advantage is the assist it can provide to the nonlinear model subsequent estimation procedure in that groups-dependent "outliers" can be excluded, on the other hand sometimes a concerning aspect is the amount of data that can be excluded from it. The same arguments hold for the case the datapoints are determined as groups means. In both cases and regards, many other user-specified variables play a role: first of all the size of the initial dataset, then the sessions, horizon and minGroupDps. This is why the data grouping is not carried by default and is left as a user option.

Let's have a visual example that shows a comparison among different datapoints sets one may obtain, depending on the specifications. The original dataset is JSE 5-months on 40 stocks, using sessions = c('T09:00:00/T13:00:00', 'T13:00:01/T17:00:00') and others params being the same. Plots are produced with scatterplot3d::scatterplot3d.

datapoints-multiangle-comparison

Below plots are to give a better perspective on the volatility. Please note that the umber of datapoints is provided to give an idea, but it of course remains only an example.

By and large, it would be helpful for a user to have such visualizations at hand and I would be happy to include them in a plotting function with other plots we are working on. I was looking for an easy function to do exactly this. @braverock it is understood it would be better to have alternative solutions, but do you think it could be a feasible addition in the meanwhile?

jaymon0703 commented 5 years ago

Thanks Vito, looking really good so far! My comments from today's review:

I believe one of the intervals for consideration should include the interval with 0 lower bound. I see no reason to exclude these observations.
As discussed and for future refactoring and/or testing, i think expand.grid should be sufficient for building the vectors of buckets. It may simply mean slightly cleaner code, that's all.
For the grouped variables, my understanding of the text is Kissell simply averages the arrival cost but maintains each observation in the buckets such that the number of data points is not reduced, instead arrival cost is simply averaged for each observation in the bucket.
Lastly, and as a note for when i get to poking at this again, when using the below call i get an inconsistency with the length of arrCost and the other variables after the first grouping is complete.

paramsBounds <- matrix(NA, nrow=5, ncol = 2)
paramsBounds[1:5, 1] <- c(100, 0.1, 0.1, 0.1, 0.7) # 0 <= b_1 <= 1, 0.7 is an empirical value 
paramsBounds[1:5, 2] <- c(1000, 1, 1, 1, 1)

test_istar <- iStarPostTrade(MktData, sessions = c("T09:00:00/T13:00:00",
                                                   "T13:00:01/T17:00:00"), 
                             paramsBounds = paramsBounds, horizon = 30,
                             grouping = c(TRUE,TRUE))

vi-to commented 5 years ago

Thank you very much for reviewing and mentoring overall @jaymon0703 .

In my reading of Kissell, his provided sequences do not explicitly include zero bounds nor it is stated otherwise. Detailed explanations are omitted, would you please provide further motivations on why you think those points should be included?
I will provide reasons why excluding them might not completely come as a surprise:
- A simple conceptual reason is that datapoints in the region you are referring to are points not enough dispersed to uncover interesting statistical relations on the dataset. This also speaks to the "non-outlier" argument you said, although I believe it should be interpreted the other way around.
- A perhaps more convincing reason is about problems arising when fitting "zero-residual data", arguments too wide in scope to discuss here, see ?nls warning especially on the 'port' algorithm we are using to include Kissell's parameters bounds. Simplifying, we are estimating a rather complicated nonlinear model (the full one) and those datapoints clutter the * iterative procedure with potential convergence failures or global best fit ambiguities reflected in apparently "stable" parameters combinations.

However, looking at data as showed in plot above I am not even sure we have any point there. If you prefer to have some hints from what the original author does, check his plots and you will notice that no point that close to zero appears to be considered.

Even though the number of datapoints the author uses for the regression seems to be 180,000, around x120 times the grouped datapoints obtained in the example above. However, the stocks universe he considers is the S&P1500, much wider than ours. Furthermore it is not clear if any other datapoint is excluded in the grouping step, although this reasonably depends on the dataset at hand and should not invalidate the above.

Anyways, you are encouraged to specify groupsBrounds of your choice and report results if you want.

I look forward to simply that part and would be happy to hear alternative solutions. If you found a cleaner version please share it. On using expand.grid, the last solution proposed (61169bd) is easier than what previously shared and in line with original intentions.
I doubt this to be possible. We use datapoints for regression and what you propose implies having a LHS with less elements than the RHS. The arrival cost, as the explanatory variables, is averaged per bucket.
With the same call I obtain consistent results. Not sure why you have this problem, but clearly it should not happen.

Screenshot 2019-08-10 at 21 09 20

jaymon0703 commented 5 years ago

The only explicit exclusions in Kissell are for outliers and observations not making up 25 or more observations for a given bucket. An observation in the first bucket with a zero lower bound is still an observation. Nevertheless, upon subsequent testing perhaps either of us will be swayed one way or the other.
Sounds like you are making progress on a solution with expand.grid so i will hold off providing my suggestion as it needs to be tested. Nevertheless it would have started with something like:

expand.grid(list(imbBounds, volBounds, povBounds))

What i am proposing based on my interpretation of Kissell implies an equal number of elements for arrival cost and other variables, just that arrival cost will be identical in each bucket for each of the given bucket variables. My understanding is this will average away noise. Ultimately, testing the model with different interpretations can be carried out to understand what works and what may be Kissell's actual meaning.
The docs for type.convert() imply its main application is for read.table(), with additional params for dealing with different input data types. I have not figured out why converting to numeric from character yields an integer type for me, but secMktData <- sapply(secMktData[, colnames(secMktData) != 'Reason'], 'as.numeric') gives me the desired output.

General update on my testing so far with 11m data. With grouping we get param estimates on the bounds, with a1=100 and a2:b1 = 1. When b1=1 that implies 100% temporary impact. This feels unlikely in production, and whilst it can be constrained with a param bound <1 (say 0.95 for example) i would prefer if the model converged on a solution below 1 without my imposed constraint.

When testing on the full dataset i get a1=100, a2 = 0.2188, a3=1, a4=0.1 and b1=0.7. This feels like more stable and robust parameters, as a2 is inside the bounds, a4<1 implies a non-linear cost impact as a function of POV and b1<1 means temporary impact is not 100%.

braverock / blotter

Add Kissell-Malamut I-Star model #97

Outlier analysis

Data splitting (augmentation)

Outlier Filtering

Midnight Crossing Times

Outlier analysis

Data splitting (augmentation)

Corner cases

test_istar <- iStarPostTrade(MktData, paramsBounds = paramsBounds, horizon = 30)

Parameters: Estimate Std. Error t value Pr(>|t|) a_1 100.00000 4.77834 20.928 < 0.0000000000000002 a_2 0.10000 0.01798 5.562 0.000000027 a_3 1.00000 0.04573 21.866 < 0.0000000000000002 ***

Parameters: Estimate Std. Error t value Pr(>|t|)
a_1 100.00000 4.77834 20.928 < 0.0000000000000002 a_2 0.10000 0.01798 5.562 0.000000027 a_3 1.00000 0.04573 21.866 < 0.0000000000000002 ***