How should the {si} package work? / Roadmap ideas

Robinlovelace commented 2 years ago

I'm looking for feedback from anyone with experience of SIMs in terms of:

- [ ] How can we avoid reinventing the wheel? The aim is for modelling functions in {spflow}, {gravity} and other packages to be easy to use via si_predict(). The plan is to add these packages to Suggests and to put examples implementing them into articles/vignettes.
- [ ] What additional functionality would be most useful? Currently the main function is actually focussed on pre-processing: si_to_od() creates an 'analysis ready' (and modelling ready) data frame with all the variables from origins and destinations you could need.
  - [ ] Functions like si_model_exponential_decay() and si_model_power() to get people started quickly without having to define their own functions
  - [ ] Implementation of the radiation model, previously implemented in {stplanr} and in scikit-mobility
  - [ ] More example datasets?
- [x] Tidy or standard evaluation?
- [ ] Anything else?

Currently (2022-04-22) the function used to predict interaction is called si_predict() and works like this:

https://github.com/Robinlovelace/si/blob/d9ae80e683b316d619f3a8843f2a7d138c7d3b1f/README.qmd#L40-L53

That is likely to change to a tidy-eval framework in #10.

Previous questions (now mostly answered) related to this:

- [x] Should it be called si_predict(), perhaps with another function, e.g. called si_train(), to train models (constrained/unconstrained)?
  - Yes, now implemented
- [x] Should the first argument of the fun argument be an od object (I'm currently thinking not, as that arg is already in si_model(), heads up @Nowosad)?
  - [x] I don't think so, implemented in #10
- [x] How should custom SI prediction functions, e.g. si_gravity(), work? I'm thinking as simple as possible would be good, enabling commands such as si_predict(od, fun = si_gravity(m = origins_population, n = destinations_population, distance = distance_euclidean))
  - Partially implemented in #10
- [x] Related to the previous question, should we use tidy evaluation (it is currently being used with var_p)?
  - Implemented, now constraint_p
- [x] More broadly, which conventions should we follow in terms of symbols used for SIM equations? E.g. Wilson's 1979 paper uses w_1/w_2, while some more recent papers (e.g. Simini's 2012 paper) use m/n throughout.
  - Going with notation in Dennett's 2018 paper
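
To make the si_gravity() question above concrete, here is a minimal base R sketch of the kind of function that command implies. It is an illustration only: gravity_sketch(), the beta and k arguments, and the toy data are hypothetical and not part of {si}; only the column names are taken from the command quoted above.

```r
# Hypothetical helper, NOT part of {si}: unconstrained gravity model,
# interaction = k * m * n / distance^beta
gravity_sketch = function(m, n, distance, beta = 2, k = 1) {
  k * m * n / distance^beta
}

# Toy OD data frame (made-up values); column names follow the command above
od = data.frame(
  O = c("a", "a", "b", "b"),
  D = c("x", "y", "x", "y"),
  origins_population = c(100, 100, 200, 200),
  destinations_population = c(50, 80, 50, 80),
  distance_euclidean = c(1, 2, 2, 4)
)

# Apply the function row-wise to the OD pairs
od$interaction = with(
  od,
  gravity_sketch(origins_population, destinations_population, distance_euclidean)
)
od
```

Keeping the function a plain vectorised function of three columns means it can be tested on its own before being passed to any prediction wrapper.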

Nowosad commented 2 years ago

Hi @Robinlovelace -- I think you forgot to add an example of how this function works currently.

Robinlovelace commented 2 years ago

True that, updated, thanks for the heads-up @Nowosad and looking forward to doing some geocomputing with you soon!

adamdennett commented 2 years ago

Will try to add some proper thoughts when I'm back at work next week (or more likely the week after), but here are some immediate thoughts on functionality that would be useful. As well as various options to calibrate cost or distance / origin / destination parameters with observed data and Poisson / negative binomial regression, functionality to input one's own parameter guesses (i.e. 1 for origin, -1.5 for distance) would be really useful for rough-and-ready flow estimation.
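
A rough sketch of what supplying one's own parameter guesses could look like, in base R. Everything here is an assumption for illustration: estimate_flows() is a hypothetical name, the destination exponent of 1 is not mentioned above, and rescaling to a known total is just one simple way to turn unitless scores into flows.

```r
# Hypothetical helper, NOT an {si} function: uncalibrated flow estimates
# from user-supplied exponents, optionally rescaled to a known trip total
estimate_flows = function(origin, destination, distance,
                          mu = 1,        # origin exponent (guess from the comment)
                          alpha = 1,     # destination exponent (assumed here)
                          beta = -1.5,   # distance exponent (guess from the comment)
                          total = NULL) {
  score = origin^mu * destination^alpha * distance^beta
  if (is.null(total)) {
    return(score)
  }
  # Rescale so that the estimated flows sum to the known total
  score * total / sum(score)
}

# Made-up inputs for four OD pairs
flows = estimate_flows(
  origin = c(100, 100, 200, 200),
  destination = c(50, 80, 50, 80),
  distance = c(1, 4, 4, 1),
  total = 1000
)
round(flows)
sum(flows)  # matches the requested total
```

No observed flows are needed, which is the point of a rough-and-ready estimate; calibration against data would replace the guessed exponents.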

TaylorOshan commented 2 years ago

One thing that might be useful to consider if incorporating constrained models estimated via GLMs is the use of sparse matrices to accommodate design matrices dominated by binary indicator variables. Unnecessary if instead using the multiplicative form, but could be nice to have both. Another consideration is metrics for evaluating predictions, such as comparing matrices, out-of-sample methods, SSI, etc.
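
The sparsity point can be illustrated with a small sketch. It assumes the Matrix package (a recommended package shipped with most R installations) and a made-up OD table with 50 zones; it only compares storage and does not fit a model, since base glm() builds a dense design matrix internally.

```r
library(Matrix)  # provides sparse.model.matrix()

n_zones = 50
# All origin-destination pairs between the (made-up) zones
od = expand.grid(O = factor(seq_len(n_zones)), D = factor(seq_len(n_zones)))

# Design matrix with one dummy variable per origin (origin fixed effects),
# built as a dense base R matrix and as a sparse matrix
X_dense = model.matrix(~ O, data = od)
X_sparse = sparse.model.matrix(~ O, data = od)

dim(X_dense)              # one row per OD pair, one column per coefficient
mean(X_dense != 0)        # share of non-zero entries
object.size(X_dense)      # memory used by the dense version
object.size(X_sparse)     # memory used by the sparse version
```

The share of non-zero entries falls as the number of zones grows, so the gap between the two representations widens for larger study areas.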

Robinlovelace commented 2 years ago

Hi Taylor, thanks for the input. As per #14 and #15, I think the greatest 'added value' of this approach could be the geographic pre-processing, plus the flexibility for people to use whatever modelling frameworks they want as inputs into the si_calculate() function (which takes hard-coded SIM functions) and the si_predict() function (which takes model objects as the first input). I lack deep experience with SIMs and as such defer to the judgement of others on that side of things. To be honest, I don't 100% understand what design matrices dominated by binary indicator variables are. Would that not be handled by the predictive model, e.g. glm() in base R or the nlsLM() function from the minpack.lm package, as outlined in the introductory si vignette?

Agree re metrics for evaluating predictions, plan to discuss this with @lenkahas on Friday, although still need to get the foundations right e.g. #16 is the priority ATM.

TaylorOshan commented 2 years ago

That makes sense @Robinlovelace, very cool. In that case, you can leave the bespoke data structures up to the individual packages doing the calibration.

Apologies for the ambiguity: the design matrix here is the set of columns of input data used for the regression. In the case of the singly constrained model, a Poisson linear regression with fixed effects for the set of locations will generate the same coefficient estimates and predicted values as directly using nonlinear optimization (based on a multinomial distribution) for the multiplicative form from Wilson. The fixed effects from the Poisson linear regression are typically included in the design matrix using a binary indicator/dummy variable for each location, which causes the design matrix to become very sparse for even a moderate number of locations. This is not an issue if you are using a different calibration technique and, as you mentioned here, it is more of a downstream issue with the function being supplied by the user.
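
A small base R sketch of the equivalence described above, under stated assumptions: the data are simulated, the column names (W for destination attractiveness, d for distance) are made up, and only the predicted-values half of the claim is checked. It fits a Poisson regression with origin fixed effects, then plugs the estimated exponents into the multiplicative production-constrained form and compares the two sets of predictions.

```r
set.seed(1)
n = 5
od = expand.grid(O = factor(1:n), D = factor(1:n))
W = runif(n, 50, 500)                     # made-up destination attractiveness
od$W = W[as.integer(od$D)]
od$d = runif(nrow(od), 1, 20)             # made-up distances
od$flow = rpois(nrow(od), lambda = 2 * od$W^0.8 * od$d^-1.2)

# Poisson regression with a fixed effect (dummy variable) for each origin
fit = glm(flow ~ O + log(W) + log(d), family = poisson, data = od)
alpha = coef(fit)[["log(W)"]]
beta = coef(fit)[["log(d)"]]

# Multiplicative production-constrained form with the same exponents:
# T_ij = A_i * O_i * W_j^alpha * d_ij^beta
od$score = od$W^alpha * od$d^beta
O_i = ave(od$flow, od$O, FUN = sum)       # observed total outflow per origin
A_i = 1 / ave(od$score, od$O, FUN = sum)  # balancing factor per origin
od$multiplicative = A_i * O_i * od$score

# Compare GLM fitted values with the multiplicative predictions
all.equal(unname(fitted(fit)), od$multiplicative, tolerance = 1e-6)
```

The agreement follows from the origin dummies forcing the fitted flows from each origin to sum to the observed outflow, which is the role the balancing factor A_i plays in the multiplicative form.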

Would be interested to hear your and @lenkahas's thoughts on metrics once you've had a chance to discuss!

Robinlovelace / simodels
