Robinlovelace / simodels

https://robinlovelace.github.io/simodels
GNU Affero General Public License v3.0
How should the {si} package work? / Roadmap ideas #5

Open Robinlovelace opened 2 years ago

Robinlovelace commented 2 years ago

I'm looking for feedback from anyone with experience of SIMs in terms of:

Currently (2022-04-22) the function used to predict interaction is called si_predict() and works like this:

https://github.com/Robinlovelace/si/blob/d9ae80e683b316d619f3a8843f2a7d138c7d3b1f/README.qmd#L40-L53

That is likely to change to a tidy-eval framework in #10.

Previous questions (now mostly answered) related to this:

Nowosad commented 2 years ago

Hi @Robinlovelace -- I think you forgot to add an example of how this function works currently.

Robinlovelace commented 2 years ago

True that, updated, thanks for the heads-up @Nowosad and looking forward to doing some geocomputing with you soon!

adamdennett commented 2 years ago

Will try to add some proper thoughts when I'm back at work next week (or, more likely, the week after). Immediate thoughts on functionality that would be useful: as well as various options to calibrate cost or distance / origin / destination parameters against observed data with Poisson / negative binomial regression, the ability to input one's own parameter guesses (i.e. 1 for origin, -1.5 for distance) would be really useful for rough-and-ready flow estimating.
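To make the "own parameter guesses" idea concrete, here is a minimal sketch of an unconstrained gravity estimate, T_ij = O_i^a * D_j^b * d_ij^c, with the exponents supplied by hand rather than calibrated. It is written in plain Python purely for illustration and is not the package's API: the function name `gravity_flows`, the zone layout and the destination exponent default of 1 are all assumptions, and only the origin (1) and distance (-1.5) values come from the comment above.

```python
from math import hypot

def gravity_flows(origins, destinations, origin_exp=1.0, dest_exp=1.0, dist_exp=-1.5):
    """Unconstrained gravity estimate with user-supplied exponents.

    Each zone is (size, x, y). Returns {(origin, destination): flow}.
    Hypothetical helper for illustration only.
    """
    flows = {}
    for o_name, (o_size, ox, oy) in origins.items():
        for d_name, (d_size, dx, dy) in destinations.items():
            if o_name == d_name:
                continue  # skip intra-zonal flows, where distance is zero
            # Straight-line distance; distinct zones at the same point would
            # raise ZeroDivisionError with a negative distance exponent.
            dist = hypot(ox - dx, oy - dy)
            flows[(o_name, d_name)] = (
                (o_size ** origin_exp) * (d_size ** dest_exp) * (dist ** dist_exp)
            )
    return flows

# Made-up zones: name -> (size, x, y)
zones = {"A": (100.0, 0.0, 0.0), "B": (400.0, 3.0, 4.0), "C": (50.0, 0.0, 1.0)}
flows = gravity_flows(zones, zones)

for pair, flow in sorted(flows.items()):
    print(pair, round(flow, 1))
```

The outputs are relative interaction scores, not calibrated trip counts. To compare them with observed flows they would still need rescaling, for example to a known total.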

TaylorOshan commented 2 years ago

One thing that might be useful to consider if incorporating constrained models estimated via GLMs is the use of sparse matrices to accommodate design matrices dominated by binary indicator variables. Unnecessary if instead using the multiplicative form, but could be nice to have both. Another consideration is metrics for evaluating predictions, such as comparing matrices, out-of-sample methods, SSI, etc.
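As a rough illustration of the metrics point, the sketch below compares observed and predicted flows with two measures. It assumes that SSI refers to the Sørensen similarity index and uses one common formulation (the mean of 2 * min(obs, pred) / (obs + pred) over origin-destination pairs), with RMSE shown alongside. The function names and the flow values are hypothetical, and other definitions of SSI exist.

```python
from math import sqrt

def sorensen_similarity(observed, predicted):
    """Mean of 2*min(obs, pred) / (obs + pred) over OD pairs; 1.0 means identical."""
    total = 0.0
    for obs, pred in zip(observed, predicted):
        denom = obs + pred
        # Treat a pair with zero observed and zero predicted flow as a perfect match.
        total += 1.0 if denom == 0 else 2.0 * min(obs, pred) / denom
    return total / len(observed)

def rmse(observed, predicted):
    """Root mean squared error between two equal-length flow vectors."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(squared / len(observed))

# Made-up flows for four OD pairs (flattened OD matrices)
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 30.0, 35.0]

ssi = sorensen_similarity(observed, predicted)
err = rmse(observed, predicted)
print(f"SSI: {ssi:.3f}, RMSE: {err:.3f}")
```

For an out-of-sample check, the same functions could be applied to flows held back from calibration.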

Robinlovelace commented 2 years ago

Hi Taylor, thanks for the input. As per #14 and #15, I think the greatest 'added value' of this approach could be the geographic pre-processing, plus the flexibility for people to use whatever modelling framework they want as input to si_calculate() (which takes hard-coded SIM functions) and si_predict() (which takes model objects as the first input). I lack deep experience with SIMs, so I defer to the judgement of others on that side of things. To be honest, I don't 100% understand what design matrices dominated by binary indicator variables are. Would that not be handled by the predictive model, e.g. glm() in base R or the nlsLM() function from the minpack.lm package, as outlined in the introductory si vignette?

Agree re. metrics for evaluating predictions. I plan to discuss this with @lenkahas on Friday, although we still need to get the foundations right first, e.g. #16 is the priority at the moment.

TaylorOshan commented 2 years ago

That makes sense @Robinlovelace, very cool. In that case, you can leave the bespoke data structures up to the individual packages doing the calibration.

Apologies for the ambiguity: the design matrix here is the set of columns of input data used for the regression. In the case of the singly constrained model, a Poisson linear regression with fixed effects for the set of locations will generate the same coefficient estimates and predicted values as directly using nonlinear optimization (based on a multinomial distribution) for the multiplicative form from Wilson. The fixed effects in the Poisson linear regression are typically included in the design matrix as a binary indicator/dummy variable for each location, which makes the design matrix very sparse for even a moderate number of locations. This is not an issue if you are using a different calibration technique and, as you mentioned here, it is more of a downstream issue when the function is supplied by the user.
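A small sketch of why the dummy-variable block gets sparse: with one indicator column per origin and one row per ordered origin-destination pair, each row holds a single 1, so the share of non-zero cells is 1 / (number of locations). The code stores only the non-zero entries as a dictionary. The function names are hypothetical, and real GLM software would usually drop one reference level, which this sketch ignores.

```python
def origin_dummy_matrix(n_locations):
    """Origin fixed-effect dummies for all ordered OD pairs, excluding i == j.

    Returns (entries, shape), where entries maps (row, column) -> 1 for the
    non-zero cells only, i.e. a dictionary-of-keys sparse encoding.
    """
    entries = {}
    row = 0
    for origin in range(n_locations):
        for destination in range(n_locations):
            if origin == destination:
                continue
            entries[(row, origin)] = 1  # exactly one non-zero per row
            row += 1
    return entries, (row, n_locations)

def density(entries, shape):
    """Share of cells in the matrix that are non-zero."""
    return len(entries) / (shape[0] * shape[1])

entries, shape = origin_dummy_matrix(50)
print(shape, len(entries), density(entries, shape))
```

A dense version of the same block would store every cell, which is the overhead a sparse matrix type avoids.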

Would be interested to hear your and @lenkahas's thoughts on metrics once you've had a chance to discuss!