orborde closed this pull request 5 years ago
William,
Sorry that it's taken me a while to reply. I appreciate your suggestions and you raise some good points. I hope I don't sound negative, but I wanted to be direct in my reply for the sake of clarity.
I would prefer to skip the changes to 00_Startup.R because, as you noted, I don't want to load every library at the start of the project; I only want the startup step to load the libraries it actually needs. After all, there are more libraries needed for the various reports and the knitr document, which might be the only thing that some people are using.
I can't remember exactly why Rcpp is an exception that gets loaded at startup, but I'm pretty sure there was a good reason; probably because its installation is a little more fundamental than that of other packages, and/or because Rcpp is a dependency for several other packages.
First, each script in this project is supposed to be independent, depending only on stored data objects and functions within the project; every object within a script is created within that script. Sorry, but that's a key (if unstated) part of the project's design. For me to go down your proposed route, I would want the 29 script to be a function, not a script.
Second, this script does not conform to the philosophy of the 10, 20, 30 organization pattern. Variable selection is part of the modeling process and belongs in the model step (the 30 step in this case), so the construction of the xmat belongs in that script. At first the concept of a 29 "finalizer" step makes sense, but I think you'd run into problems once you had different models that use different inputs. For example, you might have a 31 that was a new model, and a 32 that was that model with three new variables added.
However, I'm not even entirely following my own rules! The 30 script is doing too much right now; it's messy and there's exploratory code in there. This is closely related to issue #64. After conversations with others, I (and hopefully we) have reached the conclusion that it would be better if the 30 step just ran the model and saved the prediction, and the prediction was then compared in a separate place.
Maybe that comparison should happen in a script, as you have worked up, but I was thinking of putting the comparison into an .Rmd file and treating it as a report. Another idea (inspired by your work) would be to create a "calculate lift" function that we could run at the end of the 30 script, or in a report. I like the lift calculation.
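(For concreteness, here is a minimal sketch of what such a "calculate lift" function might look like. The function name `calculate_lift` and the column names `predicted_score` and `criticalFound` are illustrative assumptions, not anything currently in the project.)

```r
## Illustrative sketch only -- not the project's actual code.
## Lift here = hit rate among the top-scored establishments divided by
## the overall hit rate. Column names are placeholders.
calculate_lift <- function(dat, score_col = "predicted_score",
                           outcome_col = "criticalFound", top_frac = 0.1) {
  dat <- dat[order(-dat[[score_col]]), ]        # rank by predicted score, highest first
  n_top <- ceiling(nrow(dat) * top_frac)        # size of the top slice
  top_rate  <- mean(dat[[outcome_col]][seq_len(n_top)])
  base_rate <- mean(dat[[outcome_col]])
  top_rate / base_rate
}
```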
As an aside, I've been thinking that it would be better still to make this evaluation code more like a package that could be used universally on other, similar problems. Currently, several field names are hard-coded into the functions, and the functions are specific to this project.
I think it's unlikely at this point that I'll work on this PR further. Sorry! You should either close it or merge it as-is.
@orborde I really appreciate your contribution.
I agree with your sentiment, but I don't think it's worth accepting this pull request at this time.
The format for the food inspections has fundamentally changed, and we are working to revamp the prediction model.
Given the fundamental changes that are needed, I think this is also a good time to consider other frameworks. Please email us at datascience@cityofchicago.org if you have ideas.
This batch of changes is oriented around the creation of a new script, 40_evaluate_schedule.R, which evaluates a proposed inspection pattern's performance versus the "business as usual" benchmark. It accepts the proposed inspection pattern as a CSV input; read the notes at the top of the new script for more details.
Currently, the only evaluation actually performed is the "Percentage of period 1 & 2 critical violations found in Period 1" comparison from the white paper; I plan to port the other evaluations later, once this pull request has been accepted and the general architectural approach thereby approved.
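Roughly, that metric asks: of all critical violations found across the full test window (periods 1 and 2), what share would be caught in period 1 under the proposed schedule? A minimal sketch of that calculation, assuming hypothetical columns `period` and `criticalFound` in the evaluation data:

```r
## Illustrative sketch; the column names `period` and `criticalFound`
## are assumptions, not the script's actual interface.
pct_crit_found_in_period1 <- function(results) {
  total_crit   <- sum(results$criticalFound[results$period %in% c(1, 2)])
  period1_crit <- sum(results$criticalFound[results$period == 1])
  100 * period1_crit / total_crit
}
```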
This pull request also introduces a new script, 29_init_with_data.R, whose purpose is to be source()'d by scripts to load the dataset into a useful collection of variables. 40_evaluate_schedule.R uses this new script to load the "train" and "test" sets in order to run its comparison. The intent is that this new script will also be used to dump the "train" and "test" sets to CSV files to make prototyping future models easier, but this is not yet implemented.
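For illustration, an init-with-data script of this kind might look roughly as follows; the file path and object names (`dat`, `dat_train`, `dat_test`) are placeholders, not the actual names the script defines:

```r
## Sketch of 29_init_with_data.R (illustrative only; paths and names are placeholders)
dat <- readRDS("DATA/dat_model.Rds")      # prepared modeling data
dat_train <- dat[dat$Test == FALSE, ]     # training window
dat_test  <- dat[dat$Test == TRUE, ]      # evaluation window

## A downstream script then pulls these objects into its environment with:
## source("CODE/29_init_with_data.R")
```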
30_glmnet_model.R has been modified (1) to load data by source()'ing 29_init_with_data.R, and (2) to generate an output CSV file readable by 40_evaluate_schedule.R.
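In outline, the modified flow would look something like this; the file paths, column names, and stand-in score vector below are placeholders for illustration, not the actual code in the script:

```r
## Sketch of the revised 30_glmnet_model.R flow (illustrative only)
source("CODE/29_init_with_data.R")   # (1) load the train/test data objects

## ... fit the glmnet model and score the test set as before,
## producing a vector of predicted scores for dat_test ...
predicted_score <- runif(nrow(dat_test))   # stand-in for the real model scores

## (2) write a prediction file that 40_evaluate_schedule.R can read
write.csv(data.frame(inspection_id   = dat_test$Inspection_ID,  # placeholder column
                     predicted_score = predicted_score),
          "DATA/proposed_schedule.csv", row.names = FALSE)
```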
(Note: this pull request includes the changes from pull request #72; the expectation is that you will merge #72 before accepting this one.)