Chicago / food-inspections-evaluation

This repository contains the code to generate predictions of critical violations at food establishments in Chicago. It also contains the results of an evaluation of the effectiveness of those predictions.
http://chicago.github.io/food-inspections-evaluation/

Facility for model comparison #64

Open fgregg opened 9 years ago

fgregg commented 9 years ago

When developing alternate models, a final comparison script will facilitate evaluation of the models against one another.

orborde commented 9 years ago

Looking at the code and whitepaper, it seems that you evaluated the model by using the glm output to create an inspection "schedule" (a list of the order in which to conduct inspections) and then analyzing how quickly that schedule located the violations, as opposed to looking at the model confusion matrix or other traditional measures of model performance.

So an evaluation script should probably take the "schedule" as input and rerun the analyses in the white paper to compute some metrics. I'm planning on hacking one together today in the course of trying some other ML techniques on this dataset.
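Roughly what I have in mind, with placeholder column names (`score`, `criticalFound`) since I haven't mapped these to the actual field names yet:

```r
## Sketch of a schedule evaluation. Assumes a data.frame `dat` with one row per
## inspection, a model score column `score`, and a 0/1 outcome `criticalFound`;
## the names are placeholders, not necessarily the repo's actual columns.

evaluate_schedule <- function(dat, score_col = "score", outcome_col = "criticalFound") {
  ## The "schedule": inspections ordered from highest to lowest predicted risk
  schedule <- dat[order(-dat[[score_col]]), ]

  ## Cumulative count of violations found as we work down the schedule
  found <- cumsum(schedule[[outcome_col]])
  n     <- seq_len(nrow(schedule))

  data.frame(
    inspections_done = n,
    violations_found = found,
    frac_found       = found / sum(schedule[[outcome_col]]),  # how quickly violations are found
    hit_rate         = found / n                              # violations per inspection performed
  )
}
```

The same function can be run on the observed inspection order to get a comparison curve for any candidate schedule.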

geneorama commented 9 years ago

@orborde you are exactly correct. The individual glm scores are used to sort the inspections into a schedule, and that schedule is more important than the individual scores. I don't know of a way to directly optimize on the schedule performance. Hopefully optimizing the scores results in a better schedule. (btw, thanks for introducing the word schedule. That's a useful addition to the vocabulary of this project.)

geneorama commented 9 years ago

Sorry that this has been taking so long; I've been busy with a few other things.

Here's an update on what I'm thinking for the plan:

Refactor the 30 script to only "run the model"; specifically:

The plots and benchmarks should go to another report / file, which will allow a clearer comparison.

I was thinking it would be nice to make a demonstration "31" file that has an alternative model, and an accompanying report that compares the results between 30 and 31. That way someone could just pick up from there and have a facility for comparison.

For the 31 demonstration file I was thinking it would be nice to simply use the past "average" value. This would be similar to how the baselines look in Kaggle competitions. Rather than having a "submission", the user could compile results in the report. To guard against overfitting, we would check that the results make sense on even more recent data.
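Something along these lines, with placeholder column names (`License`, `criticalFound`) rather than whatever the repo actually uses:

```r
## Sketch of the "past average" baseline. Assumes a data.frame `history` of prior
## inspections with an establishment id `License` and a 0/1 outcome `criticalFound`,
## plus a data.frame `current` of establishments to score. All names are placeholders.

past_average_baseline <- function(history, current) {
  ## Each establishment's historical critical-violation rate
  rates <- aggregate(criticalFound ~ License, data = history, FUN = mean)
  names(rates)[2] <- "score"

  ## Establishments with no history fall back to the citywide average
  scored <- merge(current, rates, by = "License", all.x = TRUE)
  scored$score[is.na(scored$score)] <- mean(history$criticalFound)
  scored
}
```

The resulting score column could then go through the same comparison report as the scores from the 30 model.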

This might be a separate issue, but perhaps it would be nice to publish the 40 prediction scripts. The model uses the food inspection history, but the prediction uses current business licenses as the basis. Ultimately the logic in the prediction script would be important for testing on new samples, especially if this is going to be an ongoing evaluation. @tomschenkjr - you may have some thoughts on this?
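For the prediction step I'm picturing something roughly like this; `fitted_model` and the `licenses` columns are placeholders, and the real script would need the same feature engineering as the training data:

```r
## Rough sketch of a prediction step: score the current business licenses with an
## already-fitted glm, then sort the results into the next inspection schedule.

licenses$score <- predict(fitted_model, newdata = licenses, type = "response")

## The next inspection schedule is just the licenses sorted by predicted risk
next_schedule <- licenses[order(-licenses$score), ]
```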

geneorama commented 5 years ago

@orborde or @fgregg Do you have any recommendations for best practices for model comparison?

As I mentioned in @orborde's pull request, the format of the food inspection data has changed dramatically as of last year, and there is a need to reconsider the model.

orborde commented 5 years ago

I don't have any "best practices" in mind offhand. I do think that generating an inspection schedule and simulating to see how quickly that schedule finds violations, or how efficiently (in terms of number of violations per inspection performed), is a solid approach.

Note that you'll need to be careful not to directly evaluate your inspection schedule on the data used to train the model generating that schedule. See https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
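A minimal sketch of that kind of split, assuming an inspection date column (the column name and cutoff here are placeholders):

```r
## Time-based split: train on older inspections, evaluate the schedule on newer ones.
## Assumes a data.frame `dat` with an `Inspection_Date` column of class Date.

cutoff <- as.Date("2014-07-01")
train  <- dat[dat$Inspection_Date <  cutoff, ]
test   <- dat[dat$Inspection_Date >= cutoff, ]

## Fit the model on `train` only, then build and evaluate the schedule on `test`,
## so the schedule is never scored against the data that produced it.
```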

Beyond that, I don't know enough about your problem space to give you more specific advice. Let me know what you wind up trying, though!