Chicago / food-inspections-evaluation

This repository contains the code to generate predictions of critical violations at food establishments in Chicago. It also contains the results of an evaluation of the effectiveness of those predictions.
http://chicago.github.io/food-inspections-evaluation/

Facility for model comparison #64

Open fgregg opened 9 years ago

fgregg commented 9 years ago

When developing alternate models, a final comparison script will facilitate evaluation of the models against one another.

orborde commented 9 years ago

Looking at the code and whitepaper, it seems that you evaluated the model by using the glm output to create an inspection "schedule" (a list of the order in which to conduct inspections) and then analyzing how quickly that schedule located the violations, as opposed to looking at the model confusion matrix or other traditional measures of model performance.

So an evaluation script should probably take the "schedule" as input and rerun the analyses in the white paper to compute some metrics. I'm planning on hacking one together today in the course of trying some other ML techniques on this dataset.
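Roughly what I have in mind, with placeholder column names (`score`, `criticalFound`) since I haven't mapped these to the actual field names yet:

```r
## Sketch of a schedule evaluation. Assumes a data.frame `dat` with one row per
## inspection, a model score column `score`, and a 0/1 outcome `criticalFound`;
## the names are placeholders, not necessarily the repo's actual columns.

evaluate_schedule <- function(dat, score_col = "score", outcome_col = "criticalFound") {
  ## The "schedule": inspections ordered from highest to lowest predicted risk
  schedule <- dat[order(-dat[[score_col]]), ]

  ## Cumulative count of violations found as we work down the schedule
  found <- cumsum(schedule[[outcome_col]])
  n     <- seq_len(nrow(schedule))

  data.frame(
    inspections_done = n,
    violations_found = found,
    frac_found       = found / sum(schedule[[outcome_col]]),  # how quickly violations are found
    hit_rate         = found / n                              # violations per inspection performed
  )
}
```

The same function can be run on the observed inspection order to get a comparison curve for any candidate schedule.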

geneorama commented 9 years ago

@orborde you are exactly correct. The individual glm scores are used to sort the inspections into a schedule, and that schedule is more important than the individual scores. I don't know of a way to directly optimize on the schedule performance. Hopefully optimizing the scores results in a better schedule. (btw, thanks for introducing the word schedule. That's a useful addition to the vocabulary of this project.)

geneorama commented 9 years ago

Sorry that this has been taking so long; I've been busy with a few other things.

Here's an update on what I'm thinking for the plan:

Refactor the 30 script to only "run the model"; specifically:

The plots and benchmarks should go to another report / file, which will allow a clearer comparison.

I was thinking it would be nice to make a demonstration "31" file that has an alternative model, and an accompanying report that compares the results between 30 and 31. That way someone could just pick up from there and have a facility for comparison.

For the 31 demonstration file I was thinking it would be nice to simply use the past "average" value. This would be similar to how the baselines look in Kaggle competitions. Rather than having a "submission", the user could compile results in the report. To guard against overfitting, we would check that the results make sense on even more recent data.
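Something along these lines, with placeholder column names (`License`, `criticalFound`) rather than whatever the repo actually uses:

```r
## Sketch of the "past average" baseline. Assumes a data.frame `history` of prior
## inspections with an establishment id `License` and a 0/1 outcome `criticalFound`,
## plus a data.frame `current` of establishments to score. All names are placeholders.

past_average_baseline <- function(history, current) {
  ## Each establishment's historical critical-violation rate
  rates <- aggregate(criticalFound ~ License, data = history, FUN = mean)
  names(rates)[2] <- "score"

  ## Establishments with no history fall back to the citywide average
  scored <- merge(current, rates, by = "License", all.x = TRUE)
  scored$score[is.na(scored$score)] <- mean(history$criticalFound)
  scored
}
```

The resulting score column could then go through the same comparison report as the scores from the 30 model.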

This might be a separate issue, but perhaps it would be nice to publish the 40 prediction scripts. The model uses the food inspection history, but the prediction uses current business licenses as the basis. Ultimately the logic in the prediction script would be important for testing on new samples, especially if this is going to be an ongoing evaluation. @tomschenkjr - you may have some thoughts on this?
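For the prediction step I'm picturing something roughly like this; `fitted_model` and the `licenses` columns are placeholders, and the real script would need the same feature engineering as the training data:

```r
## Rough sketch of a prediction step: score the current business licenses with an
## already-fitted glm, then sort the results into the next inspection schedule.

licenses$score <- predict(fitted_model, newdata = licenses, type = "response")

## The next inspection schedule is just the licenses sorted by predicted risk
next_schedule <- licenses[order(-licenses$score), ]
```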

geneorama commented 5 years ago

@orborde or @fgregg Do you have any recommendations for best practices for model comparison?

As I mentioned in @orborde's pull request, the format of the food inspection data has changed dramatically as of last year, and there is a need to reconsider the model.

orborde commented 5 years ago

I don't have any "best practices" in mind offhand. I do think that generating an inspection schedule and simulating to see how quickly that schedule finds violations, or how efficiently (in terms of number of violations per inspection performed), is a solid approach.

Note that you'll need to be careful not to directly evaluate your inspection schedule on the data used to train the model generating that schedule. See https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets
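A minimal sketch of that kind of split, assuming an inspection date column (the column name and cutoff here are placeholders):

```r
## Time-based split: train on older inspections, evaluate the schedule on newer ones.
## Assumes a data.frame `dat` with an `Inspection_Date` column of class Date.

cutoff <- as.Date("2014-07-01")
train  <- dat[dat$Inspection_Date <  cutoff, ]
test   <- dat[dat$Inspection_Date >= cutoff, ]

## Fit the model on `train` only, then build and evaluate the schedule on `test`,
## so the schedule is never scored against the data that produced it.
```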

Beyond that, I don't know enough about your problem space to give you more specific advice. Let me know what you wind up trying, though!