Chicago / food-inspections-evaluation

This repository contains the code to generate predictions of critical violations at food establishments in Chicago. It also contains the results of an evaluation of the effectiveness of those predictions.
http://chicago.github.io/food-inspections-evaluation/

Train/test data includes schools, hospitals, and other facility types #106

Open vingkan opened 5 years ago

vingkan commented 5 years ago

According to the paper, inspections of hospitals and schools should not be included in the model train/test data. However, cross-referencing the model data with food inspection records from the Chicago data portal suggests that the model train/test data includes many different facility types, including hospitals and schools.

You can reproduce my Jupyter notebook that checks the facility types by launching it in Binder:

Binder

@tomschenkjr pointed out that there are at least two locations in the code that should filter out other types of facilities:

This still leaves 1003 inspections with facility type listed as "Other". After cross-referencing with the data portal, 994 inspections appear to be facilities other than restaurants or grocery stores.
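The cross-referencing step can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: the sample rows are made up, and the `inspection_id` and `facility_type` field names are assumptions about how the portal extract is keyed.

```python
from collections import Counter

# Hypothetical sample rows; the real check runs on the full datasets.
model_inspection_ids = ["1001", "1002", "1003", "1004"]
portal_records = [
    {"inspection_id": "1001", "facility_type": "Restaurant"},
    {"inspection_id": "1002", "facility_type": "School"},
    {"inspection_id": "1003", "facility_type": "Hospital"},
    {"inspection_id": "9999", "facility_type": "Grocery Store"},
]


def facility_type_counts(model_ids, portal_rows):
    """Count portal facility types for the inspections used by the model."""
    by_id = {row["inspection_id"]: row.get("facility_type") for row in portal_rows}
    counts = Counter()
    for inspection_id in model_ids:
        # Inspections absent from the portal extract, or with an empty
        # facility type, are tallied under a placeholder label.
        counts[by_id.get(inspection_id) or "(missing)"] += 1
    return counts


counts = facility_type_counts(model_inspection_ids, portal_records)
for facility_type, n in counts.most_common():
    print(facility_type, n)
```

Tallying the unmatched inspections under their own label keeps them visible instead of silently dropping them from the comparison.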

There also appear to be 11 inspections in the model train/test data that did not have a facility type in their record from the data portal query. Here is an excerpt from my query showing how I filtered the data portal records (SoQL):

--
WHERE inspection_type = "Canvass"
AND inspection_date >= "2011-01-01"
AND inspection_date <= "2014-11-01"
--
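For anyone checking a local extract instead of querying the portal, the same filter can be approximated in Python. This is a sketch under assumptions: the sample records are invented, and the timestamp format (an ISO date followed by a time part) is assumed rather than taken from the portal documentation.

```python
from datetime import date

START = date(2011, 1, 1)
END = date(2014, 11, 1)


def matches_query(record):
    """Mirror the WHERE clause: canvass inspections within the date range."""
    # Keep only the YYYY-MM-DD prefix so a trailing time part is ignored.
    inspected = date.fromisoformat(record["inspection_date"][:10])
    return record["inspection_type"] == "Canvass" and START <= inspected <= END


# Hypothetical records for illustration only.
records = [
    {"inspection_type": "Canvass", "inspection_date": "2013-05-20T00:00:00.000"},
    {"inspection_type": "Complaint", "inspection_date": "2013-05-20T00:00:00.000"},
    {"inspection_type": "Canvass", "inspection_date": "2015-02-03T00:00:00.000"},
]
kept = [r for r in records if matches_query(r)]
print(len(kept), "of", len(records), "records match")
```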
geneorama commented 5 years ago

I get an error when I launch the "binder" link above.

Is there a discrepancy between how food establishments are classified in the data portal and how they are classified in their business license?

We join information about the business license at the time of inspection to the record of the inspection. We then filter the records to retain only "Retail Food Establishment" records.

As you noticed, a lot of business types (like schools and hospitals) are subject to food inspections. It's important to note that businesses have many license types. For example, some have liquor licenses alongside their retail food license, and others do not.

To restate, we only use inspections whose associated business license description is "Retail Food Establishment".

As for the "other" license types you're noticing, perhaps you're not looking at the licenses at the time of inspection? It could be that they dropped their food-related license(s). For example, maybe it's a book shop that once also served or sold food, but now just sells books.
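The "license at the time of inspection" idea can be illustrated with a small sketch. This is not the repository's actual join: the field names (`start`, `end`, `license_description`) and the sample license history are hypothetical, chosen to mirror the book shop example.

```python
from datetime import date

# Hypothetical license history for a single business.
licenses = [
    {"license_description": "Retail Food Establishment",
     "start": date(2011, 1, 1), "end": date(2012, 12, 31)},
    {"license_description": "Limited Business License",
     "start": date(2013, 1, 1), "end": date(2014, 12, 31)},
]


def licenses_at(inspection_date, license_rows):
    """Return the descriptions of licenses active on the inspection date."""
    return [row["license_description"] for row in license_rows
            if row["start"] <= inspection_date <= row["end"]]


def is_retail_food_at(inspection_date, license_rows):
    """True if a Retail Food Establishment license was active that day."""
    return "Retail Food Establishment" in licenses_at(inspection_date, license_rows)


print(is_retail_food_at(date(2012, 6, 1), licenses))
print(is_retail_food_at(date(2014, 6, 1), licenses))
```

Under these assumptions, an inspection in 2012 would be retained, while the same business looked up by its current license would appear to be a non-food facility.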

It's quite possible that you've found something, and I'll take a deeper look when we refactor the code, which should be happening in the next few months. The filtering is a little messy, and I think that this is something which will be fixed in the upcoming edits.

vingkan commented 5 years ago

Hi @geneorama, I have updated the Binder link above. Here it is again:

Binder

It may take a while to load. In case there are still issues, here is a copy of the notebook.

geneorama commented 5 years ago

Sorry, I had a hard time following the Python, and I wasn't working on this project at the time. Now that I'm back in it, I think I see what's going on.

We filter the business licenses where LICENSE_DESCRIPTION is "Retail Food Establishment". Then we also use information about the facility_type, which comes from the food inspection data.

My understanding is that these are places that serve prepared food. However, we do a lot of inspections in other places that sell packaged food or have kitchens.

I think that some of these retail food places are selling prepared foods in places like grocery stores. We do model the inspection of that prepared food, but we do not model the inspection of the packaged food, which is a separate license.

As I'm working on 2.0 I want to dig into this and be sure of the assumptions, so I'm glad you asked. The first time we did this I relied very heavily on prior art, but this time I want to understand it a bit more.

Before my talk at UseR! 2016, I performed some analysis to see what kinds of places are being inspected, to get a list of all licenses that are inspected. As I recall, it wasn't as simple as I had hoped, and I couldn't find a clear-cut rule for "this is a place that would get inspected". The best regex I found was searching for these terms in the license description: "Retail Food|Consumption|Caterer|Food|Child". Then I grouped them together. My final count looked like this:

LICENSE_DESCRIPTION                        N
Retail Food Establishment              10910
Incidental Activity                     2139
Wholesale Food Establishment             545
Caterer                                  192
Shared Kitchen                           205
Mobile Food License                       75
Children’s Services Facility License     817
Special Events                            31


This is old data, and I'm not sure how it would hold up with new license designations. Digging into that now.
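The regex search described above can be sketched in Python as follows. The pattern is the one quoted in the comment; the list of descriptions is a small illustrative sample, and the grouping step that produced the final counts is not reproduced here.

```python
import re
from collections import Counter

# Pattern quoted in the comment above; matching is case-sensitive.
PATTERN = re.compile(r"Retail Food|Consumption|Caterer|Food|Child")

# Illustrative sample of license descriptions, not real data.
descriptions = [
    "Retail Food Establishment",
    "Retail Food Establishment",
    "Wholesale Food Establishment",
    "Caterer",
    "Mobile Food License",
    "Children's Services Facility License",
    "Limited Business License",
]

# Keep only descriptions containing one of the search terms, then count them.
matched = Counter(d for d in descriptions if PATTERN.search(d))
for description, n in matched.most_common():
    print(description, n)
```

Note that the "Retail Food" alternative is redundant in practice, since any description containing it also matches the plain "Food" alternative.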