Submission: Group 18: newyork_restaurant_grading

Submitting authors: @snesunil, @yukunzGIT, @lzung, @nik11susan

Repository: https://github.com/UBC-MDS/newyork_restaurant_grading

Report link: https://ubc-mds.github.io/newyork_restaurant_grading/doc/ny_rest_report.html

Abstract/executive summary:

In this project, we build a classification model using logistic regression and support vector machines that uses health inspection data to predict whether a restaurant will be graded A (i.e., the restaurant is clean, up to code, and free of violations) or F (i.e., the restaurant has some issues that must be fixed or is a public risk on the verge of closure).

Our best model was a balanced logistic regression with a C value of 22.219381, 117 text features and 17 categorical features. On a test set of 10,000 samples, it achieved an F1 score of 0.999 and precision and recall scores of 0.999 each, indicating that the model is highly effective at classifying both grade A and grade F restaurants. We also computed the area under the receiver operating characteristic curve, which was 1.00. This is the maximum possible value and further supports that the model's predictions are close to 100% correct.

We chose the DOHMH New York City Restaurant Inspection Results dataset, sourced from the NYC OpenData Portal. It is retrieved from the tidytuesday repository by Thomas Mock, and can be sourced here. The original data set can be found here. It contains the violation citations from every inspection conducted for restaurants in New York City from 2012 to 2018. Each row represents a restaurant that has been assessed by a health inspector, including information about the business such as the restaurant name, phone number, location and type of cuisine, as well as the details of the inspection. The restaurants can be assigned an official grade of A, B, or C; otherwise they are assigned Z or P for pending review.

Editor: @flor14

Reviewers: Daniel Merigo Silos, Markus Nam, Andy Wang, Raul Aguilar Lopez

[ ] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: @markusnam (Markus Nam)

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 3 hours

Review Comments:

OVERALL: Well done. The visualization part is amazing, with a lot of variety. I like those plots a lot. I hope the model can be developed further so that it can identify Grade B and C and answer more interesting research questions in the future.

[General] I believe the high accuracy is due to the high correlation between the score feature and the target grade. Drilling into the raw data, as shown in the boxplot in the report, there are very few cases in which a Grade A restaurant has a score >20. So intuitively, score by itself could already be a good predictor, but I have some reservations about whether it should be used. It is similar to predicting a property's price from its value per sq ft. It would be good to know the model performance without the score feature.
[Final Report] Before and after the Research question section, only Grade A, B and C are mentioned, but Grade F suddenly appears in the question. This could confuse readers. It would be better to state explicitly from the start that you are grouping Grade B and C into Grade F.
[Final Report] In the Exploratory Data section, it mentions that 151,451 entries come with a filled value, but that number does not match the sum of the two numbers in the table. I understand that the table shows data from the training set only. I would suggest moving the data splitting paragraph before Table 1.1 and stating clearly whether the numbers come from the whole data set, only the training set, or only the test set. You may also put the counts for all grade categories in your data file (i.e. A, B, C, P, Z and not yet graded) in the table for completeness, and highlight that only Grade A, B and C (with B and C grouped into F) are your focus.
[Final Report] In the Exploratory Data section, there is a minor formatting issue for 300,000 (you typed 3,00,000).
[Final Report] In the Interpretation of the Results & Discussion section, based on the scores table, the balanced models perform better than their counterparts but it would be difficult to say they are "much better".
[Final Report] In the Interpretation of the Results & Discussion section, based on the scores table, it seems to me that svc_bal is indeed better than logreg_bal but you concluded that logreg_bal is the best.
[Final Report] Suggestions: (1) write your caption in the .Rmd level instead of generating it as part of the image; (2) use R code in .Rmd to read csv for tables instead of using images. Benefits: (a) more flexible if you want to move the tables / images around; (b) smaller file size.
[Other minor points] (1) In the README, you may want to warn users that step (7) may take a couple of minutes to run. I understand that a warning is printed in the terminal, but subsequent system warnings push that message out of view in less than 1 sec; (2) You may want to check that the output folder for test_df.csv exists as well in pre_process_nyc_rest.py.
[Future direction] (1) As mentioned in point 1, you may take out the score feature and rerun your analysis; (2) Feature selection is something you may explore as well; (3) You may also examine the coef_ values from your model, as something interesting may be found there.
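
Minor point (2) above could be handled along the following lines. This is only a sketch: `write_csv_safely` and its arguments are hypothetical names, not the ones used in pre_process_nyc_rest.py. The idea is simply to create the parent folder before writing a file such as test_df.csv.

```python
import csv
from pathlib import Path


def write_csv_safely(rows, header, out_path):
    """Write rows to out_path, creating the parent folder if it is missing.

    Hypothetical helper for illustration; not taken from the repository.
    """
    out_path = Path(out_path)
    # Create the output folder (and any missing parents); no error if it exists.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return out_path
```

Using `exist_ok=True` means the same call works whether or not the folder was already created by an earlier step.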

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Andy Wang

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

OVERALL: Great work on the EDA part. The plots show where the target values fall in the distribution. The results of your model are also very convincing, and I hope you will get more time to work on feature engineering to improve your model.

You only have an "environment.yml" file, but you should also include a dependencies section in the README that lists the dependencies.
There are some inconsistencies in the introduction section that confused me. You mentioned '“good” or “poor” (in our case, Grade A vs.Grade B/C)' but you are trying to classify Grade A vs Grade F, as stated in the Research question. I see that you grouped Grade B and C into Grade F in the pre-processing, but you should mention this to avoid confusion.
I would like to see feature importances for your trained model, since they can show which features contribute the most to the predictive power of your model. I also suggest creating a correlation plot to see how each feature correlates with the target value. The reason is that the score feature very clearly indicates how each restaurant will be graded. Moreover, in your research question, you mentioned some interesting sub-questions. To answer them, I think you need to look at feature importance and feature correlation and explore them in the future.
In the result section, you chose the balanced logistic regression as the best model, but the unbalanced logistic regression seems to perform much better than the balanced one.
In the future work section, you could examine the coefficient and correlation for each feature and do some feature engineering to answer the questions listed in the research questions section.
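
One quick way to probe the concern about the score feature is a score-only baseline: if a single threshold on the score reproduces almost all labels, the fitted model is mostly recovering the grading rule. The sketch below assumes each record has a numeric score and a binary grade of "A" or "F". The function names, the toy records, and the cutoff of 13 are illustrative assumptions, not values taken from the report.

```python
def score_rule_predict(score, cutoff=13):
    """Predict 'A' when the inspection score is at or below the cutoff, else 'F'.

    The cutoff is an assumed value for illustration.
    """
    return "A" if score <= cutoff else "F"


def baseline_accuracy(records, cutoff=13):
    """Fraction of (score, grade) records the one-feature rule labels correctly."""
    if not records:
        raise ValueError("records must not be empty")
    hits = sum(score_rule_predict(s, cutoff) == g for s, g in records)
    return hits / len(records)


# Toy (score, grade) pairs, made up for illustration only.
toy_records = [(5, "A"), (12, "A"), (13, "A"), (20, "F"), (35, "F"), (10, "F")]
print(f"score-only accuracy: {baseline_accuracy(toy_records):.2f}")
```

If a baseline like this lands close to the reported test scores, rerunning the pipeline without the score feature would show how much the remaining features contribute.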

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: AguilarRaul (Raul Aguilar)

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 3 hours

Review Comments:

GENERAL COMMENTS: I really enjoyed being able to read and learn from your project. The fact that it addresses such a recent topic that had as much impact as the restrictions caused by the COVID-19 pandemic made it very attractive. Furthermore, your EDA tells a story which is great. Regarding the project structure, everything is well organized and complete, it is a great example for the other teams, including mine, congratulations.

[Results] Regarding the charts, tables contain captions but some of the graphics don't. Perhaps it would be worth reviewing them to standardize so all charts will be self-explanatory.
[Report, introduction and research question] there is an inconsistency in the definition of the classifications for restaurants, perhaps it would be worth explaining why the same three-level classification that is used during the health inspection was not used (A, B and C). Also, the definition of two levels is changed from A-B/C to A-F but it is not explained why (until the EDA). This is just a small detail that is worth checking to avoid confusion.
[Report, introduction] The introduction says that the objective is to classify restaurants as good or bad according to their health and protection measures against COVID-19; however, at the end there is a reference to restaurant quality, which can be interpreted as the quality of the food itself. It would be worth rephrasing the last line of the introduction to avoid confusion.
[Report, EDA Table 1.1] The sum of the levels does not add up to 151,451. This may be due to the observations with P or Z values, but it is not clear.
[Report, Interpretation] It is mentioned that the data was downsampled to reduce training time, it would be good to briefly mention which method was used and how this does not generate any bias in the model.
[Report, hyper-parameter optimization] It seems that the score improvement is marginal after hyper parameter optimization, it would be nice to compare fit and score times with the previous model to understand if it is worth using this new model or if this configuration has an unwanted effect such as fit/score time increase.
[Report, Interpretation] I agree that it is striking how well the model classifies both true positives and true negatives. I don't think more feature engineering is necessary; rather, it is worth exploring why the model classifies this well. I advise going back to review the correlation between variables and carrying out a feature importance analysis.
[Report] There is no follow-up on some of the interesting sub-questions; it may be a good idea to list them as part of the Statement of Future Directions.
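
The fit and score time comparison suggested above could be collected with a small timing helper. This sketch assumes only that each model exposes `fit(X, y)` and `score(X, y)` methods, as scikit-learn estimators do. `time_fit_and_score` and `MajorityClassifier` are hypothetical stand-ins, not code from the repository.

```python
import time
from collections import Counter


def time_fit_and_score(model, X_train, y_train, X_test, y_test):
    """Return the test score plus wall-clock fit and score times in seconds."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    score = model.score(X_test, y_test)
    score_time = time.perf_counter() - start
    return {"score": score, "fit_time": fit_time, "score_time": score_time}


class MajorityClassifier:
    """Toy stand-in model: always predicts the most common training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def score(self, X, y):
        return sum(label == self.label_ for label in y) / len(y)


result = time_fit_and_score(
    MajorityClassifier(), [[1], [2], [3]], ["A", "A", "F"], [[4], [5]], ["A", "F"]
)
print(result)
```

Running the same helper on the model before and after hyperparameter optimization would put the marginal score gain next to any change in fit or score time.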

Amazing job!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: DMerigo (Daniel Aurelio Merigo Silos)

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

[General Comments] Very nice scripts! The code is well done, ordered, and commented, and there are enough tests. I really liked them and will add more tests to mine, inspired by your diligence. The topic is interesting, especially "post" COVID. The EDA is informative and interesting, with a good driving thread and adequate plots.

[Final Report] I liked the introduction, and the justification for the project is valid. My only comment is that the kind of grading is not apparent from the title. Reading the rest of the document, it appears that the grading refers to Health and Sanitation, but I would change the title of the project to something more explicit to make this clear from the start. I would also add a short paragraph explaining what a violation entails (I understand, as you mention, that the threshold for a critical violation is not clear, but a brief explanation of "violation" would help readers understand the reasoning behind a Grade, especially people outside the USA).

[Plots and Tables] The plots are well presented and explained, but the tables tend to have odd formatting in the titles and seem squished between blocks of text. A little space would make them look better, and concise titles would help as well. Table 1.1 shows the counts for grades A and F, yet they do not sum to 151,451. Are the missing values the restaurants with grades B and C? If so, that should be shown explicitly to avoid confusion and to make the analysis more reproducible and extensible (in case someone wants to create a model capable of predicting all grades).
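
A count table that covers every grade category could be built roughly as follows. The sketch assumes the grade column holds the values A, B, C, P, or Z, or is empty for rows that are not yet graded. The function name, the "Not yet graded" label, and the toy values are illustrative and not taken from the data.

```python
from collections import Counter


def grade_summary(grades):
    """Count every raw grade and the A/F target derived from A, B, and C.

    Empty or missing grades are reported as 'Not yet graded'.
    """
    raw = Counter(g if g else "Not yet graded" for g in grades)
    target = Counter()
    for g in grades:
        if g == "A":
            target["A"] += 1
        elif g in ("B", "C"):
            # B and C are grouped into the single class F.
            target["F"] += 1
    return raw, target


# Toy grade column, made up for illustration only.
toy_grades = ["A", "A", "B", "C", "P", "Z", "", None, "A"]
raw_counts, target_counts = grade_summary(toy_grades)
print(dict(raw_counts))
print(dict(target_counts))
```

Showing the raw counts next to the A/F counts would let a reader see directly which rows fall outside the modelled classes.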

[Modelling] When you mention downsampling, what do you mean? How many samples are you using in the actual modelling process? Do you have a comparison of training time between the original available data and the downsampled dataset to justify your downsampling?
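
For reference, one common way to downsample without changing the class balance is to draw the same fraction from each class with a fixed seed. This is a generic sketch and not necessarily the method used in the report; `stratified_downsample`, the fraction, and the seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_downsample(rows, labels, fraction, seed=123):
    """Keep roughly `fraction` of the rows from each class, preserving class proportions."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    keep = []
    for idxs in by_class.values():
        # Keep at least one row per class so no class disappears.
        n_keep = max(1, round(len(idxs) * fraction))
        keep.extend(rng.sample(idxs, n_keep))
    keep.sort()  # preserve the original row order
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 80 'A' rows and 20 'F' rows.
toy_rows = list(range(100))
toy_labels = ["A"] * 80 + ["F"] * 20
sub_rows, sub_labels = stratified_downsample(toy_rows, toy_labels, fraction=0.1)
print(len(sub_rows), sub_labels.count("A"), sub_labels.count("F"))
```

Stating the sampled size and the per-class counts in the report, as this helper makes easy to print, would answer the question of how many samples enter the modelling step.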

You mention that the best model is a balanced logistic regression, yet the validation scores favour the balanced SVC model in almost every instance. I understand that the difference is on the order of 0.01, but it still goes against your conclusion.

[Conclusions] The conclusions seem in line with your findings (mind the observation about the best model), and hyperparameter optimization seems to bring no further improvement. I noticed that there is no mention of the sub-questions posed in the introduction. Is that something for the future or just a thinking exercise?

[Final Word] Well done! I really liked the project (especially the well-written and well-tested scripts and the thorough EDA). The comments on the explanations in the final report are only meant to make the information clearer and broaden the appeal of the experiment. Congratulations on your efforts!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Five pieces of feedback that have been implemented:

(Milestone 2) download_csv does not have author name and date - Commit URL : https://github.com/UBC-MDS/newyork_restaurant_grading/commit/67a09a3d649789ee820937c4f297ed02526cc4d2 File Changed : src/download_csv.py
(Milestone 2) table caption too small - Commit URL : https://github.com/UBC-MDS/newyork_restaurant_grading/commit/0db5cc1059e8b2c2462070cdf0e9154b225cd383 File Changed : src/nyc_rest_analysis
(Milestone 2) remove ipykernel from environment yaml file - Commit URL : https://github.com/UBC-MDS/newyork_restaurant_grading/commit/274499ba33f28729e04c1f25995aaa5990275da9 File Changed : environment.yaml
(Milestone 3) make clean does not work because the spacing is wrong - Commit URL : https://github.com/UBC-MDS/newyork_restaurant_grading/commit/d8cf231c4a0652dc32a5106204de7d7a127c7691 File Changed : Makefile
(Milestone 3) Missing documentation for all and clean - Commit URL : https://github.com/UBC-MDS/newyork_restaurant_grading/commit/d8cf231c4a0652dc32a5106204de7d7a127c7691 File Changed : Makefile

UBC-MDS / data-analysis-review-2022