UBC-MDS / data-analysis-review-2022


Submission: Group 19 – Car Acceptability Predictor #6

Open ataciuk opened 1 year ago

ataciuk commented 1 year ago

Submitting authors: @ataciuk @louiewang820 @pengzh313 @lisaseq

Repository: https://github.com/UBC-MDS/Car-Acceptability-Prediction Report link: https://github.com/UBC-MDS/Car-Acceptability-Prediction/blob/main/doc/car_popularity_prediction_report.md Abstract/executive summary:

Our research question: Given attributes of car models from the 90s, can we predict how popular each car model will be?

Expecting growing demand for used vehicles in our business, we were interested in building a classification machine learning model to predict the popularity of a given used car. The training data came from a publicly available 1990s Car Evaluation dataset with 1728 observations, 6 categorical features, and 1 categorical target. We performed data reading, data splitting, and Exploratory Data Analysis (EDA) in Python. An OrdinalEncoder transformer was then applied to preprocess the 6 features, and we used four scikit-learn classifiers (DummyClassifier, DecisionTreeClassifier, RandomForestClassifier, and MultinomialNB) in cross-validation with balanced_accuracy as the scoring metric. RandomForestClassifier had the best validation score, so we performed hyperparameter optimization on it. Finally, the optimized classifier was applied to the test data, yielding an optimistic test score of 0.965. Scripts to run the analysis were created with Docopt.
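The workflow described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration only: the column names come from the public UCI Car Evaluation dataset, a tiny synthetic sample stands in for the real 1728-row file, and all parameter values are assumptions rather than the project's actual code.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Assumed column names from the UCI Car Evaluation dataset;
# a small synthetic sample stands in for the real data.
features = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]
data = pd.DataFrame({
    "buying":   ["low", "high", "med", "low", "vhigh", "med"] * 20,
    "maint":    ["low", "med", "high", "vhigh", "low", "med"] * 20,
    "doors":    ["2", "3", "4", "5more", "2", "4"] * 20,
    "persons":  ["2", "4", "more", "2", "4", "more"] * 20,
    "lug_boot": ["small", "med", "big", "small", "med", "big"] * 20,
    "safety":   ["low", "med", "high", "low", "med", "high"] * 20,
    "class":    ["unacc", "acc", "good", "unacc", "acc", "vgood"] * 20,
})

# 90% / 10% train-test split, as in the report
train_df, test_df = train_test_split(data, test_size=0.1, random_state=522)
X_train, y_train = train_df[features], train_df["class"]

# OrdinalEncoder on all six categorical features, then one of the classifiers
preprocessor = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     features))
pipe = make_pipeline(preprocessor, RandomForestClassifier(random_state=522))

# Cross-validation scored with balanced accuracy, as the report describes
scores = cross_validate(pipe, X_train, y_train, cv=5,
                        scoring="balanced_accuracy")
print(scores["test_score"].mean())
```

In the actual project the same cross-validation loop would be repeated for each of the four candidate classifiers before selecting the best one.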

Editor: @flor14 Reviewers: Jakob Thoms, Chenyang Wang, Yingxin Song, Mike Guron

wakesyracuse7 commented 1 year ago

Data analysis review checklist

Reviewer: @wakesyracuse7

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

  1. The proposal is easy to understand and well structured, and I like the "zero-emission" idea from the introduction. Green ideas may be the solution to many problems in the current world, so the question is meaningful.

  2. Although it is not a big deal, the plots generated from the EDA could be refined, for example by including a "Figure 1" caption under each plot so the reader can clearly and quickly find the plot the text refers to, and by adjusting the plot sizes.

  3. It is not a big deal, but the code style/format differs a little from script to script. I understand this is because the scripts were written by different people, but it may be better to follow the same style guidelines throughout.

  4. I like the limitations and future improvements part of the report. Indeed, as the introduction says, since cars will eventually be zero-emission, people will be interested in used cars as collectibles. However, this dataset only consists of 1728 observations collected in the 90s, so it may be better to combine it with another dataset of cars from the 80s and 70s for the question of interest. After all, I think cars from the 80s and 70s are better suited for collecting.

  5. All the file names, folder names, and function names are meaningful and make it easy to figure out the contents, and I appreciated the test functions, which are good for reproducibility.


Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

YXIN15 commented 1 year ago

Data analysis review checklist

Reviewer: @YXIN15

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

j99thoms commented 1 year ago

Data analysis review checklist

Reviewer: @J99thoms

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

random_search = RandomizedSearchCV(pipe_rf, param_dist, cv=5, n_iter=100,
        n_jobs=-1, verbose=1, scoring="balanced_accuracy", random_state=522)

and

final_model = make_pipeline(preprocessor, RandomForestClassifier(
        random_state=522, n_estimators=best_n_estimators, max_depth=best_max_depth))
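For context, the two snippets above connect through the fitted search object, whose best_params_ supply the values for the final model. A minimal self-contained sketch of that step (the pipeline, the parameter grid, and names such as pipe_rf are assumptions here, not the project's exact code; make_pipeline names steps after the lowercased class name, hence the randomforestclassifier__ prefix):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a hypothetical pipeline standing in for the project's pipe_rf
X, y = make_classification(n_samples=200, random_state=522)
pipe_rf = make_pipeline(StandardScaler(),
                        RandomForestClassifier(random_state=522))

# Hyperparameters are addressed with the auto-generated step-name prefix
param_dist = {
    "randomforestclassifier__n_estimators": [10, 50, 100],
    "randomforestclassifier__max_depth": [3, 5, None],
}
random_search = RandomizedSearchCV(
    pipe_rf, param_dist, cv=5, n_iter=5, n_jobs=-1,
    scoring="balanced_accuracy", random_state=522)
random_search.fit(X, y)

# Pull the tuned values out of best_params_ to build the final model
best_n_estimators = random_search.best_params_[
    "randomforestclassifier__n_estimators"]
best_max_depth = random_search.best_params_[
    "randomforestclassifier__max_depth"]

final_model = make_pipeline(StandardScaler(), RandomForestClassifier(
    random_state=522, n_estimators=best_n_estimators,
    max_depth=best_max_depth))
final_model.fit(X, y)
```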

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

mikeguron commented 1 year ago

Data analysis review checklist

Reviewer: @mikeguron

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:


  1. Similar to the reviewer above, while doing my initial read-through of your project I also noticed that your README has a section that repeats some details regarding the EDA (see below):

"The first thing we will do is split our data into test and training sets (with a 90% to 10% split). Next we will perform some exploratory data analysis on the train dataset, by looking to see if any features contain any missing or null values, their respective data types, the split in our target categories (to watch out for any class imbalances).

We will then perform exploratory data analysis (EDA). We will review the distribution of our target class (to see if we have a class imbalance)..."

While not a significant error, the README file is one of the first things that readers see and read when they view your project so it would aid in making a great first impression if this section were revised.

  2. Upon reading your Team Code of Conduct, it appears that some sections may have been meant to be formatted as a bullet-point list; however, the different points all appear as one long sentence/paragraph (see below for an example), which hinders the readability of the code of conduct:

"Using welcoming and inclusive language Listening when others are talking No bad ideas! Considering others perspectives Being respectful of differing viewpoints and experiences Gracefully accepting constructive criticism Focusing on what is best for the team Showing empathy to others Apologizing respectfully Using each others' preferred pronouns"

  3. Some of the EDA figures in the Results & Discussion section of the report could be developed further to adhere to proper data visualization guidelines and improve their readability. For example, Figure 1 could be rotated so that the longer variable names are shown horizontally instead of vertically, making them easier to read. Additionally, while it isn't a big issue, editing the axis labels in Figures 1 and 2 to add capitalization and remove underscores from variable names would make the figures more presentable. Adding figure captions would be useful for future readers as well.

  4. The classifier you developed looks to be performing quite well in its current state, but the dataset and features used are somewhat limited. If you plan to continue refining this classifier, it may be worth finding a larger dataset to see how well it generalizes across a larger pool of examples, or experimenting with some feature engineering to improve the prediction scores (although they are already quite good).

  5. Overall, very well done! The project was interesting, easy to follow from start to finish, and the instructions provided to reproduce the analysis were very thorough. I also appreciated how organized you kept your repository and that most of the scripts were well documented with comments throughout to aid readability and understanding of the code.
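The axis-label suggestion in point 3 amounts to a one-line fix in most plotting libraries. A minimal matplotlib sketch (the column name and counts are hypothetical, not taken from the project):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical raw column name as it might appear in the dataset
raw_name = "lug_boot"
counts = {"small": 10, "med": 12, "big": 8}

fig, ax = plt.subplots()
# Horizontal bars keep long category names readable
ax.barh(list(counts), list(counts.values()))
# Capitalize and drop underscores: "lug_boot" -> "Lug Boot"
ax.set_ylabel(raw_name.replace("_", " ").title())
ax.set_title("Figure 1: Distribution of luggage boot size")
fig.savefig("figure1.png")
```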

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

pengzh313 commented 1 year ago

Thank you all for sharing your feedback. Please see the improvements that have been completed as per feedback:

  1. Author names have been added in Data_processing.py script. link to commit
  2. All Python dependency versions have been clearly noted. link to commit. R dependencies have been added. link to commit.
  3. Proposal.md file has been added in the doc directory. link to commit
  4. The all target in the Makefile has been simplified. [link to commit]
  5. Updated the Rscript command in the README to render by default. Figure captions now show properly in the output HTML file. link to commit

For a full list of improvements made during Milestone 4, see UBC-MDS/Car-Acceptability-Prediction/issues/19