UBC-MDS / data-analysis-review-2022


Submission: Group 19 – Car Acceptability Predictor #6

Open ataciuk opened 1 year ago

ataciuk commented 1 year ago

Submitting authors: @ataciuk @louiewang820 @pengzh313 @lisaseq

Repository: https://github.com/UBC-MDS/Car-Acceptability-Prediction Report link: https://github.com/UBC-MDS/Car-Acceptability-Prediction/blob/main/doc/car_popularity_prediction_report.md Abstract/executive summary:

Our research question: Given attributes of car models from the 90s, can we predict how popular each car model will be?

Expecting growing demand for used vehicles in our business, we were interested in building a classification machine learning model to predict the popularity of a given used car. The training data came from a publicly available 1990s Car Evaluation dataset with 1728 observations, 6 categorical features, and 1 categorical target. We performed data reading, data splitting, and Exploratory Data Analysis (EDA) in Python. An OrdinalEncoder transformer was then applied to preprocess the 6 features, and we used four scikit-learn classifiers (DummyClassifier, DecisionTreeClassifier, RandomForestClassifier, and MultinomialNB) in cross-validation with balanced_accuracy as the scoring metric. RandomForestClassifier had the best validation score, so we performed hyperparameter optimization on it. Finally, the optimized classifier was applied to the test data, yielding an optimistic test score of 0.965. Scripts to run the analysis were created with Docopt.
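The workflow described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration only: the column names come from the public UCI Car Evaluation dataset, a tiny synthetic sample stands in for the real 1728-row file, and all parameter values are assumptions rather than the project's actual code.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Assumed column names from the UCI Car Evaluation dataset;
# a small synthetic sample stands in for the real data.
features = ["buying", "maint", "doors", "persons", "lug_boot", "safety"]
data = pd.DataFrame({
    "buying":   ["low", "high", "med", "low", "vhigh", "med"] * 20,
    "maint":    ["low", "med", "high", "vhigh", "low", "med"] * 20,
    "doors":    ["2", "3", "4", "5more", "2", "4"] * 20,
    "persons":  ["2", "4", "more", "2", "4", "more"] * 20,
    "lug_boot": ["small", "med", "big", "small", "med", "big"] * 20,
    "safety":   ["low", "med", "high", "low", "med", "high"] * 20,
    "class":    ["unacc", "acc", "good", "unacc", "acc", "vgood"] * 20,
})

# 90% / 10% train-test split, as in the report
train_df, test_df = train_test_split(data, test_size=0.1, random_state=522)
X_train, y_train = train_df[features], train_df["class"]

# OrdinalEncoder on all six categorical features, then one of the classifiers
preprocessor = make_column_transformer(
    (OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
     features))
pipe = make_pipeline(preprocessor, RandomForestClassifier(random_state=522))

# Cross-validation scored with balanced accuracy, as the report describes
scores = cross_validate(pipe, X_train, y_train, cv=5,
                        scoring="balanced_accuracy")
print(scores["test_score"].mean())
```

In the actual project the same cross-validation loop would be repeated for each of the four candidate classifiers before selecting the best one.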

Editor: @flor14 Reviewers: Jakob Thoms, Chenyang Wang, Yingxin Song, Mike Guron

wakesyracuse7 commented 1 year ago

Data analysis review checklist

Reviewer: @wakesyracuse7

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

  1. The proposal is easy to understand and well structured, and I like the "zero-emission" idea from the introduction. Green ideas may be the solution to many problems in the current world, so the question is meaningful.

  2. Although it is not a big deal, the plots generated from the EDA could be refined, for example by including a "Figure 1" caption under each plot so the reader can clearly and quickly find the plot the text refers to, and by adjusting the plot sizes.

  3. It is not a big deal, but the code style/format differs a little from script to script. I understand this is because the scripts were written by different people, but it may be better to follow the same style guidelines throughout.

  4. I like the limitations and future improvements part of the report. Indeed, as the introduction says, since cars will eventually be zero-emission, people will be interested in used cars as collectibles. However, this dataset only consists of 1728 observations collected in the 90s, so it may be better to combine it with another dataset of cars from the 80s and 70s for the question of interest. After all, I think cars from the 80s and 70s are better suited for collecting.

  5. All the file names, folder names, and function names are meaningful and make it easy to figure out the contents, and I appreciated the test functions, which are good for reproducibility.


Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

YXIN15 commented 1 year ago

Data analysis review checklist

Reviewer: @YXIN15

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

j99thoms commented 1 year ago

Data analysis review checklist

Reviewer: @J99thoms

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5

Review Comments:

random_search = RandomizedSearchCV(pipe_rf, param_dist, cv=5, n_iter=100,
        n_jobs=-1, verbose=1, scoring="balanced_accuracy", random_state=522)

and

final_model = make_pipeline(preprocessor, RandomForestClassifier(
        random_state=522, n_estimators=best_n_estimators, max_depth=best_max_depth))
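For context, the two snippets above connect through the fitted search object, whose best_params_ supply the values for the final model. A minimal self-contained sketch of that step (the pipeline, the parameter grid, and names such as pipe_rf are assumptions here, not the project's exact code; make_pipeline names steps after the lowercased class name, hence the randomforestclassifier__ prefix):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a hypothetical pipeline standing in for the project's pipe_rf
X, y = make_classification(n_samples=200, random_state=522)
pipe_rf = make_pipeline(StandardScaler(),
                        RandomForestClassifier(random_state=522))

# Hyperparameters are addressed with the auto-generated step-name prefix
param_dist = {
    "randomforestclassifier__n_estimators": [10, 50, 100],
    "randomforestclassifier__max_depth": [3, 5, None],
}
random_search = RandomizedSearchCV(
    pipe_rf, param_dist, cv=5, n_iter=5, n_jobs=-1,
    scoring="balanced_accuracy", random_state=522)
random_search.fit(X, y)

# Pull the tuned values out of best_params_ to build the final model
best_n_estimators = random_search.best_params_[
    "randomforestclassifier__n_estimators"]
best_max_depth = random_search.best_params_[
    "randomforestclassifier__max_depth"]

final_model = make_pipeline(StandardScaler(), RandomForestClassifier(
    random_state=522, n_estimators=best_n_estimators,
    max_depth=best_max_depth))
final_model.fit(X, y)
```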

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

mikeguron commented 1 year ago

Data analysis review checklist

Reviewer: @mikeguron

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:


  1. Similar to the reviewer above, while doing my initial read-through of your project I also noticed that your README has a section that repeats some details regarding the EDA (see below):

"The first thing we will do is split our data into test and training sets (with a 90% to 10% split). Next we will perform some exploratory data analysis on the train dataset, by looking to see if any features contain any missing or null values, their respective data types, the split in our target categories (to watch out for any class imbalances).

We will then perform exploratory data analysis (EDA). We will review the distribution of our target class (to see if we have a class imbalance)..."

While not a significant error, the README file is one of the first things that readers see and read when they view your project so it would aid in making a great first impression if this section were revised.

  2. Upon reading your Team Code of Conduct, it appears that some sections may have been meant to be formatted as a bullet-point list; however, the different points all appear as one long sentence/paragraph (see below for an example), which hinders the readability of the code of conduct:

"Using welcoming and inclusive language Listening when others are talking No bad ideas! Considering others perspectives Being respectful of differing viewpoints and experiences Gracefully accepting constructive criticism Focusing on what is best for the team Showing empathy to others Apologizing respectfully Using each others' preferred pronouns"

  3. Some of the EDA figures in the Results & Discussion section of the report could be developed further to adhere to proper data visualization guidelines and improve their readability. For example, Figure 1 could be rotated so that the longer variable names are shown horizontally instead of vertically, making them easier to read. Additionally, while it isn't a big issue, editing the axis labels in Figures 1 and 2 to add capitalization and remove underscores from variable names would make the figures more presentable. Adding figure captions would be useful for future readers as well.

  4. The classifier you developed looks to be performing quite well in its current state, but the dataset and features used are somewhat limited. If you plan to continue refining this classifier, it may be worth finding a larger dataset to see how well it generalizes across a larger pool of examples, or experimenting with some feature engineering to improve the prediction scores (although they are already quite good).

  5. Overall, very well done! The project was interesting, easy to follow from start to finish, and the instructions provided to reproduce the analysis were very thorough. I also appreciated how organized you kept your repository and that most of the scripts were well documented with comments throughout to aid readability and understanding of the code.
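The axis-label suggestion in point 3 amounts to a one-line fix in most plotting libraries. A minimal matplotlib sketch (the column name and counts are hypothetical, not taken from the project):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Hypothetical raw column name as it might appear in the dataset
raw_name = "lug_boot"
counts = {"small": 10, "med": 12, "big": 8}

fig, ax = plt.subplots()
# Horizontal bars keep long category names readable
ax.barh(list(counts), list(counts.values()))
# Capitalize and drop underscores: "lug_boot" -> "Lug Boot"
ax.set_ylabel(raw_name.replace("_", " ").title())
ax.set_title("Figure 1: Distribution of luggage boot size")
fig.savefig("figure1.png")
```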

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

pengzh313 commented 1 year ago

Thank you all for sharing your feedback. Please see the improvements that have been completed as per feedback:

  1. Author names have been added in Data_processing.py script. link to commit
  2. All Python dependency versions have been clearly noted. link to commit. R dependencies have been added. link to commit.
  3. Proposal.md file has been added in the doc directory. link to commit
  4. The all target in the Makefile has been simplified. [link to commit]
  5. Updated the Rscript command in the README to render by default. Figure captions now show properly in the output HTML file. link to commit

For a full list of improvements made during Milestone 4, see UBC-MDS/Car-Acceptability-Prediction/issues/19