ataciuk opened 1 year ago
The proposal is easy to understand and well structured, and I like the "zero-emission" idea from the introduction. Going green may be the solution to many problems in the world today, so the question is meaningful.
Although it is not a big deal, the plots generated from the EDA could be refined: adding a caption such as "Figure 1" under each plot would let readers quickly find the plot the text refers to, and resizing the plots would make them look better.
It is not a big deal, but the code style/format differs a little from script to script. I understand this is because the scripts were written by different people, but it might be better to bring them in line with a single set of style guidelines.
I like the limitations and future improvements part of the report. Indeed, as the introduction says, once cars are finally zero-emission, people will be interested in used cars as collectibles. However, this dataset consists of only 1728 observations collected in the 90s, so it may be better to combine it with another dataset of cars from the 80s and 70s for the purposes of the research question. After all, I think cars from the 80s and 70s are better suited to collecting.
All the file names, folder names, and function names are meaningful and make it easy to figure out the contents inside, and I appreciated the test functions, which are good for reproducibility.
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Your README has a nice layout.
You repeat yourself in this part of your README:
"Next we will perform some exploratory data analysis on the train dataset, by looking to see if any features contain any missing or null values, their respective data types, the split in our target categories (to watch out for any class imbalances).
We will then perform exploratory data analysis (EDA). We will review the distribution of our target class (to see if we have a class imbalance) and ... "
In the summary section of the report you talk about objects from sklearn, such as `OrdinalEncoder` and `MultinomialNB`, but never explicitly mention that sklearn is being used. It could be a good idea to add that somewhere.
Your repo is very well organized and easy to navigate.
Your instructions for reproducing your analysis worked perfectly for me.
Your code has lots of comments which makes it easy to read and understand quickly.
Your code is fairly tidy, but it does not adhere to style guides in all places. For example:
    random_search = RandomizedSearchCV(pipe_rf, param_dist, cv=5,
        n_iter=100,n_jobs=-1, verbose=1,scoring = "balanced_accuracy",random_state = 522)

and

    final_model = make_pipeline(preprocessor, RandomForestClassifier(random_state=522,
        n_estimators= best_n_estimators, max_depth= best_max_depth))
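For reference, here is a sketch of how these two calls might look when formatted to follow PEP 8. The `pipe_rf`, `param_dist`, and preprocessor definitions below are illustrative placeholders, since I don't have the project's actual objects in front of me; the `68` and `12` values are the best hyperparameters reported in the analysis:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Placeholder pipeline and search space (the project's actual values differ)
pipe_rf = make_pipeline(OrdinalEncoder(), RandomForestClassifier(random_state=522))
param_dist = {
    "randomforestclassifier__n_estimators": list(range(10, 300, 10)),
    "randomforestclassifier__max_depth": list(range(1, 21)),
}

# One argument per line, spaces around "=" removed for keyword arguments
random_search = RandomizedSearchCV(
    pipe_rf,
    param_dist,
    cv=5,
    n_iter=100,
    n_jobs=-1,
    verbose=1,
    scoring="balanced_accuracy",
    random_state=522,
)

best_n_estimators, best_max_depth = 68, 12  # values reported in the analysis
final_model = make_pipeline(
    OrdinalEncoder(),
    RandomForestClassifier(
        random_state=522,
        n_estimators=best_n_estimators,
        max_depth=best_max_depth,
    ),
)
```

Tools like `black` or `flake8` can enforce this kind of formatting automatically across all scripts.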
This line of your report doesn't flow very well and I would consider rewording it:
" As we can see from Table 2, the test score has been further improved from 0.959 to 0.968, with the best hyperparameters (param_rfn_estimators equals to 68 and param_rfmax_depth equals to 12)."
Nice work overall!
"The first thing we will do is split our data into test and training sets (with a 90% to 10% split). Next we will perform some exploratory data analysis on the train dataset, by looking to see if any features contain any missing or null values, their respective data types, the split in our target categories (to watch out for any class imbalances).
We will then perform exploratory data analysis (EDA). We will review the distribution of our target class (to see if we have a class imbalance)..."
While not a significant error, the README file is one of the first things that readers see and read when they view your project so it would aid in making a great first impression if this section were revised.
"Using welcoming and inclusive language
Listening when others are talking
No bad ideas!
Considering others' perspectives
Being respectful of differing viewpoints and experiences
Gracefully accepting constructive criticism
Focusing on what is best for the team
Showing empathy to others
Apologizing respectfully
Using each others' preferred pronouns"
Some of the EDA figures in the Results & Discussion section of the report could be developed further to adhere to proper data visualization guidelines and improve their readability. For example, Figure 1 could be rotated so that the longer variable names are shown horizontally, instead of vertically, making them easier to read. Additionally, while it isn't a big issue, editing the axis labels in Figures 1 and 2 to add capitalization and remove underscores from variable names would make the figures more presentable. The addition of figure captions would be useful as well for future readers.
The classifier you developed looks to be performing quite well in its current state, but the dataset and features used are somewhat limited. If you plan on continuing to refine this classifier, it may be worth looking for a larger dataset to see how well the model generalizes across a larger pool of examples, or experimenting with some feature engineering to improve the prediction scores (although they are already quite good).
Overall, very well done!! The project was interesting, easy to follow along with from start to finish, and the instructions provided to reproduce the analysis were very thorough. I also appreciated how organized you kept your repository and that most of the scripts were well documented with comments throughout to aid in readability and understanding of the code.
Thank you all for sharing your feedback. Please see the improvements that have been completed in response:
For a full list of improvements made during Milestone 4, see UBC-MDS/Car-Acceptability-Prediction#19
Submitting authors: @ataciuk @louiewang820 @pengzh313 @lisaseq
Repository: https://github.com/UBC-MDS/Car-Acceptability-Prediction
Report link: https://github.com/UBC-MDS/Car-Acceptability-Prediction/blob/main/doc/car_popularity_prediction_report.md
Abstract/executive summary:
Our research question: Given attributes of car models from the 90s, can we predict how popular each car model will be?
Expecting a growing demand for used vehicles in our business, we are interested in building a classification machine learning model to predict the popularity of a given used car. The training data came from a publicly available 1990s Car Evaluation dataset with 1728 observations, 6 categorical features, and 1 categorical target. We performed data reading, data splitting, and Exploratory Data Analysis (EDA) in the Python environment. After that, an OrdinalEncoder transformer was applied to pre-process the 6 features, and we used four different scikit-learn classifiers, DummyClassifier, DecisionTreeClassifier, RandomForestClassifier, and MultinomialNB, to conduct cross-validation with balanced_accuracy as the scoring metric. The results showed that RandomForestClassifier had the best validation score, so we performed hyperparameter optimization on it. At the end of the analysis, our best optimized classifier was applied to the test data, yielding a promising test score of 0.965. Scripts to run the analysis were created with Docopt.
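The model-comparison step described in the abstract can be sketched roughly as below. The toy data, category counts, and model settings here are illustrative stand-ins only, not the project's actual Car Evaluation data or code:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(522)
# Toy stand-in: six categorical features, one categorical target
X = rng.integers(0, 4, size=(200, 6)).astype(str)
y = rng.integers(0, 4, size=200)

models = {
    "dummy": DummyClassifier(),
    "decision_tree": DecisionTreeClassifier(random_state=522),
    "random_forest": RandomForestClassifier(random_state=522),
    "naive_bayes": MultinomialNB(),
}

# Ordinal-encode the categorical features, then cross-validate each
# candidate classifier with balanced accuracy as the scoring metric
scores = {}
for name, model in models.items():
    pipe = make_pipeline(OrdinalEncoder(), model)
    scores[name] = cross_val_score(
        pipe, X, y, cv=5, scoring="balanced_accuracy"
    ).mean()
```

The classifier with the highest mean validation score would then be carried forward to hyperparameter optimization, as the abstract describes.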
Editor: @flor14 Reviewer: Jakob Thoms, Chenyang Wang, Yingxin Song, Mike Guron