Anthea98 opened this issue 3 years ago
In the introduction section, your problem statement says you are interested in the important features for predicting a ramen's rating. The statement seems incomplete: since you also carry out prediction, the problem statement should mention that you will report the important features and then make predictions using only those features. I also noticed that you convert the target into classes, so it would be good to state in the problem statement whether your prediction problem is a classification or a regression problem.
In the methods section, I see the following contradictory statements about class imbalance:
> The class distribution for the two classes is 0.7 vs. 0.3, which is reasonable and not a concern for class imbalance. We use the logistic regression model for the prediction problem. We use five fold cross validation to assess the model. Since we have class imbalance, we use AUC as the scoring metric.
If you have identified that your dataset has class imbalance, how are you addressing it?
If you have class imbalance, then reporting recall, precision, F1 score, and average precision would make more sense than the AUC score, as these metrics change with the class distribution.
Also, there is no mention of the predicted results in the report. You should state the size of the test set and your model's performance on it. It would be nice to report the results with a confusion matrix, which clearly shows the number of misclassifications made by your model.
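For illustration, here is a minimal sketch of how those metrics could be reported with scikit-learn (the names `pipe`, `X_test`, and `y_test` are hypothetical placeholders for the project's fitted model and test split):

```python
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    average_precision_score,
)

# `pipe`, `X_test`, `y_test` are placeholders for the fitted model and test split
y_pred = pipe.predict(X_test)
y_score = pipe.predict_proba(X_test)[:, 1]  # probability of the positive class

# Confusion matrix: shows TP, FP, TN, FN counts at a glance
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)

# Per-class precision, recall, and F1
print(classification_report(y_test, y_pred))

# Average precision summarizes the precision-recall trade-off
print("Average precision:", average_precision_score(y_test, y_score))
```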
There are a couple of minor grammatical and formatting errors (e.g., the Fig. 3 caption alignment) in the report that could be improved, but otherwise this is an interesting project.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Your README.md does not include instructions on how to install and activate the environment needed to run the whole pipeline. When I tried to reproduce your work using the Makefile, it threw an error: "No module named 'sklearn'". It would be better to provide these instructions in the README.md for clarity.
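For example, the README could include something like the following (the environment file and environment name are assumptions about the repo's setup):

```bash
# Hypothetical setup commands; adjust the names to match the repository
conda env create -f environment.yml   # installs dependencies, incl. scikit-learn
conda activate ramen                  # activate before running `make all`
```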
The naming of your files could be improved. For example, your final report is named just 'final_report', with nothing else; it might be better to include your project name. Your GitHub repo name could be more descriptive as well, if possible.
The way you handle class imbalance is not really clear: "Since we have class imbalance, we use AUC as the scoring metric." It would be better to elaborate on AUC and why exactly you want to use it given class imbalance, which would make more sense to readers who are not familiar with it. Also, the methods and results sections are quite brief; you could discuss more details, such as the rationale behind your choices. Since you are doing binary classification, it would be better to report your prediction results in a confusion matrix, which clearly shows the number of misclassifications made by your model, and to explain your TP, TN, FP, and FN values and how your AUC is calculated from the prediction results.
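To illustrate what such an explanation could build on, here is a sketch that plots the ROC curve underlying the AUC score (again, `y_test` and `y_score` are placeholders for the test labels and predicted probabilities):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# `y_test`, `y_score` are placeholders for test labels and predicted probabilities
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

# The ROC curve traces true-positive rate vs. false-positive rate across all
# thresholds; AUC is the area under it (1.0 = perfect, 0.5 = random guessing).
plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```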
The quality and layout of the plots in the final report HTML could be improved: render them at higher resolution and align them better to reduce visual noise. Also, the references are incomplete; since you use many packages for model fitting and prediction, you should clearly cite those tools and packages.
It would also be better to render an .md file of the final report in your Makefile, because the HTML file is not viewable on GitHub. It seems you only render the HTML file.
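If the report source is a Jupyter notebook, for instance, the Makefile could call something like this (the file names are hypothetical):

```bash
# Hypothetical command, assuming the report is built from a notebook
jupyter nbconvert --to markdown doc/final_report.ipynb   # emits doc/final_report.md
```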
In your report, it might be better to discuss future directions, such as how you would address the limitations of your research with better data and methods.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
python src/download_data.py --url=https://www.theramenrater.com/wp-content/uploads/2021/09/The-Big-List-All-reviews-up-to-3950.xlsx --out_file=../data/raw/ramen_ratings.csv
While it was possible to reproduce the analysis, it was not easy and took some time because command lines such as the one above were not provided.
There is no exact link to the data source, as the given link does not connect to the data, which is an inconvenience for reproducibility. It should be https://www.theramenrater.com/wp-content/uploads/2021/09/The-Big-List-All-reviews-up-to-3950.xlsx
The report itself does not include the authors' names; they appear only in the README file.
The usage shown for the EDA script does not match the script itself: it is called generate_EDA_figures.py, but the usage states create_EDA_figures.py.
The dependencies and software you used should be cited in-text in the report, for example the scikit-learn and pandas packages, or the others used for your visualisations, like the word map.
You first state that there is "not a concern for class imbalance", then follow this with "since we have class imbalance, we use AUC as the scoring metric", which is a contradiction.
Fig. 3 could be centered, as it is the only figure that is not. In addition, while there is a clear attempt at making the writing engaging, there are some grammatical errors that could be improved, for example "It look like most ramens are quite tasty". Most of these are minor English mistakes and could be fixed quickly.
Since it is the main topic of your paper, it may be useful to elaborate more on the coefficients and state how exactly the important features impact the classification.
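As a sketch of the kind of coefficient summary that would support this (extracting feature names assumes a scikit-learn ColumnTransformer-style preprocessor; `preprocessor` and `lr` are placeholders for the project's fitted objects):

```python
import pandas as pd

# `preprocessor` and `lr` are placeholders for the fitted transformer and
# the LogisticRegression model.
coefs = pd.Series(
    lr.coef_[0],
    index=preprocessor.get_feature_names_out(),
).sort_values()

# In logistic regression, a positive coefficient raises the log-odds of the
# positive ("good ramen") class; exp(coef) gives an odds-ratio interpretation.
print(coefs.tail(5))  # top 5 features pushing toward a high rating
print(coefs.head(5))  # top 5 features pushing toward a low rating
```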
While the report specifies feature importance as the primary research question, the prediction results would also be useful to interpret in a confusion matrix. It helps the reader understand how many TP, FP, TN, and FN classifications the model made. Precision, recall, or F1 scores might also be useful in this case.
The AUC score was never interpreted beyond "AUC score of 0.722 on the test data, which is good enough for a simple model like ours". You could include a sentence describing what the score actually means, and why this score is suitable for your analysis (i.e., did you test other models through hyperparameter optimization or feature selection and were unable to improve the score?).
This was derived from the JOSE review checklist and the ROpenSci review checklist.
1.5 hours
In the "Methods" section of your report, you describe using 5-fold cross-validation to assess the model, but I wasn't actually able to find any cross-validation in your source code. Maybe have a quick look? If I somehow missed it, be careful: you are doing your full data pre-processing/transformation first and then passing the transformed data into train_model.py, which has the potential to break the golden rule during cross-validation.
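One common way to keep cross-validation leak-free is to put the transformers and the model into a single pipeline, so the preprocessing is re-fit on each training fold. A minimal sketch, where `preprocessor` stands in for the project's actual OneHotEncoder/CountVectorizer setup:

```python
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# `preprocessor` is a placeholder for the project's column transformers.
# Inside the pipeline, each CV fold fits the transformers on that fold's
# training portion only, so no information leaks from the validation fold.
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="roc_auc")
print("Mean CV AUC:", scores.mean())
```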
While the figures you chose to display in the EDA section were interesting, I'm not sure how they helped to steer the direction of your analysis. For example, for the "ramen package style" categorical variable, you could show a boxplot where the ramen package style is on the x-axis, and average rating is on the y-axis. This would give some early indication as to whether ramen package style is an important feature for predicting rating.
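With pandas this could be as simple as the following (the column names "Stars" and "Style" are guesses at the dataset's schema):

```python
import matplotlib.pyplot as plt

# `ramen` is a placeholder DataFrame; "Stars" and "Style" are assumed to be
# the rating and package-style columns in the raw data.
ramen.boxplot(column="Stars", by="Style")
plt.ylabel("Rating")
plt.show()
```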
In the methods section you mention dropping the top_ten feature. It might be helpful to give a quick explanation of what this feature is and why you decided to drop it.
I was able to get all your code to run. Nice! One thing that would probably make people's lives easier is if you actually listed the terminal commands in text format rather than just in the image/diagram. That said, the diagram was really helpful for me to quickly understand how the whole pipeline works!
Even though the primary goal was to be able to identify important features, I think presenting some additional figures in the results section would really add to your report! For example, you could show a confusion matrix for how your model performed on the test data. Or now that we've heard about the SHAP library you could show some really cool figures that better communicate feature importance!
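As a sketch of what the SHAP suggestion might look like (assuming the fitted linear model and dense, transformed feature matrices are available; all names are placeholders):

```python
import shap

# `lr` is the fitted LogisticRegression; `X_train_enc` / `X_test_enc` are the
# transformed feature matrices (SHAP works on the numeric, post-transform data).
explainer = shap.Explainer(lr, X_train_enc)
shap_values = explainer(X_test_enc)

# Beeswarm plot: each dot is one ramen; its position shows how much that
# feature pushed the prediction toward a high or low rating.
shap.plots.beeswarm(shap_values)
```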
Overall I thought your report was really well done. I also have a key takeaway which is to look for Samyang Foods ramen!
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Hi,
Thank you for all your helpful reviews! We have looked at all of them and have decided to incorporate these changes in our project:
Responding to comments on the report's naming and the missing authors, we have renamed it to "ramen_ratings_report" and added all authors. See commit: https://github.com/PANDASANG1231/522_Ramen/commit/b8d5e66fa576357b59225c61c1d40cfe53a6009c
Responding to comments on reproducibility, we have edited the usage section. See commit: https://github.com/PANDASANG1231/522_Ramen/commit/7954246160b9477b94d4512b0850a1fe3d2e5e46
Responding to comments on adding confusion matrix for the test data set, please refer to the commit: https://github.com/PANDASANG1231/522_Ramen/commit/4621dbab6433bcb9b2c94733fb8c5f04771985f3
Responding to comments on the EDA plots, we have added EDA for the top_ten feature and the target variable and explained the intuition in the report. See commit: https://github.com/PANDASANG1231/522_Ramen/commit/ff6bf9ae8340a3cdb8593c8e0b9be466986519e6
Thank you again for your time and kind reviews!
Submitting authors: @datallurgy @shyan0903 @Anthea98 @PANDASANG1231
Repository: https://github.com/PANDASANG1231/522_Ramen
Report link: https://github.com/PANDASANG1231/522_Ramen/blob/main/doc/report.html
**Abstract/executive summary:** In this project, we explored the world of instant noodles, aka ramen, with a dataset containing over 2,500 reviews of all kinds of instant noodles. The main problem we try to solve is finding which features are important for predicting a ramen's rating. We used OneHotEncoder() and CountVectorizer() to transform the data. With a logistic regression model, we obtained an AUC score of 0.722 on the test dataset, and we summarize the top 5 good features and the top 5 bad features in our report. This is not a big question, but for us it is a good start at working out real-life problems with data science. Considering the usefulness of this model for food lovers around the world when choosing nearby ramen restaurants, we think this is a very interesting and meaningful question.
Editor: @datallurgy @shyan0903 @Anthea98 @PANDASANG1231
Reviewers: Li_Dongxiao, Kaur_Karanpreet, Casoli_Jordan, MORPHY_Adam