UBC-MDS / data-analysis-review-2021


Submission: Group 6: Ramen Quality Classification #24

Open Anthea98 opened 3 years ago

Anthea98 commented 3 years ago

Submitting authors: @datallurgy @shyan0903 @Anthea98 @PANDASANG1231

Repository: https://github.com/PANDASANG1231/522_Ramen

Report link: https://github.com/PANDASANG1231/522_Ramen/blob/main/doc/report.html

Abstract/executive summary: In this project, we explored the world of instant noodles, aka ramen, with a dataset containing over 2,500 reviews of all kinds of instant noodles. The main problem we try to solve is finding which features are important for predicting a ramen's rating. We used OneHotEncoder() and CountVectorizer() to transform the data. With a logistic regression model, we obtained an AUC score of 0.722 on the test dataset, and we summarize the top 5 good features and the top 5 bad features in our report. This is not a big question, but for us it is a good start at working through a real-life problem with data science. Considering how useful this model could be for food lovers around the world when choosing nearby ramen, we think this is an interesting and meaningful question.

Editor: @datallurgy @shyan0903 @Anthea98 @PANDASANG1231

Reviewers: Li_Dongxiao, Kaur_Karanpreet, Casoli_Jordan, MORPHY_Adam
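
For reference, a minimal sketch of the kind of preprocessing and model pipeline the abstract describes, assuming scikit-learn's OneHotEncoder, CountVectorizer, and LogisticRegression; the column names used here are hypothetical and may not match the authors' code:

```python
# Sketch of the pipeline described in the abstract (not the authors' code).
# Column names ("Brand", "Style", "Country", "Variety") are assumptions
# based on the Ramen Rater dataset.
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Brand", "Style", "Country"]),
    (CountVectorizer(), "Variety"),  # bag-of-words on the free-text product name
)
pipe = make_pipeline(preprocessor, LogisticRegression(max_iter=1000))
# pipe.fit(X_train, y_train)
# pipe.predict_proba(X_test)[:, 1]  # probabilities used for the AUC score
```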

karanpreetkaur commented 2 years ago

Data analysis review checklist

Reviewer: @karanpreetkaur

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. In the introduction section, your problem statement says that you are interested in the important features for predicting a ramen's rating. The statement seems incomplete: since you are also doing prediction, the problem statement should mention that you will report the important features and then carry out prediction using them. I also observed that you are converting the target into classes, so it would be good to state in the problem statement whether your prediction problem is a classification or a regression problem.

  2. In the methods section, I see the following contradictory statements about class imbalance: "The class distribution for the two classes is 0.7 vs. 0.3, which is reasonable and not a concern for class imbalance. We use the logistic regression model for the prediction problem. We use five fold cross validation to assess the model. Since we have class imbalance, we use AUC as the scoring metric."

    If you have identified class imbalance in your dataset, how do you plan to handle it?

    • You could use logistic regression with class_weight='balanced' for this (see the sketch after this list).
  3. If you have class imbalance, then reporting recall, precision, F1 score, and average precision would make more sense than the AUC score, as these metrics change with the class distribution.

  4. Also, there is no mention of the prediction results in the report. You should state the size of the test set and your model's performance on it. It would be nice to report these using a confusion matrix, which clearly shows the number of misclassifications made by your model.

  5. There are a couple of minor grammatical and formatting errors (e.g., the Fig. 3 caption alignment) in the report that could be improved upon, but otherwise this is an interesting project.
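
A minimal sketch of the suggestions in items 2-4, not the authors' code: weight the classes inside the logistic regression and score the cross-validation with metrics that are sensitive to the class distribution. X_train and y_train are assumed to be the already-preprocessed training features and binary target.

```python
# Sketch of the class-imbalance suggestions: class_weight="balanced" plus
# class-distribution-sensitive scoring metrics in 5-fold cross-validation.
# X_train and y_train are assumed to already exist.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

model = LogisticRegression(class_weight="balanced", max_iter=1000)
cv_results = cross_validate(
    model,
    X_train,
    y_train,
    cv=5,
    scoring=["precision", "recall", "f1", "average_precision", "roc_auc"],
)
# e.g. cv_results["test_f1"].mean() gives the mean cross-validated F1 score
```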

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

dol23asuka commented 2 years ago

Data analysis review checklist

Reviewer: @dol23asuka

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:


  1. Your README.md does not include instructions on how to install and activate the environment needed to run the whole pipeline. When I tried to reproduce your work using the Makefile, it threw an error saying "No module named 'sklearn'". It would be better to provide these instructions in the README.md for clarity.

  2. The naming of your files could be improved; for example, your final report is just named 'final_report'. It might be better to include your project name. Your repository name on GitHub could be more descriptive as well, if possible.

  3. The way you say you handle class imbalance is not really clear: "Since we have class imbalance, we use AUC as the scoring metric." It would be better to elaborate on what AUC is and why exactly you want to use it given class imbalance, which would make more sense to readers who are not familiar with it. Also, the methods and results sections are quite brief; you could discuss more details, such as the rationale behind your choices. Since you are doing binary classification, it would be better to report your prediction results using a confusion matrix, which clearly shows the number of misclassifications made by your model, and to explain your TP, TN, FP, FN counts and how your AUC is calculated from the prediction results (see the sketch after this list).

  4. The quality of your plots and how they are displayed in the final report HTML could be improved, e.g., by rendering them at a higher resolution and aligning them better to reduce visual noise. Also, the references are not sufficient: since you are using many packages for model fitting and prediction, you should clearly cite the tools and packages used.

  5. It would be better to also render an .md file for the final report in your Makefile, because the HTML is not viewable directly on GitHub. It seems you currently render only the HTML file.

  6. In your report, it might be better to discuss future directions, such as how you would address the limitations of your research with better data and methods.
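
A short sketch of the test-set reporting suggested in item 3, assuming a fitted pipeline pipe and held-out X_test, y_test (hypothetical names): the confusion matrix gives the TP/TN/FP/FN counts, and the AUC is computed from the predicted probabilities rather than the hard class labels.

```python
# Sketch of test-set reporting: confusion matrix plus AUC from probabilities.
# pipe, X_test, and y_test are assumed to already exist.
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix, roc_auc_score

y_pred = pipe.predict(X_test)              # hard 0/1 predictions
y_prob = pipe.predict_proba(X_test)[:, 1]  # probability of the positive class

print(confusion_matrix(y_test, y_pred))    # rows = true class, columns = predicted class
print(roc_auc_score(y_test, y_prob))       # AUC is based on the ranking of y_prob
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)  # plot for the report
```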

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

adammorphy commented 2 years ago

Data analysis review checklist

Reviewer: @adammorphy

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

python src/download_data.py --url=https://www.theramenrater.com/wp-content/uploads/2021/09/The-Big-List-All-reviews-up-to-3950.xlsx --out_file=../data/raw/ramen_ratings.csv

While it was possible to reproduce the analysis, it was not easy and took some time because command lines such as the one above were not provided.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

jcasoli commented 2 years ago

Data analysis review checklist

Reviewer: @jcasoli

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Overall, I thought your report was really well done. I also have a key takeaway, which is to look out for Samyang Foods ramen!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

shyan0903 commented 2 years ago

Hi,

Thank you for all your helpful reviews! We have looked at all of them and have decided to incorporate these changes in our project:

  1. Responding to the comments on the report's naming and the missing authors, we have renamed it to "ramen_ratings_report" and added all authors. See commit https://github.com/PANDASANG1231/522_Ramen/commit/b8d5e66fa576357b59225c61c1d40cfe53a6009c

  2. Responding to the comments on reproducibility, we have edited the usage section. See commit: https://github.com/PANDASANG1231/522_Ramen/commit/7954246160b9477b94d4512b0850a1fe3d2e5e46

  3. Responding to the comments on adding a confusion matrix for the test data set, please refer to the commit: https://github.com/PANDASANG1231/522_Ramen/commit/4621dbab6433bcb9b2c94733fb8c5f04771985f3

  4. Responding to the comments on the EDA plots, we have added EDA for top_ten and the target variable and explained the intuition in the report. See commit: https://github.com/PANDASANG1231/522_Ramen/commit/ff6bf9ae8340a3cdb8593c8e0b9be466986519e6

Thank you again for your time and kind reviews!