UBC-MDS / data-analysis-review-2022


Submission: Group 05: chocolate_rating #16

Open eyrexh opened 1 year ago

eyrexh commented 1 year ago

Submitting authors: @markusnam @robindhillon1 @eyrexh @LishaGG

Repository: https://github.com/UBC-MDS/chocolate_rating

Report link: https://github.com/UBC-MDS/chocolate_rating/blob/main/doc/chocolate_rating.html

Abstract/executive summary: Here we attempt to build a numeric chocolate rating prediction model by evaluating Support Vector Regression (SVR) and Ridge models on chocolate-related data such as manufacturer, country of bean origin, cocoa percentage and most memorable characteristics. Our best model (SVR) performs fairly well on an unseen test data set: the mean absolute percentage error (MAPE) of SVR is 7.99%, compared with 8.22% for the Ridge model. From examining the coefficients generated by the Ridge model, we found that the “raspberry” flavour characteristic and the “Fruition” chocolate manufacturer have the highest positive coefficients, while the “medicinal” and “chemical” flavour characteristics have the lowest negative coefficients. The data set used in this project was compiled by Brady Brelinski of the Manhattan Chocolate Society, and can be sourced here. Each row in the data set represents an observation of a chocolate product with information such as manufacturer, company location, review date, country of bean origin, specific bean origin or bar name, cocoa percent, ingredients, most memorable characteristics and rating.
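The SVR-vs-Ridge comparison described in the abstract can be sketched with scikit-learn. This is a minimal illustration on synthetic data only (the feature matrix and target below are stand-ins, not the chocolate data set or the authors' actual pipeline):

```python
# Minimal sketch of comparing SVR and Ridge by MAPE on held-out data.
# Synthetic stand-in data; not the real chocolate features or targets.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                      # stand-in for encoded features
y = 3.0 + X @ rng.normal(size=5) + rng.normal(scale=0.2, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("SVR", SVR()), ("Ridge", Ridge())]:
    model.fit(X_train, y_train)
    mape = mean_absolute_percentage_error(y_test, model.predict(X_test))
    print(f"{name}: MAPE = {mape:.2%}")
```

With the real data, the same loop structure lets both models be scored on identical train/test splits, which is what makes the 7.99% vs 8.22% comparison meaningful.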

Editor: @flor14 Reviewers: Spencer Gerlach, Austin Shih, Dhruvi Nishar, Alexander Taciuk

spencergerlach commented 1 year ago

Data analysis review checklist

Reviewer: spencergerlach

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

Things I liked a lot:

Suggested Improvements:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

austin-shih commented 1 year ago

Data analysis review checklist

Reviewer: austin-shih

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

  1. Good, concise introduction to the project and what is to be predicted. The use of a visual flowchart helps to further convey the thought process and analysis steps. I would avoid using the term 'Golden Rule' unless it is explicitly defined; this term is only used in the context of this program's ML course and may be confusing to other readers.
  2. Could go into a little more detail on why the MAPE score is used as the prediction metric. Compare and contrast with other metrics and explain how they would affect the results.
  3. It might be a good idea to include a small sample of the data set in the report; it would make the 'DATA' section more intuitive and easier to follow. The preprocessing steps should also be mentioned somewhere in the report to give the reader a better understanding of which features the project deems important.
  4. Very good use of visuals in presenting the results and comparing predictions from different models. One thing to note on using the MAPE score: the prediction scores may not be very intuitive for people without an ML background, since it reports an error rather than an accuracy. The 'Future Improvements' section gives very good suggestions for further development of this project.
  5. Overall a very good project. The question statement is clear and gives an adequate explanation of how to get to a result. There appear to be many more figures in the results folder, which means a lot more insight could be added to the report.
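On point 2 above, the trade-offs between metrics are easy to show side by side. A quick illustrative sketch (the ratings below are hypothetical, not taken from the report):

```python
# Illustrative comparison of MAPE against MAE and RMSE on the same
# hypothetical predictions. MAPE is scale-free (a percentage), MAE is
# in rating units, and RMSE penalizes large errors more heavily.
import numpy as np
from sklearn.metrics import (mean_absolute_error,
                             mean_absolute_percentage_error,
                             mean_squared_error)

y_true = np.array([3.25, 3.50, 2.75, 4.00, 3.00])  # hypothetical ratings
y_pred = np.array([3.00, 3.75, 3.00, 3.50, 3.10])  # hypothetical predictions

mape = mean_absolute_percentage_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5
print(f"MAPE={mape:.2%}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```

Reporting MAE alongside MAPE would also let readers interpret the error directly in rating units.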

Good job!

ataciuk commented 1 year ago

Data analysis review checklist

Reviewer: @ataciuk

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

Things that were good about the project:

  1. Big fan of the pipeline flow chart.
  2. The repo is well organized.
  3. The EDA was informative and useful at feeding into the model.

Suggestions for improvement:

  1. Your research question should be more prominent. It is in the middle of the intro paragraph – it should be either the top or the bottom of that paragraph and ideally bolded or otherwise highlighted. The reader shouldn't have to work to find it.
  2. Your writing flips between first person and passive voice; I recommend using first person. "We did hyperparameter optimization via..." is more effective than the passive "The hyperparameter optimization is done via...".
  3. The scripts could use fewer options. For example, the summary script could use just two options: the output directory and the input directory where files are stored. The script could then refer to the specific file names in the main function. This would reduce the possibility of a mistake when running the script.
  4. I would add the report as a Markdown file, not just the .html, as .html files don't render properly on GitHub.
  5. In the dependencies section of the README, I would note that there is a conda environment in /src/ for quick reference.

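Suggestion 3 above might look something like the following. This is only a sketch; the script and file names are hypothetical, not the repo's actual ones:

```python
# Sketch of a summary script that exposes only two CLI options and
# resolves specific file names itself inside main(). File names below
# are assumptions for illustration.
import argparse
import os

def main(input_dir, output_dir):
    # The script owns its file names, so callers can't mistype them.
    train_path = os.path.join(input_dir, "train.csv")        # assumed name
    summary_path = os.path.join(output_dir, "summary.csv")   # assumed name
    print(f"reading {train_path}, writing {summary_path}")

parser = argparse.ArgumentParser(description="Summarize chocolate ratings")
parser.add_argument("--input-dir", required=True, help="directory with input files")
parser.add_argument("--output-dir", required=True, help="directory for outputs")

# Example invocation with an explicit argv list:
args = parser.parse_args(["--input-dir", "data", "--output-dir", "results"])
main(args.input_dir, args.output_dir)
```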
dhruvinishar commented 1 year ago

Reviewer: dhruvinishar

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

  1. I really like the project so far. The project workflow is very well designed, and it is easy to recreate and reproduce the results of the analysis thanks to very clear instructions on how to install dependencies and set up the analysis. The EDA is also very well designed and conveys the results in an informative way that is easy to follow.
  2. The research question, however, is not clearly defined or emphasized in the report introduction or in the findings of the data analysis.
  3. Adding tests for your code would also help improve its reproducibility.
  4. I would also have liked to know more about why you chose to report the MAPE and what the MAPE scores indicate about your analysis and results. Interpreting the results with MAPE scores and explaining what they mean would help convey your results to a more general audience as well.
  5. Some scripts do not have code abstracted into functions. For instance, the code in the main functions of model_svr.py and rating_eda.py could be broken into smaller functions instead of living entirely in main(). This would give the scripts a more structured style.
  6. The project analysis report is otherwise very well documented: it covers all the required rubrics, and the visuals are very helpful. The data set you chose is also very interesting, and the results were presented in an extremely clean and concise format.
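To illustrate point 5 above, a monolithic main() can be split into small, independently testable steps. The function names here are hypothetical, not taken from model_svr.py or rating_eda.py:

```python
# Sketch of pulling steps out of a monolithic main() into small
# functions, each of which can be unit-tested on its own.
import csv
import io

def load_ratings(text):
    """Parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def clean_ratings(rows):
    """Keep only rows that have a non-empty rating."""
    return [r for r in rows if r.get("rating")]

def mean_rating(rows):
    """Average the rating column."""
    return sum(float(r["rating"]) for r in rows) / len(rows)

def main(text):
    rows = clean_ratings(load_ratings(text))
    print(f"{len(rows)} rows, mean rating {mean_rating(rows):.2f}")

sample = "company,rating\nFruition,4.0\nAcme,\nOther,3.0\n"
main(sample)
```

This structure also pairs naturally with point 3: each helper can get its own pytest case without invoking the whole script.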

eyrexh commented 1 year ago