UBC-MDS / data-analysis-review-2021


Submission: Group 13: Predicting Wine Quality #20

Open NikitaShymberg opened 2 years ago

NikitaShymberg commented 2 years ago

Submitting authors: @NikitaShymberg @gutermanyair @aldojasb @SonQBChau
Repository: https://github.com/UBC-MDS/predicting_wine_quality
Report link: https://github.com/UBC-MDS/predicting_wine_quality/blob/main/doc/Quality_white_wine_predictor.pdf
Abstract/executive summary:

This report uses the white wine database from "vinho verde" to predict the quality based on physicochemical properties. Quality is a subjective measure, given by the average grade of three experts.

Before starting the predictions, the report performs an exploratory data analysis (EDA) to look for features that may provide good prediction results, and also gives a short explanation of the metrics used in the models. In data preparation, the dataset is downloaded and processed in Python. In this phase, the training and testing sets are created; they will be used during model building.

There's a brief explanation of the models used in this report. Other important machine learning concepts, such as ensemble and cross validation, are also discussed.

The results section presents the best model for predicting quality and discusses why it was chosen for this purpose.

Editor: @flor14 Reviewer:

thayeylolu commented 2 years ago

Data analysis review checklist

Reviewer: @thayeylolu

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour 20 mins

Review Comments:

(Note: these comments are time-based.)

  1. I suggest using backticks to enclose the code snippets in the usage section of the README.md to improve readability.
  2. I commend you on the documentation (the docstrings of your functions); it looks neat. However, I noticed that split.py doesn't check that the expected input file is a .csv file. What happens when it is not? I suggest writing a check (e.g. a try/except) to catch unexpected cases and tell the user to supply a .csv file.
  3. I suggest listing the dependencies in the README.md.
  4. I noticed there is no check that the .csv file has the column names your analysis expects. I suggest writing a check to verify the input csv has the expected columns.
  5. I would like to suggest including a title for the last three plots in eda.ipynb.
  6. I suggest also showing the relationships between features in your EDA, perhaps with a heatmap.
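The file and column checks suggested in items 2 and 4 could look something like the minimal sketch below. The column names and the function name are hypothetical, chosen only for illustration; the real script would list all of the dataset's columns (the UCI "vinho verde" files are semicolon-delimited).

```python
import csv

# Hypothetical subset of the expected columns; the real analysis would list
# all eleven physicochemical features plus "quality".
EXPECTED_COLUMNS = {"fixed acidity", "volatile acidity", "quality"}


def validate_wine_csv(path):
    """Fail early, with a clear message, before the analysis reads the file."""
    if not path.lower().endswith(".csv"):
        raise ValueError(f"Expected a .csv file, got: {path}")
    with open(path, newline="") as f:
        # The UCI "vinho verde" files are semicolon-delimited.
        header = next(csv.reader(f, delimiter=";"))
    missing = EXPECTED_COLUMNS - set(header)
    if missing:
        raise ValueError(f"Input csv is missing expected columns: {sorted(missing)}")
    return header
```

Calling this at the top of split.py would turn a confusing downstream KeyError into an immediate, readable message for the user.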

Analysis report

  1. I would like to suggest listing the authors of the report.
  2. Your project tells us what it aims to do, but it does not state a research question.
  3. I suggest including figure captions.
  4. I would like to suggest including a reference to a paper that discusses wine quality; I could not find one in your references.
  5. The report also does not give background information or explain the importance of the research question. What prior work has attempted a similar study?

Kind remarks: there may be typos here 😄.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

voremargot commented 2 years ago

Data analysis review checklist

Reviewer: @vorem

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1hr

Review Comments:

OVERALL COMMENTS:

  1. Overall, I really like your project. You did a great job of describing why the problem you were solving is important and how your findings could be used. The quality of wine did not seem important to me at first, but you made it very clear how your work would be impactful to the industry.

  2. The final report would be stronger with more specific details. When reading, it felt like the modelling results were glossed over, even though they are one of the most interesting parts of your project. There were a lot of general statements about the models and methods used, but what I would have found more interesting are the specifics: which tuned hyperparameters were best, how all the model scores compare, and what you decided on for data cleaning. I was not convinced your final model was the best because I had no data telling me it outperformed the other models you tried. Remember that your reader likely has a background in data analysis methods and comes to your report looking for specifics on what was done and what you found.

  3. It would be good to check over your rendered final report: a number of the figures look cut off, and it would be helpful to add captions. Also, you worked so hard on this, so make sure you put your names in the report!

  4. I really liked how the README was set up with clear sections; it made the document clear and easy to follow. In the usage section, adding some code formatting might make it clearer what is code and what is commentary.

  5. You clearly did a lot of work on the model. The model script is very easy to follow and shows all the hyperparameter tuning that went into finding the best model for the problem. It would be really nice to see a summary of this work in the final report; I think it would make it clearer that the model you chose was the best. You put a ton of work into your model, so showcase it more in the report!

  6. Adding more specifics to your summary would be beneficial. As the reader, I want to immediately know some specifics about your EDA, your hyperparameter tuning and, most importantly, your results. At least in the scientific community, the summary/abstract is expected to give a condensed version of what was done, what the findings were, and what conclusions you came to.

SMALL ERRORS I NOTICED:

  1. In the documentation for EDA.py, it would be helpful to mention which figures are output so the reader doesn't have to search through the code to find out.
  2. The preprocess script does not contain any documentation or usage instructions.
  3. You have four files in your raw data folder but only mention two of them in the report/README. What are the others?
  4. The figures in your report need captions.
  5. Figures are cut off in the PDF report and missing in the markdown file.
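For item 2, a minimal sketch of the usage documentation a preprocess script could expose, using argparse. The flag names here are hypothetical, not the project's actual interface (the script may well use docopt or another CLI library instead):

```python
import argparse


def build_parser():
    """Define the command-line interface for the preprocessing step."""
    parser = argparse.ArgumentParser(
        prog="preprocess.py",
        description="Clean the raw wine data and write a processed csv.",
    )
    parser.add_argument("--input", required=True, help="path to the raw .csv file")
    parser.add_argument("--output", required=True, help="path to write the processed .csv file")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"Would preprocess {args.input} -> {args.output}")
```

With this in place, `python preprocess.py --help` prints the usage instructions automatically, which also serves as the in-script documentation the reviewer asked for.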

nobbynguyen commented 2 years ago

Data analysis review checklist

Reviewer: @nobbynguyen

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Overall, I enjoyed reading your interesting data analysis. I am impressed by how you challenged yourselves to work with different algorithms to perform the task. However, in my opinion, there is still room for improvement, as follows:

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

katerinkus commented 2 years ago

Data analysis review checklist

Reviewer: @katerinkus

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Could not complete this part

Estimated hours spent reviewing: 1.75h (changed from 1.25)

Review Comments:

README and folder organization

Replicating the project

The report

Side note regarding Make

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

NikitaShymberg commented 2 years ago

Thank you all for the feedback!

  1. Regarding the comments about missing authors here, here, and here, we have added a list of authors to the report in this commit.
  2. Regarding this comment about needing more info about training the model, we added this info in this commit.
  3. Regarding this comment, we added a heatmap to our EDA in this commit.
  4. Regarding this comment, we proofread the document and fixed the grammar and spelling errors in this commit.
  5. Regarding this comment, we added usage instructions in this commit.
  6. Regarding this comment, we added figure captions in this commit.