UBC-MDS / data-analysis-review-2022


Submission: Group 10: Board Game Rating Predictor #21

Open xFiveRivers opened 1 year ago

xFiveRivers commented 1 year ago

Submitting authors: @xFiveRivers @erictsai1208 @marianagyby @ashwin2507

Repository: https://github.com/UBC-MDS/boardgame_rating_predictor

Report link: https://github.com/UBC-MDS/boardgame_rating_predictor/blob/main/doc/boardgame_rating_predictor_report.Rmd

Abstract/executive summary: Developing a board game is no easy feat. Many parameters and dimensions define not only the overall game, but also how much players enjoy it. Because of this high dimensionality, it is difficult to know what to incorporate during development to create a good product. The goal of our project is to develop a model that predicts user scores from 1 to 10 from a number of features and their combinations, reducing the workload of working out what to incorporate during game development.

Editor: @flor14 Reviewers: Suraporn Puangpanbut, Chester Wang, Natalie Cho, Lauren Zung

Suraporn commented 1 year ago

Data analysis review checklist

Reviewer: @Suraporn

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~1h

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. Great project and teamwork! I enjoyed reading through the project. It is well organized, with appropriate folders and meaningful names, and it is easy to follow along with the README and the report. One thing the authors might consider adding is some background on board games; I have none myself, so parts were a bit difficult to understand, and a short background section would help non-board-game readers follow along more easily.

  2. I like that you have test functions in each source file; they will certainly help when a user calls a function with the wrong inputs. Some good examples are the try/except in download_data.py, the assert statements in preprocess_boardgame_data.py, and the raise ValueError in eda_boardgame.py.

  3. Unfortunately, I cannot reproduce the environment, since the command "conda env -f envboard.yaml" does not work. I think the authors need to use conda env export -f envboard.yaml (or conda env export --from-history -f envboard.yaml) to export the conda environment, and then conda env create --file envboard.yaml to create the environment from the exported YAML file. I would also suggest that the authors actually clone the repository, create the environment from the provided YAML file, and run all the code to make sure it is reproducible.

  4. The commands conda env -f envboard.yaml and conda activate envboard appear together on one line in the Usage section of the README, which makes them difficult to copy, paste, and run directly in the command line. The authors should put them on separate lines so each one is easy to copy and run in the terminal.

  5. I cannot render boardgame_rating_predictor_report.html; it seems some of the code there is wrong. I would like the authors to recheck the report and make it render properly in the GitHub repository, so readers can view the summary report in this file rather than in a raw format such as boardgame_rating_predictor_report.Rmd.
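The input-validation patterns praised in point 2 can be sketched roughly as follows; `read_rating` is a hypothetical example for illustration, not a function from the repository:

```python
# Hypothetical sketch of the validation patterns noted in point 2:
# a try/except for bad input, plus an explicit ValueError for out-of-range values.
def read_rating(value):
    """Validate that a user rating is numeric and between 1 and 10."""
    try:
        rating = float(value)
    except (TypeError, ValueError):
        raise ValueError(f"rating must be numeric, got {value!r}")
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    return rating
```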
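For points 3 and 4, the corrected Usage commands, each on its own line for easy copy-and-paste, might look like this (assuming the file is named envboard.yaml):

```shell
# Create the conda environment from the provided file, then activate it
conda env create -f envboard.yaml
conda activate envboard
```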
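For point 5, one way to regenerate the HTML locally, assuming the rmarkdown R package is installed, would be:

```shell
# Re-render the report from the repository root (rmarkdown assumed installed)
Rscript -e 'rmarkdown::render("doc/boardgame_rating_predictor_report.Rmd")'
```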

Finally, congratulations to the authors on successfully building a data analysis project from scratch in such a short time. Well done!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Natalie-cho commented 1 year ago

Data analysis review checklist

Reviewer: @Natalie-cho

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~1h

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The proposal and analysis are thorough and properly explained.
  2. The code for EDA and model creation was clear and easy to read.
  3. The code to create the environment produced an error (it is missing the argument that specifies creating an env from a file).
  4. Although mentioned in the description within the report, your scatterplot of actual vs. predicted scores could benefit from a legend explicitly stating which line portrays which information.
  5. Although not a major problem, your files may benefit from all being the same type, for consistency (the EDA is an .ipynb file using Python, while the final report is an .Rmd file).

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ChesterAiGo commented 1 year ago

Data analysis review checklist

Reviewer: @ChesterAiGo (Chester Wang)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The instructions look very clear and easy to follow. That said, I encountered the same issues mentioned by the other reviewers when reproducing the analysis.
  2. The repo is a bit too large and takes time to download. Personally, I would recommend using .gitignore to avoid uploading the datasets; users can just fetch the data with the downloader scripts.
  3. The code is of good quality in general and is well documented. A few scripts could be further improved (e.g. in eda_boardgame.py maybe use a instead, and in prediction_model.py maybe split the main() function into more parts for better readability).
  4. The repo is well structured in general and the names are consistent. Nice work!
  5. Extensive and in-depth EDA! I love the visualization part in particular, as the figures are so clear and intuitive.
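The .gitignore suggestion in point 2 could be as simple as this (assuming the datasets live under a data/ directory):

```
# Keep downloaded datasets out of the repo; users regenerate them with the scripts
data/
```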

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

lzung commented 1 year ago

Data analysis review checklist

Reviewer: @lzung (Lauren Zung)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Overall comments: Each of your scripts is well documented, in my opinion (the formatting of the modelling script especially; nice comments throughout). I think your tests are also relevant additions to each script. The report is engaging and contains well-formatted figures and a results table. Really great work overall!

  1. As mentioned by others, the command to install the environment should be conda env create -f envboard.yaml, but this is a minor typo. Since the environment file has been added to your repository, users can just clone the repo and create and activate the environment from that directory, rather than downloading the file separately in steps 1 and 2 under Usage. Unfortunately, I could not install the environment on my system (macOS), so it might be worth mentioning which operating system(s) can replicate your analysis and/or including any system-specific packages that might be needed:

     ResolvePackageNotFound:

    • docopt-ng==0.8.1
    • vl_convert
  2. I noticed that in step 5 under your Usage section, you suggest running the same script twice to save data from two CSVs; perhaps you could include another argument in your download script that reads the data from both links with a single command? This is just a suggestion, but it could streamline the analysis, since both files seem to be processed at the same time. Similarly, there is no mention of the command to run the EDA even though you have a script prepared, so it would be nice to include that as well.

  3. I think the use of some of the helper functions in eda_boardgame.py is slightly unnecessary, since many of them are not actually used more than once in the script (plot_rating_distribution, plot_numeric_feature_distribution, plot_top_10_categories, etc.). I think that if the code was kept inside main(), it could be easier to follow but this is mostly up to preference. If you do want to streamline main(), I would suggest saving the charts directly in the helpers instead of creating intermediate objects (i.e. return save_chart(rating_plot, out_dir + "rating_distribution.png")). In a similar scope, I noticed that plot_top_10_mechanics() and plot_top_10_categories() are pretty similar with only a few settings changed, so you could use a helper here for both grouped bar charts and set some parameters to change depending on whether you're plotting the mechanics or categories.

  4. This is just a question that came to mind, but it seems that you guys are transforming the categorical features with literal_eval in both the EDA and prediction steps. I'm not entirely sure what this does, but maybe this step could be done during your data preprocessing stage (preprocess_boardgame_data.py) so that exported data is already suitable for plotting/analysis?

  5. Along with the Ridge and RandomForestRegressor models, it would be cool to see whether performance improves with the new regularization techniques we covered in 573 (adding lasso or elastic net regression!). Similarly, since you chose $R^2$ as your scoring metric, perhaps MAPE or RMSE could be a nice option to show alongside it, as they are interpretable relative to your target, user rating.
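The single-command download suggested in point 2 could be sketched like this; `download_all` and its pandas-based fetching are hypothetical, not the repo's actual docopt interface:

```python
# Hypothetical sketch: let one invocation of the download script fetch both CSVs.
import os
import pandas as pd

def output_paths(urls, out_dir):
    """Map each URL to a destination path under out_dir, keyed by its file name."""
    return [os.path.join(out_dir, os.path.basename(url)) for url in urls]

def download_all(urls, out_dir):
    """Read every CSV URL with pandas and save it locally; returns the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = output_paths(urls, out_dir)
    for url, path in zip(urls, paths):
        pd.read_csv(url).to_csv(path, index=False)
    return paths
```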
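The shared helper suggested in point 3 might look roughly like this; apart from the two function names quoted from the repo, the helper names, column handling, and altair encoding are assumptions:

```python
# Hypothetical sketch of one helper replacing plot_top_10_mechanics() and
# plot_top_10_categories(); the column and title are the only settings that change.
import pandas as pd

def top_10_counts(df, column):
    """Explode a list-valued column and count its 10 most frequent values."""
    counts = df[column].explode().value_counts().head(10)
    return counts.rename_axis(column).reset_index(name="count")

def plot_top_10(df, column, title):
    """Build a bar chart of the top 10 values of `column` (altair assumed)."""
    import altair as alt  # assumed plotting library
    data = top_10_counts(df, column)
    return alt.Chart(data, title=title).mark_bar().encode(
        x="count:Q",
        y=alt.Y(f"{column}:N", sort="-x"),
    )
```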
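Moving the literal_eval step into preprocessing, as point 4 suggests, could look like this; the column names here are assumptions, not taken from the repo:

```python
# Hypothetical sketch: parse list-like string columns once, during preprocessing,
# so downstream EDA/modelling scripts receive real Python lists.
from ast import literal_eval
import pandas as pd

LIST_COLUMNS = ["boardgamecategory", "boardgamemechanic"]  # assumed column names

def parse_list_columns(df, columns=LIST_COLUMNS):
    """Turn strings like "['War', 'Dice']" into Python lists, leaving lists as-is."""
    df = df.copy()
    for col in columns:
        df[col] = df[col].apply(
            lambda s: literal_eval(s) if isinstance(s, str) else s
        )
    return df
```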
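The extra models and metrics from point 5 could be compared with a sketch like this; the data split, default hyperparameters, and result layout are placeholders, not the repo's actual pipeline:

```python
# Hypothetical sketch: trying Lasso/ElasticNet alongside Ridge, and reporting
# RMSE and MAPE (interpretable in rating units) next to R^2.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split

def compare_models(X, y, random_state=123):
    """Fit each linear model and score it on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=random_state)
    results = {}
    for name, model in [
        ("ridge", Ridge()),
        ("lasso", Lasso()),
        ("elasticnet", ElasticNet()),
    ]:
        pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {
            "r2": r2_score(y_te, pred),
            "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
            "mape": mean_absolute_percentage_error(y_te, pred),
        }
    return results
```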

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

marianagyby commented 1 year ago

Thank you to @Suraporn, @ChesterAiGo, @lzung, and @Natalie-cho for all your helpful feedback! All of your feedback was valuable and encouraging, but not all of it could be addressed due to time constraints. Here are a few points highlighting the comments we addressed:

  1. Addressing comment by @lzung: "In a similar scope, I noticed that plot_top_10_mechanics() and plot_top_10_categories() are pretty similar with only a few settings changed, so you could use a helper here for both grouped bar charts and set some parameters to change depending on whether you're plotting the mechanics or categories."

    • We refactored the plot code accordingly through multiple commits: PR link
  2. Addressing comment by @lzung: "Similarly, since you chose $R^2$ as your scoring metric, perhaps MAPE or RMSE could be a nice option to show alongside that is interpretable relative to your target, user rating."

  3. Addressing comment by @Suraporn: "I can not render boardgame_rating_predictor_report.html. It seems like some codes there are wrong. I would like the author to recheck the report and make it possible to render in Github repository. It will be good if this html report can render properly, so the reader can read the summary report in this file rather than raw file format like boardgame_rating_predictor_report.Rmd."

  4. Addressing comment by @Natalie-cho and others: "The code to create the environment produced an error (missing an argument to specify creation of an env from a file)."

    • We fixed the typo in the README environment command: Commit link
  5. Addressing comment by @ChesterAiGo: "There are a few scripts that can be further improved (e.g. in eda_boardgame.py maybe use a instead, and in prediction_model.py maybe split the main() function into more parts for better readability."

    • We have refactored the script so the main function is more readable: commit link
  6. Addressing comment by @Suraporn: "The line conda env -f envboard.yaml and 'conda activate envboard' together in Usage section of readme file make it difficult to copy, paste, and run it right away in command line. The author better put it in separated lines, so it is easy to copy and run in terminal command."

    • We have edited the Usage section so the commands are easier to copy and paste: commit link