UBC-MDS / data-analysis-review-2022


Submission: Group 10: Board Game Rating Predictor #21

Open xFiveRivers opened 1 year ago

xFiveRivers commented 1 year ago

Submitting authors: @xFiveRivers @erictsai1208 @marianagyby @ashwin2507

Repository: https://github.com/UBC-MDS/boardgame_rating_predictor

Report link: https://github.com/UBC-MDS/boardgame_rating_predictor/blob/main/doc/boardgame_rating_predictor_report.Rmd

Abstract/executive summary: Developing a board game is no easy feat. Many parameters and dimensions define not only the overall game, but also how much players enjoy it. Because of this high dimensionality, it is difficult to know what to incorporate during development to create a good product. The goal of our project is to develop a model that predicts user scores from 1 to 10 from a number of features and their combinations, reducing the workload of working out what to incorporate during game development.

Editor: @flor14 Reviewers: Suraporn Puangpanbut, Chester Wang, Natalie Cho, Lauren Zung

Suraporn commented 1 year ago

Data analysis review checklist

Reviewer: @Suraporn

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~1h

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. Great project and teamwork! I enjoyed reading through the project. It is well organized, with appropriate folders and meaningful names, and it is easy to follow along with the README and the report. One thing the authors might consider adding is some background on board games; I have none myself, so parts were a bit difficult to understand, and a short background section would help non-board-game readers follow along more easily.

  2. I like that you have test functions in each source file; they will certainly help when a user calls a function with the wrong inputs. Some good examples are the try/except in download_data.py, the assert statements in preprocess_boardgame_data.py, and the raise ValueError in eda_boardgame.py.

  3. Unfortunately, I cannot reproduce the environment, since the command "conda env -f envboard.yaml" does not work. I think the authors need to use conda env export -f envboard.yaml (or conda env export --from-history -f envboard.yaml) to export the conda environment, and then conda env create --file envboard.yaml to create the environment from the exported YAML file. I would also suggest that the authors actually clone the repository, create the environment from the provided YAML file, and run all the code to make sure it is reproducible.

  4. The commands conda env -f envboard.yaml and conda activate envboard appear together on one line in the Usage section of the README, which makes them difficult to copy, paste, and run directly in the command line. The authors should put them on separate lines so each one is easy to copy and run in the terminal.

  5. I cannot render boardgame_rating_predictor_report.html; it seems some of the code there is wrong. I would like the authors to recheck the report and make it render properly in the GitHub repository, so readers can view the summary report in this file rather than in a raw format such as boardgame_rating_predictor_report.Rmd.
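The input-validation patterns praised in point 2 can be sketched roughly as follows; `read_rating` is a hypothetical example for illustration, not a function from the repository:

```python
# Hypothetical sketch of the validation patterns noted in point 2:
# a try/except for bad input, plus an explicit ValueError for out-of-range values.
def read_rating(value):
    """Validate that a user rating is numeric and between 1 and 10."""
    try:
        rating = float(value)
    except (TypeError, ValueError):
        raise ValueError(f"rating must be numeric, got {value!r}")
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be between 1 and 10, got {rating}")
    return rating
```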
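For points 3 and 4, the corrected Usage commands, each on its own line for easy copy-and-paste, might look like this (assuming the file is named envboard.yaml):

```shell
# Create the conda environment from the provided file, then activate it
conda env create -f envboard.yaml
conda activate envboard
```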
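For point 5, one way to regenerate the HTML locally, assuming the rmarkdown R package is installed, would be:

```shell
# Re-render the report from the repository root (rmarkdown assumed installed)
Rscript -e 'rmarkdown::render("doc/boardgame_rating_predictor_report.Rmd")'
```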

Finally, congratulations to the authors on successfully building a data analysis project from scratch in such a short time. Well done!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Natalie-cho commented 1 year ago

Data analysis review checklist

Reviewer: @Natalie-cho

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: ~1h

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The proposal and analysis are thorough and properly explained.
  2. The code for EDA and model creation was clear and easy to read.
  3. The code to create the environment produced an error (it is missing the argument that specifies creating an env from a file).
  4. Although mentioned in the description within the report, your scatterplot of actual vs. predicted scores could benefit from a legend explicitly stating which line portrays which information.
  5. Although not a major problem, your files may benefit from all being the same type, for consistency (the EDA is an .ipynb file using Python, while the final report is an .Rmd file).

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ChesterAiGo commented 1 year ago

Data analysis review checklist

Reviewer: @ChesterAiGo (Chester Wang)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The instructions look very clear and easy to follow. That said, I encountered the same issues mentioned by the other reviewers when reproducing the analysis.
  2. The repo is a bit too large and takes time to download. Personally, I would recommend using .gitignore to avoid uploading the datasets; users can just fetch the data with the downloader scripts.
  3. The code is of good quality in general and is well documented. A few scripts could be further improved (e.g. in eda_boardgame.py maybe use a instead, and in prediction_model.py maybe split the main() function into more parts for better readability).
  4. The repo is well structured in general and the names are consistent. Nice work!
  5. Extensive and in-depth EDA! I love the visualization part in particular, as the figures are so clear and intuitive.
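The .gitignore suggestion in point 2 could be as simple as this (assuming the datasets live under a data/ directory):

```
# Keep downloaded datasets out of the repo; users regenerate them with the scripts
data/
```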

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

lzung commented 1 year ago

Data analysis review checklist

Reviewer: @lzung (Lauren Zung)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Overall comments: Each of your scripts is well documented, in my opinion (the formatting of the modelling script especially; nice comments throughout). I think your tests are also relevant additions to each script. The report is engaging and contains well-formatted figures and a results table. Really great work overall!

  1. As mentioned by others, the command to install the environment should be conda env create -f envboard.yaml, but this is a minor typo. Since the environment file has been added to your repository, users can just clone the repo and create and activate the environment from that directory, rather than downloading the file separately in steps 1 and 2 under Usage. Unfortunately, I could not install the environment on my system (macOS), so it might be worth mentioning which operating system(s) can replicate your analysis and/or including any system-specific packages that might be needed:

     ResolvePackageNotFound:

    • docopt-ng==0.8.1
    • vl_convert
  2. I noticed that in step 5 under your Usage section, you suggest running the same script twice to save data from two CSVs; perhaps you could include another argument in your download script that reads the data from both links with a single command? This is just a suggestion, but it could streamline the analysis, since both files seem to be processed at the same time. Similarly, there is no mention of the command to run the EDA even though you have a script prepared, so it would be nice to include that as well.

  3. I think the use of some of the helper functions in eda_boardgame.py is slightly unnecessary, since many of them are not actually used more than once in the script (plot_rating_distribution, plot_numeric_feature_distribution, plot_top_10_categories, etc.). I think that if the code was kept inside main(), it could be easier to follow but this is mostly up to preference. If you do want to streamline main(), I would suggest saving the charts directly in the helpers instead of creating intermediate objects (i.e. return save_chart(rating_plot, out_dir + "rating_distribution.png")). In a similar scope, I noticed that plot_top_10_mechanics() and plot_top_10_categories() are pretty similar with only a few settings changed, so you could use a helper here for both grouped bar charts and set some parameters to change depending on whether you're plotting the mechanics or categories.

  4. This is just a question that came to mind, but it seems that you guys are transforming the categorical features with literal_eval in both the EDA and prediction steps. I'm not entirely sure what this does, but maybe this step could be done during your data preprocessing stage (preprocess_boardgame_data.py) so that exported data is already suitable for plotting/analysis?

  5. Along with the Ridge and RandomForestRegressor models, it would be cool to see whether performance improves with the new regularization techniques we covered in 573 (adding lasso or elastic net regression!). Similarly, since you chose $R^2$ as your scoring metric, perhaps MAPE or RMSE could be a nice option to show alongside it, as they are interpretable relative to your target, user rating.
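The single-command download suggested in point 2 could be sketched like this; `download_all` and its pandas-based fetching are hypothetical, not the repo's actual docopt interface:

```python
# Hypothetical sketch: let one invocation of the download script fetch both CSVs.
import os
import pandas as pd

def output_paths(urls, out_dir):
    """Map each URL to a destination path under out_dir, keyed by its file name."""
    return [os.path.join(out_dir, os.path.basename(url)) for url in urls]

def download_all(urls, out_dir):
    """Read every CSV URL with pandas and save it locally; returns the saved paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = output_paths(urls, out_dir)
    for url, path in zip(urls, paths):
        pd.read_csv(url).to_csv(path, index=False)
    return paths
```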
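The shared helper suggested in point 3 might look roughly like this; apart from the two function names quoted from the repo, the helper names, column handling, and altair encoding are assumptions:

```python
# Hypothetical sketch of one helper replacing plot_top_10_mechanics() and
# plot_top_10_categories(); the column and title are the only settings that change.
import pandas as pd

def top_10_counts(df, column):
    """Explode a list-valued column and count its 10 most frequent values."""
    counts = df[column].explode().value_counts().head(10)
    return counts.rename_axis(column).reset_index(name="count")

def plot_top_10(df, column, title):
    """Build a bar chart of the top 10 values of `column` (altair assumed)."""
    import altair as alt  # assumed plotting library
    data = top_10_counts(df, column)
    return alt.Chart(data, title=title).mark_bar().encode(
        x="count:Q",
        y=alt.Y(f"{column}:N", sort="-x"),
    )
```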
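Moving the literal_eval step into preprocessing, as point 4 suggests, could look like this; the column names here are assumptions, not taken from the repo:

```python
# Hypothetical sketch: parse list-like string columns once, during preprocessing,
# so downstream EDA/modelling scripts receive real Python lists.
from ast import literal_eval
import pandas as pd

LIST_COLUMNS = ["boardgamecategory", "boardgamemechanic"]  # assumed column names

def parse_list_columns(df, columns=LIST_COLUMNS):
    """Turn strings like "['War', 'Dice']" into Python lists, leaving lists as-is."""
    df = df.copy()
    for col in columns:
        df[col] = df[col].apply(
            lambda s: literal_eval(s) if isinstance(s, str) else s
        )
    return df
```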
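The extra models and metrics from point 5 could be compared with a sketch like this; the data split, default hyperparameters, and result layout are placeholders, not the repo's actual pipeline:

```python
# Hypothetical sketch: trying Lasso/ElasticNet alongside Ridge, and reporting
# RMSE and MAPE (interpretable in rating units) next to R^2.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.metrics import (
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)
from sklearn.model_selection import train_test_split

def compare_models(X, y, random_state=123):
    """Fit each linear model and score it on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=random_state)
    results = {}
    for name, model in [
        ("ridge", Ridge()),
        ("lasso", Lasso()),
        ("elasticnet", ElasticNet()),
    ]:
        pred = model.fit(X_tr, y_tr).predict(X_te)
        results[name] = {
            "r2": r2_score(y_te, pred),
            "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
            "mape": mean_absolute_percentage_error(y_te, pred),
        }
    return results
```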

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

marianagyby commented 1 year ago

Thank you to @Suraporn, @ChesterAiGo, @lzung, and @Natalie-cho for all your helpful feedback! All of your feedback was valuable and encouraging, but not all of it could be addressed due to time constraints. Here are a few points highlighting the comments we addressed:

  1. Addressing comment by @lzung: "In a similar scope, I noticed that plot_top_10_mechanics() and plot_top_10_categories() are pretty similar with only a few settings changed, so you could use a helper here for both grouped bar charts and set some parameters to change depending on whether you're plotting the mechanics or categories."

    • We refactored the plot code accordingly through multiple commits: PR link
  2. Addressing comment by @lzung: "Similarly, since you chose $R^2$ as your scoring metric, perhaps MAPE or RMSE could be a nice option to show alongside that is interpretable relative to your target, user rating."

  3. Addressing comment by @Suraporn: "I can not render boardgame_rating_predictor_report.html. It seems like some codes there are wrong. I would like the author to recheck the report and make it possible to render in Github repository. It will be good if this html report can render properly, so the reader can read the summary report in this file rather than raw file format like boardgame_rating_predictor_report.Rmd."

  4. Addressing comment by @Natalie-cho and others: "The code to create the environment produced an error (missing an argument to specify creation of an env from a file)."

    • We fixed the typo in the README environment command: Commit link
  5. Addressing comment by @ChesterAiGo: "There are a few scripts that can be further improved (e.g. in eda_boardgame.py maybe use a instead, and in prediction_model.py maybe split the main() function into more parts for better readability."

    • We have refactored the script so the main function is more readable: commit link
  6. Addressing comment by @Suraporn: "The line conda env -f envboard.yaml and 'conda activate envboard' together in Usage section of readme file make it difficult to copy, paste, and run it right away in command line. The author better put it in separated lines, so it is easy to copy and run in terminal command."

    • We have edited the Usage section so the commands are easier to copy and paste: commit link