UBC-MDS / data-analysis-review-2022


Submission: Group 8: Pokemon Type Predictor #5

Open WilfHass opened 1 year ago

WilfHass commented 1 year ago

Submitting authors: @WilfHass @carolinetang77 @missarah96 @vincentho32

Repository: https://github.com/UBC-MDS/pokemon-type-predictor

Report link: https://github.com/UBC-MDS/pokemon-type-predictor/blob/main/doc/final_report.md

Abstract/executive summary: In this project, we attempt to build a classification model using two algorithms: k-Nearest Neighbours and a Support Vector Machine. We use this classification model to predict a Pokemon's type (of which there are 18 possible types) from the other stats (such as attack, defense, etc.) that it has. We use accuracy as the metric to score our models since there is no particular detriment to false positives or negatives, but we do want to know how many of the unknown Pokemon will be predicted correctly. On the unseen test data, the k-NN model predicted 60% of the new Pokemon correctly while the SVC model predicted 67% correctly. Since these are not very accurate results, we recommend trying different estimators to fill up that Pokedex!

The data is found here. The data was cleaned by HansAnonymous and originally developed by simsketch. The original data can be found in the Pokemon database. All rights belong to their respective owners. Each row in the dataset contains a different Pokemon with various attributes. The attributes are measurements of the base Pokemon, such as attack, speed or defense. A Pokemon's type is closely related to the other attributes it possesses. For example, a rock-type Pokemon is likely to have higher defensive statistics (such as defense or health points) as well as rock-type abilities. It is also most likely to be coloured grey.
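As an illustration of the modelling setup the abstract describes (two estimators compared on test-set accuracy), a minimal scikit-learn sketch follows. It uses synthetic stand-in data with 18 classes to mirror the 18 types; it is not the authors' code, and all parameters are illustrative defaults.

```python
# Minimal sketch of the k-NN vs. SVC comparison described in the abstract.
# Synthetic stand-in data; the real project loads Pokemon base stats.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# 18 classes to mirror the 18 Pokemon types; 6 numeric features as stand-ins
# for base stats such as attack, defense, and speed.
X, y = make_classification(n_samples=900, n_features=6, n_informative=6,
                           n_redundant=0, n_classes=18,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "k-NN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVC": make_pipeline(StandardScaler(), SVC()),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Accuracy, as in the report: the fraction of types predicted correctly.
    print(name, round(model.score(X_test, y_test), 2))
```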

Editor: @flor14 Reviewers: Revathy Ponnambalam, Ashwin Babu, Zilong Yi, Tony Zoght

revathyponn commented 1 year ago

Data analysis review checklist

Reviewer: Revathy Ponnambalam (revathyponn)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. Hello Gym trainers, I would like to inform you that going through your group project was fun, and the writeup made it thoroughly enjoyable to read. I liked how the README file was structured in a way that makes it easy for anyone new to follow along and replicate the project.
  2. It is mentioned that the dependencies are listed in 'env-poke-type-pred.yaml', and the code to install the environment is also given, but the list of dependencies itself is not included in the README file.
  3. A reason for choosing the specific models could have been added.
  4. In preprocessing.py, a few columns are dropped while others are kept. One or two sentences about this decision could be added.
  5. In the raw CSV file, the column ABILITY 2 has 50% missing values and the ABILITY Hidden column has 25% missing values. In the preprocessing, the missing values in these columns are filled with the values from the ABILITY1 column. A few sentences of detail about this would help the reader understand it further.
  6. In the README file, the pipeline image is not rendered properly.
  7. Overall, the scripts are well written and readable, the docstrings are complete, and the folders/subfolders are properly structured. Perfect!!
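As an aside on item 5, the imputation described there might look something like the following minimal pandas sketch. Column names are taken from the review's wording; the project's actual preprocessing.py may differ.

```python
import pandas as pd

# Toy frame mirroring the ability columns discussed in the review;
# the real data set has many more rows and columns.
df = pd.DataFrame({
    "ABILITY1": ["Overgrow", "Blaze", "Torrent"],
    "ABILITY 2": [None, "Solar Power", None],
    "ABILITY Hidden": ["Chlorophyll", None, None],
})

# Fill missing secondary/hidden abilities with the primary ability,
# as the reviewer describes the preprocessing doing.
for col in ["ABILITY 2", "ABILITY Hidden"]:
    df[col] = df[col].fillna(df["ABILITY1"])

print(df)
```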

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ashwin2507 commented 1 year ago

Data analysis review checklist

Reviewer: Ashwin Babu (ashwin2507)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. The Pipeline section in the Readme.md needs to be fixed as I can see some rendering issues.
  2. The documentation is awesome, the project is easy to follow and the objective of the project is clear. I would add a few important charts/plots in the EDA section of your final report.
  3. The final report is well structured and very convenient for the reader to understand. One addition I would like to see is more information in the results section about hyperparameter tuning. I see that you have discussed the k-NN hyperparameters; I would suggest including the same for SVC as well. Since the training scores are 100% in both cases, hyperparameter tuning would be a crucial insight for the final report.
  4. Since the results are somewhat on the lower end, a reason for choosing these specific models could have been added, or trying other models such as Naive Bayes or logistic regression could have given more insight for a multiclass prediction problem.
  5. Overall the code, organization of files, and script usage are great. The main goal is clearly stated and the end result has been perfectly worded, easy to follow instructions, and a very interesting project.
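The SVC tuning suggested in item 3 could be summarised in the report from a grid search along these lines. The parameter grid and data set here are illustrative stand-ins, not the project's actual choices.

```python
# Sketch of SVC hyperparameter tuning with cross-validation,
# using the bundled iris data as a stand-in for the Pokemon stats.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

pipe = make_pipeline(StandardScaler(), SVC())
# Illustrative grid over the two SVC knobs most commonly tuned.
param_grid = {"svc__C": [0.1, 1, 10, 100],
              "svc__gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

# Best settings and cross-validated accuracy, ready to quote in a report.
print(search.best_params_, round(search.best_score_, 2))
```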

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

tzoght commented 1 year ago

Data analysis review checklist

Reviewer: Tony Zoght (tzoght)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing:

2.5 hours (including the time to write this review and build the code and run the analysis)

Review Comments:

First of all, I really enjoyed reading your report, and I think you did a great job. I have a few comments and suggestions that I hope will help you improve your report and your overall Git repository hygiene (see the general comments below).

General comments

  1. There are missing links in the README.md (https://github.com/UBC-MDS/pokemon-type-predictor#pipeline) and minor formatting issues in the report.
  2. Add more details in the README.md about the features of the data; you don't want the reader to have to open the CSV file to see what the columns are.
  3. For readers who are not familiar with conda (maybe they use pip and virtualenv), it is useful if this project indicates that it uses conda as a prerequisite to build the project.
  4. Adding convenient links from the README.md to license, contributing, code of conduct would be helpful (just like you've done for the report).
  5. Inline comments in the code would help the reader understand what's going on, especially in the model training code (minor point).

About the report

  1. The introduction is very clear and well written, but the data story is a bit lacking in details since the README does not talk much about the data and the features.
  2. I think adding a bit more detail on the choice of scoring metric would improve this report. For example, why did you choose accuracy over other metrics?
  3. The confusion matrix is distracting: with multi-class classification it is hard to read while you are trying to follow the conclusions. What value does it add to the report? I think you could just include the accuracy score and the classification report.
  4. In your discussion/conclusion, you mention that the model has 100% training accuracy and highlight that this is a sign of overfitting. You then mention that a model with 1 neighbor has much lower accuracy, with no prior mention that you were doing hyperparameter tuning (it reads almost like an afterthought). A bit more clarity in the conclusion would make this report even better.
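To illustrate items 2 and 4 together, a small scikit-learn sketch (using the bundled digits data as a stand-in, not the Pokemon set) shows how a per-class classification report exposes what a single accuracy number hides, and why 1-NN trivially scores near 100% on its own training data.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With n_neighbors=1, each training point is its own nearest neighbour,
# so training accuracy is near-perfect by construction: the overfitting
# symptom the review describes.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("train accuracy:", knn.score(X_train, y_train))

y_pred = knn.predict(X_test)
print("test accuracy:", round(accuracy_score(y_test, y_pred), 2))
# Per-class precision/recall: more informative than bare accuracy
# when classes are imbalanced or some classes are much harder.
print(classification_report(y_test, y_pred))
```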

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ZilongYi commented 1 year ago

Data analysis review checklist

Reviewer: Zilong Yi

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

  1. I really enjoyed reading through your report, especially the motivation and the introduction to the background and features. Well done!!
  2. I noticed that the report link in the README file points to the .md file. Personally, I would prefer a webpage; you can generate an HTML file and then use it to create an HTML preview.
  3. A Makefile is included in the repo, but it does not show up in the README file. It would be nice to mention it as a one-step way to reproduce the analysis.
  4. I agree with the models you chose. However, the size of the data set is one concern I have. You are doing multiclass classification, and from the baseline results I can see there are only around 20 cases for each class. If possible, I suggest acquiring a larger data set, as 20 examples per class might not be enough for classification in ML.
  5. In the report, it was mentioned that the total column is dropped because it is merely the sum of the other numeric features. Given that the other numeric features are highly correlated, is it possible to do classification with total on its own, or to include it as is, even though total might end up with a dominant coefficient? Some more clarification of why this column is dropped would make the report even better.
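On the total column in item 5: whether it is an exact sum is easy to verify, and if it is, it carries no information beyond the other stats and only adds collinearity. A hypothetical check follows; the stat column names are assumed for illustration, not taken from the actual data set.

```python
import pandas as pd

# Toy stats table; column names are assumed stand-ins for the usual
# Pokemon base stats, not the project's real schema.
df = pd.DataFrame({
    "hp": [45, 60, 80], "attack": [49, 62, 82], "defense": [49, 63, 83],
    "sp_atk": [65, 80, 100], "sp_def": [65, 80, 100], "speed": [45, 60, 80],
})
df["total"] = df.sum(axis=1)

stats = ["hp", "attack", "defense", "sp_atk", "sp_def", "speed"]
# If this prints True, 'total' is an exact linear combination of the
# other features, so dropping it loses nothing.
print((df["total"] == df[stats].sum(axis=1)).all())  # True by construction
```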

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

vincentho32 commented 1 year ago

Group 8 - Improvements based on the feedback received:

Feedback from Revathy

  1. Add reason for choosing specific models (in report and README) (commit link)

  2. Update the dependency list (commit link)

Feedback from Ashwin

  1. Add more information about hyperparameter tuning in report (commit link)

  2. Show more EDA plots in the final report (commit link)

Feedback from Tony

  1. Add more details in the README.md about the features of the data (commit link)

  2. Add reason why we chose accuracy as our scoring metric (commit link)

  3. Add links to license, code of conduct (commit link)

Feedback from Zilong

  1. Add webpage link of final report in README (commit link)