Submission: Group 6: Game of Thrones Fatality Prediction

Submitting authors: Ian MacCarthy, Arturo Rey, Thomas Jiang, Sian Zhang Repository: https://github.com/UBC-MDS/GoT-fatality-prediction Report link: https://ianm99.github.io/Milestone-3/got_fatality_predictor_book.html Abstract/executive summary: We build a prediction tool that predicts whether a given character from the Game of Thrones books will survive to the end of the series. To do this, we implement a logistic regression model on a data set containing character information. The model is not able to achieve prediction accuracy any better than 65%. This is likely due to an absence of strong patterns in the plot and cast of characters that would allow us to easily answer such a question.

Editor: @ttimbers Reviewer: Ella Hein, Sid Grover, Yimeng Xia, Rory White

[x] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: rorywhite200

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[ ] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

This is a great, gory project, well done! I encountered a few things that I was confused about that I'll list below, but overall I really enjoyed reading this and I was eventually able to get the scripts running.

The link provided to the online report book ( https://ianm99.github.io/Milestone-3/got_fatality_predictor_book.html) does not contain the final html content. For instance, the link does not contain the cross references to figures but the html files do.
It was difficult to understand the book features and their interpretation. For instance the text mentions that “there is a notable positive correlation with whether the character is alive in book 4”. Do the features book 1, 2, 3, 4 mean "is the character still alive book 4"? This implies that a very accurate predictive model could be possible because for many characters you would already know if they are dead? If I am missing the point here, you could add something in the introduction to make it clear what we are really predicting.
In the heatmap there is a feature called "alive" but this is not mentioned in the feature table. What is this? Also, how many features were dropped in total?
In the src folder there is a random script called "Ians_script.py" it does not appear to do anything but this could be removed.
In the results section the text says “Having found this best performing model” but it is not stated which model is the best performing one. Also, the f1 score obtained on the test data is around 0.7 but only around 0.5 during cross validation. Some interpretation is needed here. These tables could also do with legends that make clear exactly what they are showing.
The scripts provided in the readme file work well but they require manually pressing y to replace each file many times. You could replace the last line of the script with yes | cp -rf book/_build/html/* docs/ to fix this.
Finally, the discussion could include some consideration of how well the model performs compared to similar studies. It looks like the dataset is from a group that used it for the same purpose of predicting game of thrones deaths. How accurate was their model? Are there any improvements you could make to improve the accuracy of yours based on this information?

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: YimengXia

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelides: Does the code adhere to well known language style guides?
[X] Modularity: Is the code suitably abstracted into scripts and functions?
[X] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[X] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[X] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[ ] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Great job on the introduction to Game of Thrones! As a viewer unfamiliar with this novel series, I found your explanation very clear and accessible. It helped me grasp the project and understand each variable without any confusion.
It would enhance the transparency of your work if you could include a link to the source of data, so that viewers could closely examine the information, including details on how the data were collected and a description of each column.
In the Methods and Results section of the report, you mentioned "created a heatmap to illustrate the correlation of each feature with "isAlive"." However, in the plot, I couldn't find a variable named "isAlive"; instead, I noticed you referred to it as "target." It would be helpful to explicitly state that you changed the variable name from "isAlive" to "target" and provide a brief explanation for this modification. This ensures clarity for viewers and helps them understand the reasoning behind the variable name change.
Consider exploring future directions and refining the model through e.g. advanced feature engineering, exploring interaction effects among variables. These enhancements could elevate the predictive accuracy and offer a more nuanced understanding of character fate prediction in Game of Thrones.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: killerninja8 (GH), sid2000 (GH Enterprise)

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[ ] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5h

Review Comments:

Fantastic work, team 6! As someone who's not familiar with GoT at all, I thoroughly enjoyed reading your report (and repo) and found it very clear, concise, and easy to follow. Here's my perspective on the report & repo.

In the README, one of the links to visualise the nb does not render properly. I think you may have copied the contents of a jupyterbooks to a new docs folder.

"To visualize the notebook in a browser, go to the following link: https://ianm99.github.io/Team-6-publishing/index.html"
Providing an explanation for dropping columns may give the readers a clearer picture of your rationale. Perhaps you can make a section titled "Feature Selection".
Further exploring relationships between variables through statistical tests would be interesting (chi-square test, scatterplot matrix)
Since you're using a variety of models, it may be easy to build a new "voting classifier" to potentially improve results.
Try using more complex methods to capture subtle patterns in noisy data. I'd suggest LightGBM or XGBoost because they sequentially improve model performance.

What I enjoyed: great visualisations, use of a variety of models and scoring metrics, presence of PR/ROC curves, and clear writing. Great project, overall!

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer:

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
- Specify that you are using the the MIT license in your license.md file, and add a license section to your readme with a link to the MIT license and license.md file.

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support
- You include a contributing file but do not mention any guidelines in your README, I would include more about the community guidelines there as well.

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelides: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robsutness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[ ] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[x] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing:

Review Comments:

Overall, cool project! I enjoyed reading your analysis and thought you did a great job. Below is a breakdown of my notes as I worked through your project.

General project organization

I really like your script organization. I think all are very well organized and easy to understand the functionality of just from the names.
Your src folder is a bit more difficult to follow. I would suggest following the same naming convention as in the script folder. For example implementing verb function names such as get_short_results instead of short_results, ect.. Also, I think the ian_script.py should have a different name and not be in the src folder.
All of you tests look great except for the model.pkl, which looks misplaced.
Include an explicit raw data folder, it's a bit unclear if the files outside of the processed data folder are actually raw.
It's repetitive to have the tables and figures generated for your report in the results folder as well as in the book folder. I would suggest having them only in results.

Running Analysis

The environment did not successfully build for me, although this very well could be a personal issue.
The docker container built successfully, but the instructions need to be updated as src/calculate_missing_percentage.py does not exist and throws an error.

Report

Add names and affiliations to report, not just README.
Very engaging report! I thought it was to the point and easy to understand, as well as an interesting analysis!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

UBC-MDS / data-analysis-review-2023