UBC-MDS / data-analysis-review-2023


Submission: Group 6: Game of Thrones Fatality Prediction #12

Open AReyH opened 7 months ago

AReyH commented 7 months ago

Submitting authors: Ian MacCarthy, Arturo Rey, Thomas Jiang, Sian Zhang
Repository: https://github.com/UBC-MDS/GoT-fatality-prediction
Report link: https://ianm99.github.io/Milestone-3/got_fatality_predictor_book.html
Abstract/executive summary: We built a tool that predicts whether a given character from the Game of Thrones books will survive to the end of the series. To do this, we fit a logistic regression model on a data set containing character information. The model is not able to achieve prediction accuracy better than 65%. This is likely due to an absence of strong patterns in the plot and cast of characters that would allow us to easily answer such a question.

Editor: @ttimbers
Reviewers: Ella Hein, Sid Grover, Yimeng Xia, Rory White

rorywhite200 commented 7 months ago

Data analysis review checklist

Reviewer: rorywhite200

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

This is a great, gory project, well done! I encountered a few things that I was confused about that I'll list below, but overall I really enjoyed reading this and I was eventually able to get the scripts running.

  1. The link provided to the online report book (https://ianm99.github.io/Milestone-3/got_fatality_predictor_book.html) does not contain the final HTML content. For instance, the page at that link is missing the cross-references to figures, while the locally built HTML files include them.
  2. It was difficult to understand the book features and their interpretation. For instance, the text mentions that “there is a notable positive correlation with whether the character is alive in book 4”. Do the features book 1, 2, 3, 4 mean "is the character still alive in book 4"? If so, this implies that a very accurate predictive model should be possible, because for many characters you would already know whether they are dead. If I am missing the point here, you could add something to the introduction to make clear what is really being predicted.
  3. In the heatmap there is a feature called "alive", but it is not mentioned in the feature table. What is this feature? Also, how many features were dropped in total?
  4. In the src folder there is a stray script called "Ians_script.py". It does not appear to do anything, so it could be removed.
  5. In the results section the text says “Having found this best performing model”, but it is not stated which model is the best performing one. Also, the F1 score obtained on the test data is around 0.7 but only around 0.5 during cross-validation; some interpretation is needed here. These tables could also use legends that make clear exactly what they are showing.
  6. The scripts provided in the README work well, but they require manually pressing `y` many times to confirm each file replacement. You could change the last line of the script to `yes | cp -rf book/_build/html/* docs/` to fix this.
  7. Finally, the discussion could include some consideration of how well the model performs compared to similar studies. It looks like the dataset comes from a group that used it for the same purpose of predicting Game of Thrones deaths. How accurate was their model? Are there any changes you could make, based on this information, to improve the accuracy of yours?
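The fix suggested in item 6 can be sketched as a minimal shell example. The `book/_build/html` and `docs` paths are taken from the command above; the placeholder build tree here is only for illustration:

```shell
# Stand-in for the real jupyter-book output tree.
mkdir -p book/_build/html docs
echo "<html></html>" > book/_build/html/index.html

# Non-interactive copy: `yes` answers every overwrite prompt with "y",
# so the publish script no longer pauses to confirm each file.
yes | cp -rf book/_build/html/* docs/
```

With `-f` and piped `yes`, the copy runs to completion even when `cp` is aliased to the interactive `cp -i`.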

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

YimengXia commented 7 months ago

Data analysis review checklist

Reviewer: YimengXia

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5 hours

Review Comments:

  1. Great job on the introduction to Game of Thrones! As a reader unfamiliar with the novel series, I found your explanation very clear and accessible. It helped me grasp the project and understand each variable without any confusion.

  2. It would enhance the transparency of your work to include a link to the data source, so that readers could examine it closely, including details on how the data were collected and a description of each column.

  3. In the Methods and Results section of the report, you mention that you "created a heatmap to illustrate the correlation of each feature with 'isAlive'". However, I couldn't find a variable named "isAlive" in the plot; instead, it is labelled "target". It would be helpful to state explicitly that you renamed "isAlive" to "target" and briefly explain why. This ensures clarity for readers and helps them understand the reasoning behind the change.

  4. Consider exploring future directions and refining the model through, for example, more advanced feature engineering or interaction effects among variables. These enhancements could improve predictive accuracy and offer a more nuanced understanding of character fate prediction in Game of Thrones.
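The interaction-effect idea in point 4 could be sketched with scikit-learn's `PolynomialFeatures`. This is a minimal illustration on synthetic binary features, not the project's actual character table:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Toy stand-in for the character data: two binary features whose
# *interaction* (both equal to 1) drives the outcome.
X = rng.integers(0, 2, size=(500, 2))
y = (X[:, 0] & X[:, 1]).astype(int)

# interaction_only=True adds pairwise products (x1*x2) but no squares,
# letting logistic regression capture the AND-style pattern.
model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(),
)
model.fit(X, y)
print(model.score(X, y))
```

Plain logistic regression on `X` alone cannot separate this outcome; the added interaction column makes it linearly separable.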


killerninja8 commented 7 months ago

Data analysis review checklist

Reviewer: killerninja8 (GH), sid2000 (GH Enterprise)

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1.5h

Review Comments:

Fantastic work, team 6! As someone who's not familiar with GoT at all, I thoroughly enjoyed reading your report (and repo) and found it very clear, concise, and easy to follow. Here's my perspective on the report & repo.

  1. In the README, one of the links meant to visualise the notebook does not render properly. I think you may have copied the contents of a Jupyter Book build into a new docs folder.

    "To visualize the notebook in a browser, go to the following link: https://ianm99.github.io/Team-6-publishing/index.html"

  2. Providing an explanation for dropping columns would give readers a clearer picture of your rationale. Perhaps you could add a section titled "Feature Selection".
  3. Further exploring relationships between variables through statistical tests (e.g. a chi-square test) or visualisations such as a scatterplot matrix would be interesting.
  4. Since you're using a variety of models, it might be straightforward to build a voting classifier to potentially improve results.
  5. Try more complex methods to capture subtle patterns in noisy data. I'd suggest LightGBM or XGBoost, because they build trees sequentially, with each tree correcting the errors of the previous ones.

What I enjoyed: great visualisations, use of a variety of models and scoring metrics, presence of PR/ROC curves, and clear writing. Great project, overall!



ella-irene commented 7 months ago

Data analysis review checklist

Reviewer: ella-irene

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

Overall, cool project! I enjoyed reading your analysis and thought you did a great job. Below is a breakdown of my notes as I worked through your project.

General project organization

Running Analysis

Report
