UBC-MDS / data-analysis-review-2021


Submission: Group 22: Olympic Medal Htest #26

Open ruben1dlg opened 2 years ago

ruben1dlg commented 2 years ago

Submitting authors: @ming0701 @stevenleung2018 @squisty @ruben1dlg
Submitting authors: @ming0701 @stevenleung2018 @squisty @ruben1dlg

Repository: https://github.com/UBC-MDS/olympic_medal_htest Report link: https://github.com/UBC-MDS/olympic_medal_htest/blob/main/doc/05_final_report.md Abstract/executive summary: For this project we will attempt to make a hypothesis test to answer the question: is the proportion of athletes younger than 25 that win a medal greater than the proportion of athletes of age 25 or older that win a medal? We chose this question and this topic since it is a pop culture subject for which we think strong domain is not really needed. It is important to note that the idea for this project is to be able to wrangle the data and test our hypothesis with the tools and techniques that we know how to use at the moment.

The data used in this project is a public-domain data set of the Olympics with information about athletes, such as nationality, sport/event, year, and age, extracted from the publicly available tidytuesday data sets. Each row in the data set represents one athlete competing in a particular event, including whether the athlete won a medal. The testing results and analysis will be presented in the final report.

To answer this question, we will perform a hypothesis test for the difference in proportions. First, we will carry out an EDA (Exploratory Data Analysis) to get a general idea of what the data looks like, and we will show this work in the EDA document for this project.
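A first EDA step along these lines could compare the medal rate in the two age groups. The sketch below uses toy data; the column names `age` and `medal` are assumptions for illustration, not taken from the actual data set:

```python
import pandas as pd

# Toy stand-in for the athlete-level data; the real set comes from tidytuesday.
df = pd.DataFrame({
    "age": [22, 24, 27, 30, 19, 26, 23, 31],
    "medal": ["Gold", None, "Silver", None, None, "Bronze", None, "Gold"],
})

# Flag medal winners and split athletes into the two age groups of interest.
df["won_medal"] = df["medal"].notna()
df["age_group"] = df["age"].apply(lambda a: "<25" if a < 25 else ">=25")

# Proportion of medal winners in each group.
props = df.groupby("age_group")["won_medal"].mean()
print(props)
```

With the real data, the same two-line groupby gives the observed proportions that feed into the test statistic.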

Given that we are going to perform a hypothesis test, we defined our null and alternative hypotheses as follows:

H0: the proportion of athletes younger than 25 who win a medal is equal to the proportion of athletes aged 25 or older who win a medal.

HA: the proportion of athletes younger than 25 who win a medal is greater than the proportion of athletes aged 25 or older who win a medal.

We will use the simulation/permutation technique, and our test statistic will be the difference in proportions. We will check both the p-value and where the observed test statistic falls on the null distribution to determine whether we can reject our null hypothesis. We will use a significance level of alpha = 0.05, and this will be a one-sided test.
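The simulation/permutation approach described here can be sketched as follows. This is a minimal illustration with simulated toy data, not the authors' actual script; the group sizes, medal rates, and seed are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(2021)

# Toy data: 1 = won a medal, 0 = did not, for each age group.
young = rng.binomial(1, 0.18, size=500)   # athletes younger than 25
older = rng.binomial(1, 0.15, size=500)   # athletes 25 or older

observed = young.mean() - older.mean()    # test statistic: diff in proportions
pooled = np.concatenate([young, older])

# Build the null distribution by repeatedly shuffling the group labels.
null_stats = np.empty(5000)
for i in range(5000):
    rng.shuffle(pooled)
    null_stats[i] = pooled[:500].mean() - pooled[500:].mean()

# One-sided p-value: fraction of permuted statistics at least as extreme
# as the observed difference.
p_value = (null_stats >= observed).mean()
print(f"observed diff = {observed:.3f}, p-value = {p_value:.3f}")
reject = p_value < 0.05
```

Shuffling the pooled outcomes enforces the null hypothesis that age group and medal-winning are unrelated, which is why the permuted differences form the null distribution.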

The EDA performed and reports for the data set can be found in the src folder in this repo.

Editor: @flor14 Reviewer: @christopheralex @Rowansiv @ciciecho-ds @PavelLevchenko

Rowansiv commented 2 years ago

Data analysis review checklist

Reviewer: Rowansiv

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1 hour

Review Comments:

  1. Although the scripts all function as required, the raw data isn't stored anywhere in the repository. The first script, src/01_download_data.py, saves it into the data folder when I run it locally, but there is no raw data in the repository's data folder.

  2. The methods section tells us which packages are used in R and Python, but there isn't much discussion of the methodology: what you did to the data, what you cleaned, and why you are using particular features.

  3. I'm unsure if it's a problem with my local machine, but I am unable to view any of the diagrams in the report, and they cannot be found in any of the repository folders.

  4. I would suggest using more informative and meaningful captions for your images and tables. Simply restating the heading of the graphic doesn't tell us much; captions should help readers understand why you are including the graphic and what it is trying to tell them.

  5. The report itself does not list your group members. I can see from the repository and the README who the contributors are, but it's not explicitly stated in the final report.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

ciciecho-ds commented 2 years ago

Peer Review

Reviewer: @ciciecho-ds

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:


  1. I like that it seems you came up with this question yourselves!
  2. Your introduction is clearly written and easy to understand.
  3. The explanation you provided of your dataset helps others quickly understand what it includes.
  4. The logical progression in your EDA analysis is easy to follow.
  5. Minor typo: in the final report, you said "and placing our observed test statistic on the plot in figure 1"; I think you mean figure 5.
  6. Your data folder doesn't contain anything except a placeholder file. I think it's better to include the actual data here for easy access. What if the URL from which you download the data is modified or no longer available?
  7. Your CONTRIBUTING file might have a formatting issue, because it has a strange box at the top. Also, the review criteria mention that you are supposed to address how to "seek support" in this file. I totally understand that this might not apply to this project, but I think it's good practice to mention it.
  8. It's good practice to include docstrings in your functions in addition to comments.
  9. "Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?" I think you need to address this by stating why you picked this particular statistical test, whether your data met its assumptions, and perhaps why you didn't choose other methods.
  10. Overall, good job!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

christopheralex commented 2 years ago

Data analysis review checklist

Reviewer:

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:

Some of the following comments are personal preferences as a reviewer and are open for discussion -

  1. I could not find the data/raw folder in the repository. Is this because the data is too large? If so, that is completely understandable, but an explanation would have been appropriate.
  2. There are multiple places where resultant files are missing from the repository, such as the plots in the results folder. It would help users who don't wish to reproduce your results but only view the analysis to have such files present in the repository.
  3. Although the code style is consistent throughout, I think you could have used functions in the scripts rather than putting everything under the main function, for better modularisation and code readability.
  4. The data wrangling required for 04_htest.R could have been done in 02_cleaning.py or another module.
  5. I found too many links in https://github.com/UBC-MDS/olympic_medal_htest/blob/main/doc/05_final_report.md under the Exploratory data analysis heading. I feel the unnecessary links could be removed (links related to the data could be put in another section, such as a data reference, and included at the end or in the data attribution).
  6. No figures are rendered in 05_final_report.md.
  7. "The code used to perform the analysis and create this report can be found here: https://github.com/UBC-MDS/olympic_medal_htest" links back to the repository, not to the actual code.
  8. The final report does not discuss the future scope of the project or state whether this is a finished project; there is also no mention of how the analysis could be applied in the real world.
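The modularisation suggested above (functions instead of one monolithic main) could look like the following sketch. The function names and toy inputs are hypothetical, not taken from the authors' scripts:

```python
# Sketch of splitting a monolithic script into small, testable units.
# Function names and data here are hypothetical.

def label_age_group(age, cutoff=25):
    """Classify an athlete's age relative to the cutoff."""
    return "younger" if age < cutoff else "older"

def diff_in_proportions(young_medals, older_medals):
    """Difference in medal-winning proportions between the two groups.

    Each argument is a sequence of 0/1 medal indicators.
    """
    p_young = sum(young_medals) / len(young_medals)
    p_older = sum(older_medals) / len(older_medals)
    return p_young - p_older

def main():
    # main() now only orchestrates; each step can be unit-tested on its own.
    young = [1, 0, 0, 1]
    older = [0, 0, 1, 0]
    print(diff_in_proportions(young, older))

if __name__ == "__main__":
    main()
```

Small pure functions like these can be imported and tested without running the whole pipeline, which is the main payoff of the suggested refactor.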

Overall, great work. I really liked the selected topic and the conclusion given at the end.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

plevchen commented 2 years ago

Data analysis review checklist

Reviewer: @PavelLevchenko

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2 hours

Review Comments:


Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.


  1. 01_download_data.py could be improved by adding one more exception handler for connectivity issues with the link to the original data source. This can be managed with the requests library, for example:

    import requests

    try:
        request = requests.get(url)
        request.raise_for_status()  # raises an HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as req:
        print(req)
        print("Website at the provided url does not exist")
  2. 02_cleaning.py could be improved by removing records flagged as irrelevant during the EDA phase; in this type of work, EDA should be highly interleaved with cleaning. For example, the original Kaggle post the data was taken from included an outlier analysis which revealed that many medal-winning athletes aged 80+ were competing in sports that have since been removed. Those were not quite real sports, but rather art competitions. At least at the cleaning phase you could remove those irrelevant sports.

  3. Inside 03_EDA_olympics.py, more comments could be added about what Figure 1 or Figure 2 is supposed to show. It is difficult to read through the code and try to understand the main message of those plots. In terms of comments, I would appreciate not just a general description that you are showing age vs. height, but why it could be relevant: what is your story, and what ideas are you trying to convey with those plots? If some of the plots became irrelevant after the initial EDA, maybe you should remove them from the final scripts.

  4. In 03_EDA_olympics.py you also use height, weight, and year criteria. I was confused to see those in a hypothesis related to age. I think a more relevant use of any such criteria would be for filtering the data before making statistical inference tests. For your task, I think the more relevant EDA variable is not height/weight but the type of sport: we should expect very young sportsmen in gymnastics, while in other sports such as sailing or curling the age distribution could skew much older.

  5. I found "Modularity" a little bit lacking. 04_htest.R in particular has one huge main function instead of being split into several smaller functions.

  6. I encountered the following error message after running make all (I used the olympic_env environment as recommended; most likely there is a conflict in some of the libraries. In particular, I found altair not quite friendly with make; the seaborn library is much better in my opinion): "Please check if the saving path is correct and is writable: results ... json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 2) make: *** [Makefile:24: results/03_EDA.html] Error 1"

  7. I could not find any png files in the git directory, and I could not obtain them after running the make script, so it is difficult to evaluate the quality of the EDA or read the report without pictures. I think you could at least upload your pictures to the GitHub folder so people could look at them if they encounter an error while trying to reproduce the scripts.

  8. The README.md file contains outdated, confusing information: "The EDA performed and reports for the data set can be found in the src folder in this repo."
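The cleaning step suggested above (dropping the art competitions during the cleaning phase) could be sketched like this. The column name `sport` and the `"Art Competitions"` label are assumptions for illustration, not verified against the actual data set:

```python
import pandas as pd

# Toy athlete-level data; the 'sport' column name is an assumption.
df = pd.DataFrame({
    "sport": ["Athletics", "Art Competitions", "Gymnastics", "Art Competitions"],
    "age": [24, 83, 19, 77],
})

# Drop the art competitions, which were judged events rather than sports,
# so the elderly "athletes" they contain no longer skew the age analysis.
cleaned = df[df["sport"] != "Art Competitions"].reset_index(drop=True)
print(cleaned)
```

After the filter, the implausible 80+ outliers disappear along with the removed events, without touching any genuine athletic records.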

Overall, great work, guys! You selected a very interesting topic; I enjoyed exploring it and refreshing concepts from the 552 course.

stevenleung2018 commented 2 years ago

Six responses to the TA's review and peer reviews:

We greatly appreciate all the time and effort spent reviewing our hypothesis testing project. While all feedback has been thoroughly reviewed and discussed internally, we do not have the time and resources at the time of this writing to respond to every issue. Here are the issues we have addressed based on the great feedback up to this date, December 10, 2021.

Response 1 - issue being addressed:

the licence should be copyrighted to your names not MDS (it is your work)

Point 2 from TA feedback

Link to the commit

Response 2 - issue being addressed:

Question could be restated for clarity, something like: Is age associated with success at the Olympics? Is the idea to compare age categories given all other features being equal? Why was 25 chosen as the cutoff? What kind of visualization do you plan to make?

Point 3 from TA feedback

Link to the commit

Response 3 - issue being addressed:

Write in general not to the TAs ("And I am including it for your convenience") Some of axis labels are unreadable What is your interpretation of the preliminary analysis? Do any predictors stand out as useful?

Point 5 from TA feedback

Link to the commit

Response 4 - Issue being addressed:

Your CONTRIBUTING file might have a format issue because it has a strange box at the top. Also, the above criteria mention that you are supposed to address how to "seek support" in this file. I totally understand that this might not apply to this project but I guess it's good practice to mention it.

Point 7 from Peer review by @ciciecho-ds

Link to the commit

Response 5 - issue being addressed:

I'm confused by the result: if the true diff is -0.025 then it's absolute value is much greater than your significance threshold and it's far outside the null distribution. This seems like the opposite conclusion to the one you make...

Point 4 from TA feedback: Link to the commit

Response 6 - issue being addressed:

Minor typo: in the final report, you said "and placing our observed test statistic on the plot in figure 1", I think you mean figure 5.

Point 5 from Peer review by @ciciecho-ds Link to the commit