Open ruben1dlg opened 2 years ago
Although the scripts all function as required, the raw data isn't stored anywhere in the repository. I believe the first script, src/01_download_data.py, saves it into the data folder, which is what happens when I run it locally, but there is no raw data in the repository's data folder.
The methods section tells us which packages are used in R and Python, but it says little about the methodology: what you did to the data, what you cleaned, and why you are using particular features.
Unsure if it's a problem with my local machine but I am unable to view any of the diagrams that are in the report and they cannot be found in any of the repository folders.
I would suggest using more informative and meaningful captions for your images and tables. Simply rewriting the heading of the graphic doesn't really tell us much. It would help readers to understand why you are inserting each graphic and what it is trying to tell them.
The report itself does not list the members of your group. I can see from the repository and from the README who the contributors are, but it's not explicitly stated in the final report.
This was derived from the JOSE review checklist and the ROpenSci review checklist.
2 hours
Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.
Some of the following comments are personal preferences as a reviewer and are open for discussion:
04_htest.R could have been done in 02_cleaning.py or another module.
Exploratory data analysis heading: I feel unnecessary links could be removed (links related to the data could be moved to another section, such as a data reference at the end or the data attribution).
05_final_report.md: "The code used to perform the analysis and create this report can be found here: https://github.com/UBC-MDS/olympic_medal_htest" links back to the repository, not to the actual code.
Overall, great work. I really liked the selected topic and the conclusion given at the end.
Somewhat misleading
01_download_data.py could be improved by adding one more exception handler for connectivity issues with the link to the original data source. This can be managed with the requests library, for example:
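    import requests

    try:
        # `url` is the data source address already defined in the script
        response = requests.get(url)
        # raise an HTTPError for 4xx/5xx responses instead of ignoring the status
        response.raise_for_status()
    except requests.exceptions.RequestException as req:
        print(req)
        print("Website at the provided url could not be reached")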
The 02_cleaning.py script could be improved by removing some data after the EDA phase. I think in this type of work, EDA should be closely interleaved with the cleaning phase. For example, the original Kaggle post that the data was taken from included an outlier analysis, which revealed that many of the medal-winning athletes aged 80 or older competed in events that have since been removed. Those events were not really sports, but rather art competitions. At the cleaning phase you could at least remove those irrelevant events.
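A minimal sketch of such a filter, using only the standard library. The column name "sport" and the label "Art Competitions" are assumptions about how the data set encodes these events, and drop_excluded_sports is a hypothetical helper, not something taken from the repository:

```python
# Assumption: each row has a "sport" field and art events are labelled
# "Art Competitions". Adjust both to match the actual data set.
EXCLUDED_SPORTS = {"Art Competitions"}


def drop_excluded_sports(rows, excluded=EXCLUDED_SPORTS):
    """Return only the rows whose sport is not in the excluded set."""
    return [row for row in rows if row["sport"] not in excluded]


# Toy rows for illustration only (not real data).
rows = [
    {"name": "A", "age": 81, "sport": "Art Competitions"},
    {"name": "B", "age": 23, "sport": "Gymnastics"},
]
cleaned = drop_excluded_sports(rows)
```

Keeping the excluded labels in one named set makes the cleaning decision visible and easy to justify in the methods section.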
Inside 03_EDA_olympics.py, more comments could be added on what Figure 1 or Figure 2 is supposed to show. It is difficult to read through the code and understand the main message of those plots. In terms of comments, I would appreciate not just a general description (that you are showing age vs. height) but also why it could be relevant: what is your story, and what ideas are you trying to convey with those plots? If some of the plots became irrelevant after the initial EDA, maybe you should remove them from the final scripts.
In 03_EDA_olympics.py you also use height, weight and year criteria. I was confused to see those in a hypothesis related to age. I think a more relevant use of such criteria would be filtering the data before running the statistical inference tests. For your task, I think the type of sport could be more relevant for EDA than height/weight. We should expect very young athletes in gymnastics, while in other sports such as sailing or curling the age distribution could be shifted towards older ages.
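As a rough illustration of that kind of per-sport summary, here is a standard-library sketch. The field names "sport" and "age" are assumptions, median_age_by_sport is a hypothetical helper, and the sample rows are made up:

```python
from collections import defaultdict
from statistics import median


def median_age_by_sport(rows):
    """Group athlete ages by sport and return {sport: median age}."""
    ages = defaultdict(list)
    for row in rows:
        if row["age"] is not None:  # skip athletes with a missing age
            ages[row["sport"]].append(row["age"])
    return {sport: median(values) for sport, values in ages.items()}


# Made-up sample rows for illustration only.
sample = [
    {"sport": "Gymnastics", "age": 17},
    {"sport": "Gymnastics", "age": 19},
    {"sport": "Sailing", "age": 34},
    {"sport": "Sailing", "age": None},
    {"sport": "Sailing", "age": 40},
]
medians = median_age_by_sport(sample)
```

A table like this, computed on the real data, would show quickly whether a single cutoff at 25 means the same thing across sports.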
I found "Modularity" a little bit lacking. At least 04_htest.R has one huge main function that is not split into several smaller functions.
I encountered the following error message after running make all (I used the olympic_env environment as recommended; most likely there is a conflict between some of the libraries. In particular, I found altair not very friendly with make. The Seaborn library is much better in my opinion):
"Please check if the saving path is correct and is writable:
results
...
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 2)
make: *** [Makefile:24: results/03_EDA.html] Error 1
"
I could not find any png files in the saved git directory, and I could not obtain them by running the make script. So it is difficult to evaluate the quality of the EDA or to read the report without pictures. I think you could at least upload your pictures to the GitHub folder so people can look at them if they encounter errors while trying to reproduce the scripts.
The README.md file contains outdated, confusing information: "The EDA performed and reports for the data set can be found in the src folder in this repo."
Overall, great work, guys! You selected a very interesting topic; I enjoyed exploring it and refreshing concepts from the 552 course.
We greatly appreciate all the time and effort spent reviewing our hypothesis testing project. While all feedback has been thoroughly reviewed and discussed internally, we do not have the time and resources at the time of this writing to respond to all issues. Here are the issues that we have addressed so far, as of December 10, 2021, based on the great feedback received.
the licence should be copyrighted to your names not MDS (it is your work)
Point 2 from TA feedback
Question could be restated for clarity, something like: Is age associated with success at the Olympics? Is the idea to compare age categories given all other features being equal? Why was 25 chosen as the cutoff? What kind of visualization do you plan to make?
Point 3 from TA feedback
Write in general, not to the TAs ("And I am including it for your convenience"). Some of the axis labels are unreadable. What is your interpretation of the preliminary analysis? Do any predictors stand out as useful?
Point 5 from TA feedback
Your CONTRIBUTING file might have a format issue because it has a strange box at the top. Also, the above criteria mention that you are supposed to address how to "seek support" in this file. I totally understand that this might not apply to this project but I guess it's good practice to mention it.
Point 7 from Peer review by @ciciecho-ds
I'm confused by the result: if the true diff is -0.025 then its absolute value is much greater than your significance threshold and it's far outside the null distribution. This seems like the opposite conclusion to the one you make...
Point 4 from TA feedback: Link to the commit
Minor typo: in the final report, you said "and placing our observed test statistic on the plot in figure 1", I think you mean figure 5.
Point 5 from Peer review by @ciciecho-ds Link to the commit
Submitting authors: @ming0701 @stevenleung2018 @squisty @ruben1dlg
Repository: https://github.com/UBC-MDS/olympic_medal_htest
Report link: https://github.com/UBC-MDS/olympic_medal_htest/blob/main/doc/05_final_report.md
Abstract/executive summary: For this project we will attempt to perform a hypothesis test to answer the question: is the proportion of athletes younger than 25 that win a medal greater than the proportion of athletes of age 25 or older that win a medal? We chose this question and this topic since it is a pop culture subject for which we think strong domain knowledge is not really needed. It is important to note that the idea for this project is to be able to wrangle the data and test our hypothesis with the tools and techniques that we know how to use at the moment.
The data used in this project is a public domain data set of the Olympics with information about athletes such as nationality, sport/event, year, and age, among others, extracted from the publicly available tidytuesday data sets. Each row in the data set represents information about an athlete competing in a certain event, including whether the athlete won a medal or not. The testing results and analysis will be presented in the final report.
To answer the question mentioned, we will perform a hypothesis test for the difference in proportions. First, we will perform an EDA (Exploratory Data Analysis) to get a general idea of what the data looks like, and we will show this work in the EDA document for this project.
Given that we are going to perform a hypothesis test, we defined our null and alternative hypotheses as follows:
H0: the proportion of athletes younger than 25 that win a medal is equal to the proportion of athletes of age 25 or older that win a medal.
HA: the proportion of athletes younger than 25 that win a medal is greater than the proportion of athletes of age 25 or older that win a medal.
We will use the simulation/permutation technique, and our test statistic will be the difference in proportions. We will check both the p-value and the place where the observed test statistic falls on the null distribution to determine if we can reject our null hypothesis or not. We will use a significance level of alpha = 0.05, and this will be a one-sided test.
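The procedure described above can be sketched roughly as follows. This is a standard-library Python illustration under stated assumptions, not the project's actual implementation (which lives in 04_htest.R). The function names, the number of replicates, the seed and the toy data are all hypothetical:

```python
import random


def diff_in_proportions(medals, is_young):
    """Medal proportion among the young minus that among the older group."""
    young = [m for m, y in zip(medals, is_young) if y]
    older = [m for m, y in zip(medals, is_young) if not y]
    return sum(young) / len(young) - sum(older) / len(older)


def permutation_p_value(medals, is_young, n_reps=2000, seed=0):
    """One-sided p-value for HA: p_young > p_older, by shuffling age labels."""
    rng = random.Random(seed)
    observed = diff_in_proportions(medals, is_young)
    labels = list(is_young)
    at_least_as_large = 0
    for _ in range(n_reps):
        rng.shuffle(labels)  # break any link between age group and medals
        if diff_in_proportions(medals, labels) >= observed:
            at_least_as_large += 1
    return observed, at_least_as_large / n_reps


# Toy data for illustration only: 1 = won a medal, 0 = did not.
medals = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
is_young = [True] * 5 + [False] * 5
observed, p_value = permutation_p_value(medals, is_young)
reject_null = p_value < 0.05  # alpha = 0.05, one-sided
```

Note that alpha is compared with the p-value, not with the observed difference itself. In a one-sided "greater than" test, a negative observed difference sits in the left tail of the null distribution, so the p-value is large and the null hypothesis is not rejected.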
The EDA performed and reports for the data set can be found in the src folder in this repo.
Editor: @flor14
Reviewer: @christopheralex @Rowansiv @ciciecho-ds @PavelLevchenko