Submission: GROUP 11: Horror Movie Revenue and Ratings

Submitting authors: J99thoms, Lorraine97, AguilarRaul, Hongjian-Sam-Li

Repository: https://github.com/UBC-MDS/horror_movies

Report link: https://github.com/UBC-MDS/horror_movies/blob/main/notebooks/EDA_keys.ipynb

Abstract/executive summary:

Our inferential research question is whether 'high' rated horror movies have a larger median revenue than 'low' rated horror movies (among those with non-zero revenue).

Considering only horror movies with non-zero revenue, let $R_h$ be the population median revenue (in USD) of horror movies with average ratings greater than the median average rating of horror movies, let $R_l$ be the population median revenue (in USD) of horror movies with average ratings no greater than the median average rating of horror movies, and let $\delta = R_h - R_l$ be the difference in population median revenues. Then our hypotheses are:

$\text{H}_0:\ \delta = 0$ and $\text{H}_a:\ \delta > 0.$

Our significance level will be the standard $\alpha = 0.05$.

Our test statistic will be the difference in sample median revenues, $\delta^* = \hat{R}_h - \hat{R}_l$.
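
As a rough illustration of how this test statistic could be computed, here is a minimal Python sketch using only the standard library. It is not the code from the repository: the toy records and the function name `median_revenue_difference` are hypothetical, and it assumes the rating cutoff is the median rating among movies with non-zero revenue (the text above leaves this detail open).

```python
from statistics import median

# Hypothetical toy records: (average rating, revenue in USD).
movies = [
    (7.1, 5_000_000),
    (6.4, 0),          # zero revenue, so this movie is excluded
    (5.2, 1_200_000),
    (6.8, 9_000_000),
    (4.9, 300_000),
    (5.9, 2_500_000),
]


def median_revenue_difference(records):
    """Return (median revenue of 'high' group, median revenue of 'low' group, difference).

    Assumption: the cutoff is the median rating of the non-zero-revenue movies.
    'High' means rating strictly above the cutoff; 'low' means rating at or below it.
    """
    nonzero = [(rating, revenue) for rating, revenue in records if revenue > 0]
    cutoff = median(rating for rating, _ in nonzero)
    high = [revenue for rating, revenue in nonzero if rating > cutoff]
    low = [revenue for rating, revenue in nonzero if rating <= cutoff]
    r_high = median(high)
    r_low = median(low)
    return r_high, r_low, r_high - r_low


result = median_revenue_difference(movies)
```

Note that `statistics.median` raises `StatisticsError` on an empty group, so a real script would need to handle the case where every rating falls on one side of the cutoff.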

Since we are doing inference about the median, a CLT-based approach is not applicable here. We will therefore use a simulation-based approach for this hypothesis test, specifically a permutation test. This assumes that our sample is representative of our population of interest.
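
The permutation test described above can be sketched as follows. This is a minimal standard-library Python sketch under stated assumptions, not the implementation in the repository: the function name `permutation_test_median_diff`, the number of permutations, and the add-one p-value correction are all illustrative choices.

```python
import random
from statistics import median


def permutation_test_median_diff(high, low, n_permutations=5000, seed=0):
    """One-sided permutation test for H_a: median(high) - median(low) > 0.

    Returns the observed difference in sample medians and an approximate p-value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = median(high) - median(low)
    pooled = list(high) + list(low)
    n_high = len(high)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H_0 the group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        diff = median(pooled[:n_high]) - median(pooled[n_high:])
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction (an illustrative choice) keeps the p-value above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value
```

With the hypotheses above, one would reject $\text{H}_0$ when the returned p-value falls below $\alpha = 0.05$.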

Editor: @flor14

Reviewer: Roan Rain, Ritisha Sharma, Gaoxiang Wang

[ ] I agree to abide by MDS's Code of Conduct during the review process and in maintaining my package should it be accepted.

Data analysis review checklist

Reviewer: ritisha2000

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

The documentation is great, the project is easy to follow and the objective of the project is clear. There are plenty of comments in the scripts, which make the analysis easy to follow. There is a good variety of plots for the EDA.
Some minor edits you can make include renaming certain files for clarity. For example, you can change “down_data.R” to “download_data.R”, and rename "EDA_keys" so it is clear that it is the report. The x-axis of the plot in the EDA and report files is cramped, which makes it a bit difficult to read.
There is a folder for images; I wondered whether that belongs in the results. Also, maybe rename the "notebooks" folder to "docs".
The analysis report could be more thorough. You could explain why it would be important to answer this research question in the real-world context.
The tables and plots are great but the results could be more clearly stated. I see the hypotheses but not a conclusion based on the test results.

Overall, it is a very well-done and interesting project!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: snesunil

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[x] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[ ] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[x] Conclusions: Are the conclusions presented by the authors correct?
[x] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Things that went well:

The project documentation is very concise and understandable, which makes it easy for the reader to interpret the goal of this data analysis project.
There is enough information about the background of the research question, limitations, and assumptions in the README file.
The scripts were readable and well-written.
Great work in EDA generating visualizations covering the distributions of columns and the correlation between features.

Suggestions for improvement:

The assumptions and limitations summarized in the README (that the data set is a good representative sample of the population, and the details about the CLT) could have been included in the report, as they are significant points in an inferential analysis. The names of the authors could also be added to the final report.
The introduction and the importance of the research question could have been explained further in the report so that they are concrete in the reader's mind.
I feel that in the EDA script, more comments and documentation could be added so that it is easy for the reader to understand the script.
There could have been a Report section in the README file, which can be linked to the actual report so that it is easy to locate it to get a bird's eye view of the entire project including conclusions.
The proposal document is missing from the project.
The final report was named EDA_keys.ipynb, which made it hard for me to tell which file was the final report. The naming of the report file could be improved (e.g., {name of the project}_report.ipynb).
It would be better to create a folder called doc in the root folder to store documentation such as proposal.md and the final report. The images folder, which holds all the results from the scripts, could be moved to a folder called results, and the Python notebooks could be moved to the src folder. This would make the project easier for the reader to understand and navigate.
I noticed that figure 1.3 was added twice to the final report; one of the copies could be removed. The figure captions could be made larger, as they are small and hard to read.

Overall, it is a well-explained and very interesting project. Great work!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: Gaoxiang Wang

Conflict of interest

[X] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[X] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[X] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[X] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[X] Installation instructions: Is there a clearly stated list of dependencies?
[X] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[X] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[X] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[X] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[X] Style guidelines: Does the code adhere to well known language style guides?
[X] Modularity: Is the code suitably abstracted into scripts and functions?
[X] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[X] Data: Is the raw data archived somewhere? Is it accessible?
[X] Computational methods: Is all the source code required for the data analysis available?
[X] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[X] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[ ] Authors: Does the report include a list of authors with their affiliations?
[X] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[X] Background: Do the authors provide sufficient background information so that readers can understand the report?
[X] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[X] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[X] Conclusions: Are the conclusions presented by the authors correct?
[X] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[X] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1.5 hours

Review Comments:

Overall, well done! The long scripts are divided into sections, which makes the code easy to follow. I have a few minor suggestions:

Some of the figures and tables in the final version of the report are missing captions. Perhaps you could also place the figures and tables in the "Hypothesis Testing Results" section side by side.
Perhaps it would be better to add some graphical interpretation of the EDA results to the final report.
I recommend using Rmd files rather than Jupyter notebooks for reporting, as it is easier to use bib files for referencing and there are fewer problems with rendering. In addition, I recommend using APA or MLA style with DOIs in the reference section.
The report file should be in the doc folder; maybe you can rename the report folder.
Perhaps the "images" folder could be merged with the "results" folder, since "images" are the results of the EDA phase and are mentioned in the final report.

This is an interesting topic. Good job!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Data analysis review checklist

Reviewer: @roanraina

Conflict of interest

[x] As the reviewer I confirm that I have no conflicts of interest for me to review this work.

Code of Conduct

[x] I confirm that I read and will adhere to the MDS code of conduct.

General checks

[x] Repository: Is the source code for this data analysis available? Is the repository well organized and easy to navigate?
[x] License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?

Documentation

[x] Installation instructions: Is there a clearly stated list of dependencies?
[x] Example usage: Do the authors include examples of how to use the software to reproduce the data analysis?
[x] Functionality documentation: Is the core functionality of the data analysis software documented to a satisfactory level?
[x] Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Code quality

[x] Readability: Are scripts, functions, objects, etc., well named? Is it relatively easy to understand the code?
[x] Style guidelines: Does the code adhere to well known language style guides?
[x] Modularity: Is the code suitably abstracted into scripts and functions?
[ ] Tests: Are there automated tests or manual steps described so that the function of the software can be verified? Are they of sufficient quality to ensure software robustness?

Reproducibility

[x] Data: Is the raw data archived somewhere? Is it accessible?
[x] Computational methods: Is all the source code required for the data analysis available?
[x] Conditions: Is there a record of the necessary conditions (software dependencies) needed to reproduce the analysis? Does there exist an easy way to obtain the computational environment needed to reproduce the analysis?
[x] Automation: Can someone other than the authors easily reproduce the entire data analysis?

Analysis report

[x] Authors: Does the report include a list of authors with their affiliations?
[x] What is the question: Do the authors clearly state the research question being asked?
[ ] Importance: Do the authors clearly state the importance for this research question?
[x] Background: Do the authors provide sufficient background information so that readers can understand the report?
[x] Methods: Do the authors clearly describe and justify the methodology used in the data analysis? Do the authors communicate any assumptions or limitations of their methodologies?
[x] Results: Do the authors clearly communicate their findings through writing, tables and figures?
[ ] Conclusions: Are the conclusions presented by the authors correct?
[ ] References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?
[x] Writing quality: Is the writing of good quality, concise, engaging?

Estimated hours spent reviewing: 1 hour

Review Comments:

While the Results section draws a conclusion, for report structure, it would make more sense to have a separate conclusion section.
It may be beneficial to include standalone independent unit tests for each code component to ensure they function properly.
Echoing a comment above, the report should be placed into its own sub-directory... Without the link provided in this issue, it was unclear what file was the report document.
The EDA section in the README is not very informative; commenting on some broad conclusions or observations might be beneficial.
The Dataset section in the README is good and in-depth, but it should not be the first section in the proposal, and it can be shortened, as the README is supposed to give only a broad overview of the project.
It would be ideal for you to explain why you kept or dropped columns in the model. Many columns are dropped, but no explanation is provided.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Five pieces of feedback that have been implemented:

[Peer review feedback] In the EDA script, more comments and documentation could be added so that it is easy for the reader to understand the script. Commit URL: https://github.com/UBC-MDS/horror_movies/commit/b95931e7f50162474832746202418d392fbee0b5 https://github.com/UBC-MDS/horror_movies/commit/95a0645f5373fde8ef731929bb4ed8ca21bae7b4 File Changed: src/eda_horror.R
[TA feedback] Figure captions missing Commit URL: https://github.com/UBC-MDS/horror_movies/commit/bfd30bba27fbe09b092ee469cd1ac11537376606 File Changed: src/inference_horror.R
[TA feedback] The proposal.md file has been created and moved to the doc directory -2 mechanics Commit URL: https://github.com/UBC-MDS/horror_movies/commit/b21a94378f89f94f32fc0732053b754036361c6b File Changed: doc/proposal.md
[TA feedback] Plots suffer from one or more severe problems. For instance, overplotting, missing legend, small text or no axis labels Commit URL: https://github.com/UBC-MDS/horror_movies/commit/b99dbe0c7fd5642c1db521d8155fc955bfc2f64a File Changed: notebooks/Horror_movies_attributes_and_revenue_EDA.ipynb, src/eda_horror.R
[Peer review] The final report was named EDA_keys.ipynb, which made it hard for me to understand which was the final report. The naming of the report file could be improved. (eg : {name of the project}_report.ipynb) Commit URL: https://github.com/UBC-MDS/horror_movies/commit/094d7534728707b4d908465dc50ff674dc55596f File Changed: report.ipynb

UBC-MDS / data-analysis-review-2022