Closed: jennifer-hoang closed this issue 2 years ago.
Data analysis review checklist
Reviewer: @harryyikhchan
Review Comments:
Nicely done, everyone! Overall, the scripts and the report are very well structured. Please find my comments below.
Thank you very much for taking the time to review our project. Your insights are truly appreciated.
- Suggest adding a flow chart / workflow diagram to the README so readers can better understand the overall data flow through an adequate graphical representation.
Thank you for this wonderful suggestion. We have now included it in the README (a3f1bbb), and it indeed facilitates a better understanding of the overall data flow through graphical representation.
- In the final report, a Data section could be added to summarize the dataset used in this analysis (e.g., the number of columns, the type or description of each column, and the total number of observations).
Thank you; a “Data” section has now been added to the project report (112ab85).
- It seems you are addressing a statistical inference problem. It would be more informative if you stated your null and alternative hypotheses and framed your conclusion in terms of your findings.
Thank you for this recommendation. We have now included the null and alternative hypotheses (7e330c2).
- In the analysis source code, it seems that you repeatedly construct several models and generate the data format. I would suggest writing a model wrapper function that follows the DRY principle.
Thank you for this suggestion. We have now streamlined our code (to the extent possible) following the DRY principle.
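To illustrate the DRY refactor the reviewer asked for, here is a minimal Python sketch (the helper name `fit_models` and the stand-in fit function are ours, not the project's actual R code): the repeated fit-and-collect pattern becomes one wrapper driven by a table of model specifications.

```python
# Hypothetical sketch of a DRY model wrapper: rather than repeating the
# fit/summarize code for every model, drive one function with a spec table.
def fit_models(model_specs, fit_fn):
    """Fit one model per specification and collect the results by name.

    model_specs : dict mapping model name -> formula string
    fit_fn      : callable taking a formula and returning a fitted model
    """
    return {name: fit_fn(formula) for name, formula in model_specs.items()}

# Stand-in "fit" function for demonstration only; a real script would call
# its regression routine here (e.g. an lm()/ols() equivalent).
count_terms = lambda formula: formula.split("~")[1].count("+") + 1

specs = {
    "base": "log_shares ~ n_tokens_title",
    "full": "log_shares ~ n_tokens_title + num_imgs + num_videos",
}
results = fit_models(specs, count_terms)  # {'base': 1, 'full': 3}
```

Swapping in a real fitting routine then adds a model without duplicating any fitting code.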
- I tried to run src/regression_online_news_popularity.R on my local computer but found there are a few missing packages (i.e., caret, car, feather). Please add them to the Dependencies section.
These were already listed in the Dependencies section of the README in the version that you reviewed.
- I cannot execute the Rscript to generate the report. Below is the error message I captured from the terminal.
Quitting from lines 19-21 (report.Rmd)
Error in gzfile(file, "rb") : cannot open the connection
Calls: <Anonymous> ... withCallingHandlers -> withVisible -> eval -> eval -> readRDS -> gzfile
Thank you for this feedback. This likely happened because some required packages were not installed on your end. That said, the analysis now runs through Docker (9716867), so the environment should no longer be a problem.
Attribution
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Data analysis review checklist
Reviewer: @alexYinanGu0
Review Comments:
Good job! The structure of the repo is clean and organized, the research topic is interesting, and the report is overall well written. It would be better if some details were added; for example:
Thank you very much for taking the time to review our project. Your insights are truly appreciated.
- the GitHub repo link could be added at the top instead of leaving it blank;
Thank you for flagging this. As of Milestone 3, the GitHub repo link is included at the top of the report (1f437b7).
- some feature names are hard to read;
Thank you. We have now included a variable-description PDF file for reference (1c78f0e).
- figure and table names are repeated.
Thank you. This was an issue with the PDF output, but the report now knits to HTML and the problem no longer occurs (86c6e1a).
Some tests for the list of dependencies would be appreciated, and some documentation for your functions in the scripts would be helpful.
Thank you for this suggestion. Given that we have now migrated to Docker for Milestone 4 (9716867), we believe that all dependencies are now taken care of.
Attribution
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Data analysis review checklist
Reviewer: @sukhleen999
Review Comments:
Great work, team! The research question is really interesting and the process workflow is well structured. I am listing a few of my observations/suggestions below:
Thank you very much for taking the time to review our project. Your insights are truly appreciated.
- It can be helpful to include print statements in the scripts to track their progress while running them in the terminal.
Thank you. Some print statements have now been included.
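As an illustration of the kind of progress output added, here is a minimal Python sketch (the helper name `log_step` is ours; the project's R scripts would use `message()`/`print()` analogously):

```python
import time

def log_step(message):
    """Print a timestamped progress line so long-running scripts show
    where they are when run from the terminal."""
    print(f"[{time.strftime('%H:%M:%S')}] {message}", flush=True)

log_step("Reading raw data ...")
log_step("Fitting regression models ...")
log_step("Writing figures and tables ...")
```

`flush=True` ensures each line appears immediately even when output is piped or redirected.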
- The pdf for the final report has repeated figure captions.
Thank you. This was an issue with the PDF output, but the report now knits to HTML and the problem no longer occurs (86c6e1a).
- The correlation matrix could be plotted for only the top relevant features to reduce over-visualization.
Thank you for this suggestion. We use the correlation matrix as a starting point to determine which features to include in order to avoid multicollinearity, and we follow this up with a VIF analysis to ensure our model controls for multicollinearity. This is in line with what was covered in Lecture 8 of DSCI 561.
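The screening step described above can be sketched as follows (an illustrative Python example on made-up data, not the project's actual R code; the real analysis applies the 0.7 threshold to the full feature set and then follows up with VIFs):

```python
def pearson(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def drop_correlated(features, threshold=0.7):
    """From every pair with |r| >= threshold, keep only the first feature.

    features : dict mapping feature name -> list of numeric values
    """
    names, dropped = list(features), set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(features[a], features[b])) >= threshold:
                dropped.add(b)
    return [n for n in names if n not in dropped]

toy = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],  # nearly collinear with x1 -> dropped
    "x3": [5, 1, 4, 2, 8],   # weakly correlated -> kept
}
kept = drop_correlated(toy)  # ['x1', 'x3']
```

The survivors of this screen would then be checked with VIFs before fitting the final model.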
- A command to directly create the environment file could help users set up the required dependencies for the project easily.
Thank you for this suggestion. Given that we have now migrated to Docker for Milestone 4 (9716867), we believe that all environment issues are now taken care of. You can also see the updated Docker instructions here (0240b36).
- The function docstring for the merge_data_column function in the eda.py script should document the input parameters, the returned object, and their data types.
Thank you for the suggestion. Since this is an automated setup, the input parameters have been automatically inserted into our code.
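For reference, a docstring of the kind the reviewer asked for might look like the sketch below. The signature and body are hypothetical stand-ins; the real merge_data_column in eda.py may take DataFrames and behave differently.

```python
def merge_data_column(left, right, column):
    """Copy one column from `right` into a new version of `left`.

    (Hypothetical signature for illustration; the real function in
    eda.py may differ.)

    Parameters
    ----------
    left : dict of {str: list}
        Table to merge into, keyed by column name.
    right : dict of {str: list}
        Table providing the extra column.
    column : str
        Name of the column in `right` to add to `left`.

    Returns
    -------
    dict of {str: list}
        A new table containing all of `left`'s columns plus `column`.
    """
    merged = dict(left)
    merged[column] = list(right[column])
    return merged
```

The point is the Parameters/Returns sections: each argument and the return value get a name, a type, and a one-line description.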
Attribution
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Data analysis review checklist
Reviewer: @hatefr
Review Comments:
I would like to thank the authors for their great work in addressing a very interesting inferential inquiry. In general, I was very pleased with the project/repo structure and have only some minor suggestions, as follows:
Thank you very much for taking the time to review our project. Your insights are truly appreciated.
- I would encourage the authors to add the rationale for the choice of 0.7 as the correlation-coefficient threshold for feature selection and the different models. On multiple occasions, the authors refer to this number when building a new LR model. Perhaps this is a standard in statistics and linear regression that I am not aware of, but it would be nice to expand on it.
Thank you for this feedback. Multicollinearity had not yet been covered in DSCI 561 by the time we submitted Milestone 3 (it is only covered in Module 8), but we were aware that it might be a problem in our dataset, so we went with a conservative threshold of 0.7 for elimination and then filtered the subset of features down using VIFs. Our intuition was about right: in DSCI 561 we have since learned to keep only one variable of each highly correlated pair (in both directions). Our method, while it comes at the cost of eliminating a few features, minimizes the coefficient bias that can arise from keeping more correlated variables. In a preliminary analysis, we also included all correlated features, but model performance (measured by adjusted R-squared) did not improve.
- I would also like to see some tests for the scripts that are currently missing. For example, an assertion test to ensure that the input file is csv would be really helpful for users to reproduce the analysis.
Thank you for this feedback. We think the directions within our Python and R scripts are now detailed enough to guide users on input file types and other such concerns.
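A guard of the kind the reviewer suggested is small enough to sketch here (illustrative Python; the function name `validate_input_path` is ours, not part of the project's scripts):

```python
from pathlib import Path

def validate_input_path(path):
    """Fail fast with a clear message if the input file is not a CSV."""
    if Path(path).suffix.lower() != ".csv":
        raise ValueError(f"Expected a .csv input file, got: {path}")
    return path

validate_input_path("data/raw/OnlineNewsPopularity.csv")  # passes silently
```

Called at the top of a script, this turns a confusing downstream parse error into an immediate, readable one.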
- I would also suggest creating a subfolder called notebooks in the src folder and move all the notebook files from the results folder to there. Generally, it is a good practice to keep the results folder clean with only figures and tables.
Thank you for this feedback. Given that tables and figures have their own folders, and the notebook files are not core to the automated Docker setup, we think the current folder structure is fine.
- Some minor details in the report: the font sizes in some of the figures are really small; please consider using larger fonts (although that may be difficult given the large number of features you have). Also, please add affiliations for the authors. 'Table 2' and 'model' are each mentioned twice in the caption of Table 2.
The affiliations of the authors have now been added (f4ef026).
- Also, as suggested by other reviewers, in the analysis script you wrote the code for each model repeatedly, which makes the script very long and hard for users to understand. I would suggest rewriting some parts of the code with DRY principles in mind.
Thank you for this suggestion. We have now streamlined our code (to the extent possible given time constraints) following the DRY principle.
Attribution
This was derived from the JOSE review checklist and the ROpenSci review checklist.
Submitting authors: @jennifer-hoang @nrao944 @lipcai
Repository: https://github.com/UBC-MDS/online_news_popularity
Report link: https://github.com/UBC-MDS/online_news_popularity/blob/main/doc/report.pdf
Abstract/executive summary: The market space for online news has grown manifold, with traditional news outlets also providing online versions of their articles. This growth has been accompanied by increased competition, particularly from non-traditional news outlets (such as small-scale digital news companies) that can reach a wider audience through social media. For this project, we aimed to answer the following inferential research question: what factors are associated with online news popularity, measured by the number of times an article has been shared? Our findings will be relevant to news organizations that are interested in boosting the traction of their online news articles and may help guide management decisions on the characteristics associated with more popular articles.
Using a multiple linear regression analysis with 'log shares per day' as our response variable, our model achieved an R-squared score of 0.2132. This indicates that additional features not included in our current model explain a large portion of variability in the data. Future steps for our analysis include exploring the contribution of interaction effects in our model, as well as other regression models such as random forest regression.
Editor: @jennifer-hoang @nrao944 @lipcai
Reviewer: @sukhleen999 @harryyikhchan @alexYinanGu0 @hatefr