UBC-MDS / online_news_popularity

Assessing factors associated with online news popularity for DSCI 522

Milestone 1 Review #24

Open mohamad-amin opened 2 years ago

mohamad-amin commented 2 years ago
  1. Project set-up: Mechanics. Please include the teamwork contract in your GitHub repository.

  2. Project proposal: reasoning

    • What is 'log shares per day'? How should the reader know about the dataset that you're using?
    • Why is R-squared a good evaluation metric for your project? How does the reported R-squared relate to the question that you have?
    • How does the linear regression help you with answering your question?
    • Why did you choose linear regression?
    • What's your plan for visualizations and explorations that you have mentioned?
  3. A script that downloads the data: Quality

    • It's not really a good practice to use different languages in a project unless you're forced to.
  4. Exploratory data analysis in a literate code document: QUALITY You could clean the code a bit more. Try to use functions to avoid repeating the same block of code if possible.

  5. Exploratory data analysis in a literate code document: VIZ and REASONING

    • You have not mentioned any motivation for the plots that you have.
    • You have not described much about what you infer from the plots that you have. (These should be in the .ipynb EDA file, and it should be linked to in your front page.)
nrao944 commented 2 years ago

Team 8 thanks you for your feedback. Please find our responses below.

  1. Project set-up: Mechanics Please include the teamwork contract in your github repository.

OUR RESPONSE: Under the teamwork contract section of Milestone 1, there is a statement in bold: “Note - this document is fairly personal and does NOT need to reside in your public GitHub.com repo. Instead you can prove that you created this by pasting it into the text box for your Canvas homework submission for this milestone.” If you check our Canvas submission, you will find our teamwork contract there.

  2. Project proposal: reasoning

• What is 'log shares per day'? How should the reader know about the dataset that you're using?

OUR RESPONSE: Thank you for the feedback. This has been updated in the README.md file.

• Why is R-squared a good evaluation metric for your project?

OUR RESPONSE: While RMSE, MSE, and MAE are alternatives, they are not on a bounded scale, so they do not directly tell us how well our model explains the variation in our dependent variable as a function of our set of features. R-squared has the advantage that it measures the proportion of variance in the dependent variable that is explained by the independent variables, and it is derived from the MSE. We have updated the current version of our report to discuss the appropriateness of adjusted R-squared, which accounts for the degrees of freedom in our model, and we prefer to stay with an R-squared-based evaluation metric over MSE, RMSE, or MAE.

We have included this discussion in the README, Project Proposal, and the Report.
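The relationship between MSE, R-squared, and adjusted R-squared described in this response can be sketched as follows. This is a minimal illustration and not the team's code: the function names and the toy numbers are hypothetical, and the formulas are the standard textbook definitions.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R-squared as 1 - MSE / (population variance of y_true)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    variance = np.mean((y_true - y_true.mean()) ** 2)
    return 1.0 - mse / variance

def adjusted_r_squared(r2, n_obs, n_features):
    """Penalize R-squared for the number of features in the model."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Toy values, invented for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_obs=len(y_true), n_features=1)
print(f"R-squared: {r2:.3f}, adjusted R-squared: {adj_r2:.3f}")
```

Unlike MSE, the result does not depend on the units of the dependent variable, which is the property the response relies on.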

• How does the reported R-squared relate to the question that you have?

OUR RESPONSE: This has been answered in our Project Report: “This seems like a low R-Squared, particularly given the large number of features included in the model and their statistical significance at alpha = 0.05. This indicates that other variables that are not currently included in the model explain a large portion of the variability in our data. There is not much we can do about this problem, beyond including some interaction variables to assess if there are any interaction effects.”

• How does the linear regression help you with answering your question?

OUR RESPONSE: In this study, we are interested in what factors affect online news popularity. Thus, for every feature included in the model, we want a quantitative measure of its association with our measure of online news popularity. A linear regression provides feature coefficients, which give us the magnitude and direction of each association and so help answer our question.
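As a rough illustration of reading magnitude and direction from regression coefficients, here is a minimal sketch on synthetic data. It is not the project's analysis: the two features, their true effects, and the noise level are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two hypothetical standardized features (not columns from the real dataset).
X = rng.normal(size=(n, 2))
true_coefs = np.array([0.5, -0.3])
y = 1.0 + X @ true_coefs + rng.normal(scale=0.1, size=n)

# Ordinary least squares with an explicit intercept column.
design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coefs = beta[0], beta[1:]

for name, coef in zip(["feature_a", "feature_b"], coefs):
    direction = "positive" if coef > 0 else "negative"
    print(f"{name}: {coef:+.3f} ({direction} association)")
```

The sign of each coefficient gives the direction of the association and its absolute value gives the size of the change in the response per unit change in that feature, holding the other feature fixed.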

• Why did you choose linear regression?

OUR RESPONSE: Same response as above: in this study, we are interested in what factors affect online news popularity. Thus, for every feature included in the model, we want a quantitative measure of its association with our measure of online news popularity. A linear regression provides feature coefficients, which give us the magnitude and direction of each association and so help answer our question, making it a logical choice. Our dependent variable is continuous, so a classification model is not appropriate.

• What's your plan for visualizations and explorations that you have mentioned?

OUR RESPONSE: In Milestone 1, we included our EDA file with a correlation plot of all features and two bar graphs showing how our dependent variable varies by type of news article and by the day of the week the article was released. Since EDA can only give us a preliminary sense of these associations, we plan to examine them more rigorously using a multiple linear regression model.

  3. A script that downloads the data: Quality

• It's not really a good practice to use different languages in a project unless you're forced to.

OUR RESPONSE: Thank you for your feedback, but we respectfully disagree. In the real world, not everyone on a team works with the same software package. Some team members have a comparative advantage in specific software. Furthermore, certain types of analysis are better handled by different platforms. In our case, Python was better for EDA because R crashed each time we tried to produce our correlation plot. For the regression analysis, R is better because scikit-learn does not automatically produce a tidy version of the regression results, and producing the same table would require long-winded code.
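To illustrate what a tidy regression table involves outside R, here is a minimal sketch that builds one by hand with NumPy. It is not the team's code: `tidy_ols`, the feature names, and the synthetic data are hypothetical, and p-values are left out because they would need an additional library such as SciPy.

```python
import numpy as np

def tidy_ols(X, y, names):
    """Fit OLS with an intercept and return one dict per model term."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    design = np.column_stack([np.ones(n), X])

    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ beta
    dof = n - design.shape[1]            # residual degrees of freedom
    sigma2 = residuals @ residuals / dof  # unbiased error variance estimate
    cov = sigma2 * np.linalg.inv(design.T @ design)
    std_errors = np.sqrt(np.diag(cov))

    return [
        {
            "term": term,
            "estimate": float(est),
            "std_error": float(err),
            "statistic": float(est / err),
        }
        for term, est, err in zip(["(Intercept)", *names], beta, std_errors)
    ]

# Synthetic data, invented for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = 2.0 + 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.1, size=100)

table = tidy_ols(X, y, ["feature_a", "feature_b"])
for row in table:
    print(row)
```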

  4. Exploratory data analysis in a literate code document: QUALITY You could clean the code a bit more. Try to use functions to avoid repeating the same block of code if possible.

OUR RESPONSE: Thank you for the feedback. Based on our discussion with Florencia, the goal of this assignment is not to write efficient code but to understand data science workflows and how to work within a team. While the use of functions is a valuable suggestion, we do not view it as central to the fundamental course objective of DSCI 522.

  5. Exploratory data analysis in a literate code document: VIZ and REASONING

• You have not mentioned any motivation for the plots that you have.

• You have not described much about what you infer from the plots that you have. (These should be in the .ipynb EDA file, and it should be linked to in your front page.)

OUR RESPONSE: Thank you for the feedback. These are now included in the report and have been updated in the EDA as well.

LINK TO OUR PROPOSAL: https://github.com/UBC-MDS/online_news_popularity/blob/main/doc/proposal.md