afogel / socqe_datachallenge

Team 3 COVID 19 data challenge

Date-related variables should be touched #1

Closed: jooyoungseo closed this issue 4 years ago

jooyoungseo commented 4 years ago

The three date-related variables (month, week, and published_date) seem to need fixing: all three are currently parsed as full dates.

For example, month should contain only the month (as a factor), and week can be either a numeric value indicating the N-th week within a month or a factor. Also, the data on GitHub seems to include only March; is that what you intended?
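To make the suggestion concrete, here is a minimal sketch of one way to derive both variables from a date column. The sample dates are made up for illustration, and defining week of month as "days 1-7 are week 1, days 8-14 are week 2, and so on" is an assumption, not something we have agreed on.

```r
library(lubridate)

# Hypothetical sample dates; in the real data this would be df$published_date
published_date <- as.Date(c("2020-02-27", "2020-03-02", "2020-03-18", "2020-03-28"))

# Month as an ordered factor (labels depend on the locale, levels follow the calendar)
month_f <- month(published_date, label = TRUE, abbr = TRUE)

# Week within the month: days 1-7 -> 1, days 8-14 -> 2, ...
week_of_month <- ceiling(mday(published_date) / 7)
week_of_month
#> [1] 4 1 3 4
```

If week should be a factor instead, `factor(week_of_month)` would do it.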

If the data contains only March, the month variable is not needed for our analysis.
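A quick way to check this is to count rows per month. The sketch below uses a small made-up data frame (`df_sample` is hypothetical) whose month values mirror a few of those printed by `head(df)` in the reprex, where row 6 has month 2020-02-01, so the file may hold more than March.

```r
library(dplyr)

# Hypothetical stand-in for the imported data
df_sample <- data.frame(
  month = as.Date(c("2020-03-01", "2020-03-01", "2020-02-01")),
  number_of_shares = c(0, 4618, 0)
)

# One row per distinct month; more than one row means the
# month variable still carries information
month_counts <- count(df_sample, month)
month_counts
```

Running `count(df, month)` on the full data would answer the question directly.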

What do you think? I am sharing reproducible R code below:

# Loading library:
library(ezpickr)
library(tidyverse)

# Importing data:
df <- ezpickr::pick("C:/test/socqe_datachallenge/csv/full_us_dataset.csv")
#> New names:
#> * title -> title...5
#> * title -> title...8
#> Rows: 120,760
#> Columns: 9
#> Delimiter: ","
#> chr  [5]: country, title, bias, title, content
#> dbl  [1]: number_of_shares
#> date [3]: month, week, published_date
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message

# Taking a glimpse of data:
glimpse(df)
#> Rows: 120,760
#> Columns: 9
#> $ month            <date> 2020-03-01, 2020-03-01, 2020-03-01, 2020-03-01, 2...
#> $ week             <date> 2020-03-02, 2020-03-16, 2020-03-16, 2020-03-16, 2...
#> $ published_date   <date> 2020-03-02, 2020-03-18, 2020-03-16, 2020-03-19, 2...
#> $ country          <chr> "US", "US", "US", "US", "US", "US", "US", "US", "U...
#> $ title...5        <chr> "Washington Monthly", "Upworthy", "Upworthy", "Upw...
#> $ bias             <chr> "Lean Left", "Left", "Left", "Left", "Left", "Left...
#> $ number_of_shares <dbl> 0, 4618, 0, 329, 0, 0, 0, 3846, 6680, 5643, 0, 0, ...
#> $ title...8        <chr> "The Threat of the Coronavirus to the U.S. Economy...
#> $ content          <chr> "The Threat of the Coronavirus to the U.S. Economy...
head(df)
#> # A tibble: 6 x 9
#>   month      week       published_date country title...5 bias  number_of_shares
#>   <date>     <date>     <date>         <chr>   <chr>     <chr>            <dbl>
#> 1 2020-03-01 2020-03-02 2020-03-02     US      Washingt~ Lean~                0
#> 2 2020-03-01 2020-03-16 2020-03-18     US      Upworthy  Left              4618
#> 3 2020-03-01 2020-03-16 2020-03-16     US      Upworthy  Left                 0
#> 4 2020-03-01 2020-03-16 2020-03-19     US      Upworthy  Left               329
#> 5 2020-03-01 2020-03-16 2020-03-18     US      Upworthy  Left                 0
#> 6 2020-02-01 2020-02-24 2020-02-27     US      Upworthy  Left                 0
#> # ... with 2 more variables: title...8 <chr>, content <chr>
tail(df)
#> # A tibble: 6 x 9
#>   month      week       published_date country title...5 bias  number_of_shares
#>   <date>     <date>     <date>         <chr>   <chr>     <chr>            <dbl>
#> 1 2020-03-01 2020-03-23 2020-03-28     US      Hampton ~ Cent~                0
#> 2 2020-03-01 2020-03-23 2020-03-27     US      Hampton ~ Cent~                0
#> 3 2020-03-01 2020-03-23 2020-03-28     US      Hampton ~ Cent~                0
#> 4 2020-03-01 2020-03-23 2020-03-28     US      Hampton ~ Cent~                0
#> 5 2020-03-01 2020-03-23 2020-03-27     US      Hampton ~ Cent~                0
#> 6 2020-03-01 2020-03-23 2020-03-28     US      Hampton ~ Cent~                0
#> # ... with 2 more variables: title...8 <chr>, content <chr>

Created on 2020-04-30 by the reprex package (v0.3.0)

Session info

``` r
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.0.0 (2020-04-24)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2020-04-30
#>
#> - Packages -------------------------------------------------------------------
#>  package     * version  date       lib source
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.0.0)
#>  backports     1.1.6    2020-04-05 [1] CRAN (R 4.0.0)
#>  bit           1.1-15.2 2020-02-10 [1] CRAN (R 4.0.0)
#>  bit64         0.9-7    2017-05-08 [1] CRAN (R 4.0.0)
#>  broom         0.5.6    2020-04-20 [1] CRAN (R 4.0.0)
#>  callr         3.4.3    2020-03-28 [1] CRAN (R 4.0.0)
#>  cellranger    1.1.0    2016-07-27 [1] CRAN (R 4.0.0)
#>  cli           2.0.2    2020-02-28 [1] CRAN (R 4.0.0)
#>  colorspace    1.4-1    2019-03-18 [1] CRAN (R 4.0.0)
#>  crayon        1.3.4    2017-09-16 [1] CRAN (R 4.0.0)
#>  DBI           1.1.0    2019-12-15 [1] CRAN (R 4.0.0)
#>  dbplyr        1.4.3    2020-04-19 [1] CRAN (R 4.0.0)
#>  desc          1.2.0    2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools      2.3.0    2020-04-10 [1] CRAN (R 4.0.0)
#>  digest        0.6.25   2020-02-23 [1] CRAN (R 4.0.0)
#>  dplyr       * 0.8.5    2020-03-07 [1] CRAN (R 4.0.0)
#>  ellipsis      0.3.0    2019-09-20 [1] CRAN (R 4.0.0)
#>  evaluate      0.14     2019-05-28 [1] CRAN (R 4.0.0)
#>  ezpickr     * 2.0.0    2019-11-17 [1] CRAN (R 4.0.0)
#>  fansi         0.4.1    2020-01-08 [1] CRAN (R 4.0.0)
#>  forcats     * 0.5.0    2020-03-01 [1] CRAN (R 4.0.0)
#>  fs            1.4.1    2020-04-04 [1] CRAN (R 4.0.0)
#>  generics      0.0.2    2018-11-29 [1] CRAN (R 4.0.0)
#>  ggplot2     * 3.3.0    2020-03-05 [1] CRAN (R 4.0.0)
#>  glue          1.4.0    2020-04-03 [1] CRAN (R 4.0.0)
#>  gtable        0.3.0    2019-03-25 [1] CRAN (R 4.0.0)
#>  haven         2.2.0    2019-11-08 [1] CRAN (R 4.0.0)
#>  highr         0.8      2019-03-20 [1] CRAN (R 4.0.0)
#>  hms           0.5.3    2020-01-08 [1] CRAN (R 4.0.0)
#>  htmltools     0.4.0    2019-10-04 [1] CRAN (R 4.0.0)
#>  httr          1.4.1    2019-08-05 [1] CRAN (R 4.0.0)
#>  jsonlite      1.6.1    2020-02-02 [1] CRAN (R 4.0.0)
#>  knitr         1.28.5   2020-04-28 [1] Github (yihui/knitr@93b46ba)
#>  lattice       0.20-41  2020-04-02 [1] CRAN (R 4.0.0)
#>  lifecycle     0.2.0    2020-03-06 [1] CRAN (R 4.0.0)
#>  lubridate     1.7.8    2020-04-06 [1] CRAN (R 4.0.0)
#>  magrittr      1.5      2014-11-22 [1] CRAN (R 4.0.0)
#>  memoise       1.1.0    2017-04-21 [1] CRAN (R 4.0.0)
#>  modelr        0.1.6    2020-02-22 [1] CRAN (R 4.0.0)
#>  munsell       0.5.0    2018-06-12 [1] CRAN (R 4.0.0)
#>  nlme          3.1-147  2020-04-13 [1] CRAN (R 4.0.0)
#>  pillar        1.4.3    2019-12-20 [1] CRAN (R 4.0.0)
#>  pkgbuild      1.0.7    2020-04-25 [1] CRAN (R 4.0.0)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.0.0)
#>  pkgload       1.0.2    2018-10-29 [1] CRAN (R 4.0.0)
#>  prettyunits   1.1.1    2020-01-24 [1] CRAN (R 4.0.0)
#>  processx      3.4.2    2020-02-09 [1] CRAN (R 4.0.0)
#>  ps            1.3.2    2020-02-13 [1] CRAN (R 4.0.0)
#>  purrr       * 0.3.4    2020-04-17 [1] CRAN (R 4.0.0)
#>  R6            2.4.1    2019-11-12 [1] CRAN (R 4.0.0)
#>  Rcpp          1.0.4.6  2020-04-09 [1] CRAN (R 4.0.0)
#>  readr       * 1.3.1    2018-12-21 [1] CRAN (R 4.0.0)
#>  readxl        1.3.1    2019-03-13 [1] CRAN (R 4.0.0)
#>  remotes       2.1.1    2020-02-15 [1] CRAN (R 4.0.0)
#>  reprex        0.3.0    2019-05-16 [1] CRAN (R 4.0.0)
#>  rlang         0.4.5    2020-03-01 [1] CRAN (R 4.0.0)
#>  rmarkdown     2.1.3    2020-04-28 [1] Github (rstudio/rmarkdown@e2ceb35)
#>  rprojroot     1.3-2    2018-01-03 [1] CRAN (R 4.0.0)
#>  rvest         0.3.5    2019-11-08 [1] CRAN (R 4.0.0)
#>  scales        1.1.0    2019-11-18 [1] CRAN (R 4.0.0)
#>  sessioninfo   1.1.1    2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi       1.4.6    2020-02-17 [1] CRAN (R 4.0.0)
#>  stringr     * 1.4.0    2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat      2.3.2    2020-03-02 [1] CRAN (R 4.0.0)
#>  tibble      * 3.0.1    2020-04-20 [1] CRAN (R 4.0.0)
#>  tidyr       * 1.0.2    2020-01-24 [1] CRAN (R 4.0.0)
#>  tidyselect    1.0.0    2020-01-27 [1] CRAN (R 4.0.0)
#>  tidyverse   * 1.3.0    2019-11-21 [1] CRAN (R 4.0.0)
#>  usethis       1.6.1    2020-04-29 [1] CRAN (R 4.0.0)
#>  utf8          1.1.4    2018-05-24 [1] CRAN (R 4.0.0)
#>  vctrs         0.2.4    2020-03-10 [1] CRAN (R 4.0.0)
#>  vroom         1.2.0    2020-01-13 [1] CRAN (R 4.0.0)
#>  withr         2.2.0    2020-04-20 [1] CRAN (R 4.0.0)
#>  xfun          0.13.1   2020-04-30 [1] Github (yihui/xfun@bf8afdd)
#>  xml2          1.3.2    2020-04-23 [1] CRAN (R 4.0.0)
#>  yaml          2.2.1    2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Program Files/R/R-4.0.0/library
```

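The import message in the reprex reports two columns named title (auto-renamed to title...5 and title...8) and suggests passing `col_types`. As a rough sketch of how that could be handled, the code below reads a tiny made-up file with `readr::read_csv()` directly rather than `ezpickr::pick()`, since how `pick()` forwards arguments is not shown in this thread. The names `source_title` and `article_title` are hypothetical, guessed from the values printed in the reprex (outlet names in the first title column, headlines in the second).

```r
library(readr)

# Hypothetical two-row sample with the same column layout as full_us_dataset.csv
csv_lines <- c(
  "month,week,published_date,country,title,bias,number_of_shares,title,content",
  "2020-03-01,2020-03-02,2020-03-02,US,Washington Monthly,Lean Left,0,Headline A,Body A",
  "2020-02-01,2020-02-24,2020-02-27,US,Upworthy,Left,12,Headline B,Body B"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

df_typed <- read_csv(
  path,
  skip = 1,  # skip the original header so the duplicated "title" never gets auto-renamed
  col_names = c("month", "week", "published_date", "country", "source_title",
                "bias", "number_of_shares", "article_title", "content"),
  col_types = cols(
    month = col_date(),
    week = col_date(),
    published_date = col_date(),
    number_of_shares = col_double(),
    .default = col_character()
  )
)
names(df_typed)
```

With an explicit specification the column-guessing message goes away, and the two title columns get names that say what they hold.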
jooyoungseo commented 4 years ago

Done! Do you think we need to distribute this newly cleaned data to our team? I think we can just include this script in our analysis as well, to document how we wrangled our data.

Any thoughts, @afogel?

# Loading library:
library(ezpickr)
library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:dplyr':
#> 
#>     intersect, setdiff, union
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

# Importing data:
df <- ezpickr::pick("C:/test/socqe_datachallenge/csv/full_us_dataset.csv")
#> New names:
#> * title -> title...5
#> * title -> title...8
#> Rows: 120,760
#> Columns: 9
#> Delimiter: ","
#> chr  [5]: country, title, bias, title, content
#> dbl  [1]: number_of_shares
#> date [3]: month, week, published_date
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message

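# Reducing month and week to plain numbers, then sorting chronologically
# (note: lubridate::week() returns the week of the year, not the week within the month):
df2 <- df %>%
  mutate(month = month(month)) %>%
  mutate(week = week(week)) %>%
  arrange(published_date, month, week)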

head(df2)
#> # A tibble: 6 x 9
#>   month  week published_date country title...5 bias  number_of_shares title...8
#>   <dbl> <dbl> <date>         <chr>   <chr>     <chr>            <dbl> <chr>    
#> 1    12    49 2019-12-12     US      Forbes    Cent~                3 How Prep~
#> 2    12    51 2019-12-23     US      Vox.com   Lean~                0 The 2010~
#> 3    12    51 2019-12-23     US      Vox.com   Lean~                0 Trump’s ~
#> 4    12    52 2019-12-31     US      The Atla~ Lean~                0 Photos o~
#> 5     1    52 2020-01-01     US      InfoWars  Right                0 Largest ~
#> 6     1    52 2020-01-01     US      InfoWars  Right                0 US shale~
#> # ... with 1 more variable: content <chr>
tail(df2)
#> # A tibble: 6 x 9
#>   month  week published_date country title...5 bias  number_of_shares title...8
#>   <dbl> <dbl> <date>         <chr>   <chr>     <chr>            <dbl> <chr>    
#> 1     3    12 2020-03-29     US      Atlanta ~ Lean~                0 Coronavi~
#> 2     3    12 2020-03-29     US      National~ Right                0 Coronavi~
#> 3     3    12 2020-03-29     US      National~ Right                0 Sports N~
#> 4     3    12 2020-03-29     US      National~ Right                0 Coronavi~
#> 5     3    12 2020-03-29     US      National~ Right                0 Coronavi~
#> 6     3    12 2020-03-29     US      Daily Pr~ Lean~                0 Virginia~
#> # ... with 1 more variable: content <chr>

Created on 2020-04-30 by the reprex package (v0.3.0)
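One caveat with the numeric month and week above: the data spans December 2019 to March 2020, so bare numbers lose the year (month 12 sorts after month 3, and early-January rows carry week 52). A possible workaround, sketched here on made-up week-start dates with a hypothetical `year_month_f` variable, is to keep a year-month label as an ordered factor.

```r
library(lubridate)

# Hypothetical week-start dates spanning the year boundary
week_start <- as.Date(c("2019-12-09", "2019-12-30", "2020-03-23"))

# Bare numbers drop the year, so December appears "later" than March
month(week_start)
#> [1] 12 12  3
week(week_start)
#> [1] 49 52 12

# A "YYYY-MM" label sorts chronologically and can serve as an ordered factor
year_month <- format(week_start, "%Y-%m")
year_month_f <- factor(year_month, levels = sort(unique(year_month)), ordered = TRUE)
levels(year_month_f)
#> [1] "2019-12" "2020-03"
```

Whether this is worth adding depends on whether the analysis groups by month across the year boundary; sorting by published_date alone already keeps rows in order.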