Closed jooyoungseo closed 4 years ago
Done! Do you think we need to distribute this updated data to our team? I think we can just include this script in our analysis to document how we transformed the data.
Any thoughts, @afogel?
# Loading library:
library(ezpickr)
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:dplyr':
#>
#> intersect, setdiff, union
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
# Importing data:
df <- ezpickr::pick("C:/test/socqe_datachallenge/csv/full_us_dataset.csv")
#> New names:
#> * title -> title...5
#> * title -> title...8
#> Rows: 120,760
#> Columns: 9
#> Delimiter: ","
#> chr [5]: country, title, bias, title, content
#> dbl [1]: number_of_shares
#> date [3]: month, week, published_date
#>
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
df2 <- df %>%
  mutate(month = month(month)) %>%
  mutate(week = week(week)) %>%
  arrange(published_date, month, week)
head(df2)
#> # A tibble: 6 x 9
#> month week published_date country title...5 bias number_of_shares title...8
#> <dbl> <dbl> <date> <chr> <chr> <chr> <dbl> <chr>
#> 1 12 49 2019-12-12 US Forbes Cent~ 3 How Prep~
#> 2 12 51 2019-12-23 US Vox.com Lean~ 0 The 2010~
#> 3 12 51 2019-12-23 US Vox.com Lean~ 0 Trump’s ~
#> 4 12 52 2019-12-31 US The Atla~ Lean~ 0 Photos o~
#> 5 1 52 2020-01-01 US InfoWars Right 0 Largest ~
#> 6 1 52 2020-01-01 US InfoWars Right 0 US shale~
#> # ... with 1 more variable: content <chr>
tail(df2)
#> # A tibble: 6 x 9
#> month week published_date country title...5 bias number_of_shares title...8
#> <dbl> <dbl> <date> <chr> <chr> <chr> <dbl> <chr>
#> 1 3 12 2020-03-29 US Atlanta ~ Lean~ 0 Coronavi~
#> 2 3 12 2020-03-29 US National~ Right 0 Coronavi~
#> 3 3 12 2020-03-29 US National~ Right 0 Sports N~
#> 4 3 12 2020-03-29 US National~ Right 0 Coronavi~
#> 5 3 12 2020-03-29 US National~ Right 0 Coronavi~
#> 6 3 12 2020-03-29 US Daily Pr~ Lean~ 0 Virginia~
#> # ... with 1 more variable: content <chr>
Created on 2020-04-30 by the reprex package (v0.3.0)
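The import above reported two columns both named `title`, which were repaired to `title...5` and `title...8`. Judging only from the `head()` output, the first seems to hold the outlet and the second the headline, so a rename step might make later analysis easier to read. The following is a minimal sketch on made-up toy rows, not part of the original script; the new names `outlet` and `headline` are assumptions chosen for illustration.

``` r
library(tibble)
library(dplyr)

# Toy rows (made up) that reuse the repaired column names from the import.
toy <- tibble(
  country     = c("US", "US"),
  `title...5` = c("Outlet A", "Outlet B"),
  bias        = c("Right", "Right"),
  `title...8` = c("Headline one", "Headline two")
)

# Give the two repaired columns descriptive names.
# `outlet` and `headline` are hypothetical names, not from the dataset.
clean <- toy %>%
  rename(outlet = `title...5`, headline = `title...8`)

names(clean)
```

If the real columns turn out to mean something else, only the two names on the left-hand side of `rename()` would need to change.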
The three variables (i.e., `month`, `week`, `published_date`) seem to need fixing.
For example, `month` should contain only the month (as a factor), and `week` can be either a numeric value indicating the N-th week within a month or a factor. I see that the data on GitHub only includes March; is that what you intended? If we only have March data, the `month` variable is not needed for our analysis. What do you think? I am sharing the reproducible R code below:
Created on 2020-04-30 by the reprex package (v0.3.0)
Session info
``` r
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.0.0 (2020-04-24)
#>  os       Windows 10 x64
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.1252
#>  ctype    English_United States.1252
#>  tz       America/New_York
#>  date     2020-04-30
#>
#> - Packages -------------------------------------------------------------------
#>  package     * version  date       lib source
#>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.0.0)
#>  backports     1.1.6    2020-04-05 [1] CRAN (R 4.0.0)
#>  bit           1.1-15.2 2020-02-10 [1] CRAN (R 4.0.0)
#>  bit64         0.9-7    2017-05-08 [1] CRAN (R 4.0.0)
#>  broom         0.5.6    2020-04-20 [1] CRAN (R 4.0.0)
#>  callr         3.4.3    2020-03-28 [1] CRAN (R 4.0.0)
#>  cellranger    1.1.0    2016-07-27 [1] CRAN (R 4.0.0)
#>  cli           2.0.2    2020-02-28 [1] CRAN (R 4.0.0)
#>  colorspace    1.4-1    2019-03-18 [1] CRAN (R 4.0.0)
#>  crayon        1.3.4    2017-09-16 [1] CRAN (R 4.0.0)
#>  DBI           1.1.0    2019-12-15 [1] CRAN (R 4.0.0)
#>  dbplyr        1.4.3    2020-04-19 [1] CRAN (R 4.0.0)
#>  desc          1.2.0    2018-05-01 [1] CRAN (R 4.0.0)
#>  devtools      2.3.0    2020-04-10 [1] CRAN (R 4.0.0)
#>  digest        0.6.25   2020-02-23 [1] CRAN (R 4.0.0)
#>  dplyr       * 0.8.5    2020-03-07 [1] CRAN (R 4.0.0)
#>  ellipsis      0.3.0    2019-09-20 [1] CRAN (R 4.0.0)
#>  evaluate      0.14     2019-05-28 [1] CRAN (R 4.0.0)
#>  ezpickr     * 2.0.0    2019-11-17 [1] CRAN (R 4.0.0)
#>  fansi         0.4.1    2020-01-08 [1] CRAN (R 4.0.0)
#>  forcats     * 0.5.0    2020-03-01 [1] CRAN (R 4.0.0)
#>  fs            1.4.1    2020-04-04 [1] CRAN (R 4.0.0)
#>  generics      0.0.2    2018-11-29 [1] CRAN (R 4.0.0)
#>  ggplot2     * 3.3.0    2020-03-05 [1] CRAN (R 4.0.0)
#>  glue          1.4.0    2020-04-03 [1] CRAN (R 4.0.0)
#>  gtable        0.3.0    2019-03-25 [1] CRAN (R 4.0.0)
#>  haven         2.2.0    2019-11-08 [1] CRAN (R 4.0.0)
#>  highr         0.8      2019-03-20 [1] CRAN (R 4.0.0)
#>  hms           0.5.3    2020-01-08 [1] CRAN (R 4.0.0)
#>  htmltools     0.4.0    2019-10-04 [1] CRAN (R 4.0.0)
#>  httr          1.4.1    2019-08-05 [1] CRAN (R 4.0.0)
#>  jsonlite      1.6.1    2020-02-02 [1] CRAN (R 4.0.0)
#>  knitr         1.28.5   2020-04-28 [1] Github (yihui/knitr@93b46ba)
#>  lattice       0.20-41  2020-04-02 [1] CRAN (R 4.0.0)
#>  lifecycle     0.2.0    2020-03-06 [1] CRAN (R 4.0.0)
#>  lubridate     1.7.8    2020-04-06 [1] CRAN (R 4.0.0)
#>  magrittr      1.5      2014-11-22 [1] CRAN (R 4.0.0)
#>  memoise       1.1.0    2017-04-21 [1] CRAN (R 4.0.0)
#>  modelr        0.1.6    2020-02-22 [1] CRAN (R 4.0.0)
#>  munsell       0.5.0    2018-06-12 [1] CRAN (R 4.0.0)
#>  nlme          3.1-147  2020-04-13 [1] CRAN (R 4.0.0)
#>  pillar        1.4.3    2019-12-20 [1] CRAN (R 4.0.0)
#>  pkgbuild      1.0.7    2020-04-25 [1] CRAN (R 4.0.0)
#>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.0.0)
#>  pkgload       1.0.2    2018-10-29 [1] CRAN (R 4.0.0)
#>  prettyunits   1.1.1    2020-01-24 [1] CRAN (R 4.0.0)
#>  processx      3.4.2    2020-02-09 [1] CRAN (R 4.0.0)
#>  ps            1.3.2    2020-02-13 [1] CRAN (R 4.0.0)
#>  purrr       * 0.3.4    2020-04-17 [1] CRAN (R 4.0.0)
#>  R6            2.4.1    2019-11-12 [1] CRAN (R 4.0.0)
#>  Rcpp          1.0.4.6  2020-04-09 [1] CRAN (R 4.0.0)
#>  readr       * 1.3.1    2018-12-21 [1] CRAN (R 4.0.0)
#>  readxl        1.3.1    2019-03-13 [1] CRAN (R 4.0.0)
#>  remotes       2.1.1    2020-02-15 [1] CRAN (R 4.0.0)
#>  reprex        0.3.0    2019-05-16 [1] CRAN (R 4.0.0)
#>  rlang         0.4.5    2020-03-01 [1] CRAN (R 4.0.0)
#>  rmarkdown     2.1.3    2020-04-28 [1] Github (rstudio/rmarkdown@e2ceb35)
#>  rprojroot     1.3-2    2018-01-03 [1] CRAN (R 4.0.0)
#>  rvest         0.3.5    2019-11-08 [1] CRAN (R 4.0.0)
#>  scales        1.1.0    2019-11-18 [1] CRAN (R 4.0.0)
#>  sessioninfo   1.1.1    2018-11-05 [1] CRAN (R 4.0.0)
#>  stringi       1.4.6    2020-02-17 [1] CRAN (R 4.0.0)
#>  stringr     * 1.4.0    2019-02-10 [1] CRAN (R 4.0.0)
#>  testthat      2.3.2    2020-03-02 [1] CRAN (R 4.0.0)
#>  tibble      * 3.0.1    2020-04-20 [1] CRAN (R 4.0.0)
#>  tidyr       * 1.0.2    2020-01-24 [1] CRAN (R 4.0.0)
#>  tidyselect    1.0.0    2020-01-27 [1] CRAN (R 4.0.0)
#>  tidyverse   * 1.3.0    2019-11-21 [1] CRAN (R 4.0.0)
#>  usethis       1.6.1    2020-04-29 [1] CRAN (R 4.0.0)
#>  utf8          1.1.4    2018-05-24 [1] CRAN (R 4.0.0)
#>  vctrs         0.2.4    2020-03-10 [1] CRAN (R 4.0.0)
#>  vroom         1.2.0    2020-01-13 [1] CRAN (R 4.0.0)
#>  withr         2.2.0    2020-04-20 [1] CRAN (R 4.0.0)
#>  xfun          0.13.1   2020-04-30 [1] Github (yihui/xfun@bf8afdd)
#>  xml2          1.3.2    2020-04-23 [1] CRAN (R 4.0.0)
#>  yaml          2.2.1    2020-02-01 [1] CRAN (R 4.0.0)
#>
#> [1] C:/Program Files/R/R-4.0.0/library
```
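The suggestion earlier in this thread, that `month` hold a month factor and `week` hold the N-th week within a month, could be sketched roughly as follows. This is only an illustration on made-up dates, not the script used in the reprex. The column name `week_of_month` and the rule "days 1 to 7 are week 1, days 8 to 14 are week 2, and so on" are assumptions; other definitions, such as calendar weeks starting on Monday, would give different numbers.

``` r
library(tibble)
library(dplyr)
library(lubridate)

# Made-up dates for illustration only.
toy <- tibble(
  published_date = as.Date(c("2020-03-01", "2020-03-08", "2020-03-29"))
)

toy2 <- toy %>%
  mutate(
    # Ordered factor of month names; the labels depend on the session locale.
    month = month(published_date, label = TRUE, abbr = FALSE),
    # Assumed rule: each block of 7 days from the 1st counts as one week.
    week_of_month = ceiling(day(published_date) / 7)
  )

toy2
```

Deriving both columns from `published_date` keeps the three variables consistent with each other, which may be easier to check than converting separate `month` and `week` date columns.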