bnosac / cronR

A simple R package for managing your cron jobs.

Dataframe produced through cronR has wrong format. #26

nicocriscuolo closed this issue 4 years ago

nicocriscuolo commented 4 years ago

I've created a script that reads data from two GitHub repositories, reformats the datasets, binds them together by rows, and writes the result to a new .csv file. I then scheduled the script to run every hour with the cronR package.

Here's my code:

devtools::install_github("tidyverse/googlesheets4")

library(dplyr)
library(googlesheets4)
library(RCurl)

setwd(dir = "YOUR_WORKING_DIRECTORY")

###############################################################################
#================== TIME SERIES DATA FOR CASES AND DEATHS ====================#
###############################################################################

# 1. #####==== DATASETS =====#####

# 1.1 ###= Cases #####

# These files are updated on GitHub every day.
cases <- read.csv(text = getURL(url = "https://raw.githubusercontent.com/openZH/covid_19/master/COVID19_Cases_Cantons_CH_total.csv"),
                  header = TRUE,
                  stringsAsFactors = FALSE,
                  na.strings = c("", "NA"),
                  encoding = "UTF-8")

# Remove data for the whole of Switzerland and Liechtenstein
cases <- subset(x = cases,
                !is.element(el = canton,
                            set = c("CH", "FL")),
                select = c("date",
                           "canton",
                           "tested_pos"))

names(cases)[1] <- "Date"

# Dataset restructured according to the cases dataset format
cases <- reshape(data = cases,
                 idvar = "Date",
                 timevar = "canton",
                 v.names = "tested_pos",
                 direction = "wide")

# fixed = TRUE makes the "." match a literal dot instead of any character
names(cases) <- gsub(pattern = "tested_pos.",
                     replacement = "",
                     x = names(cases),
                     fixed = TRUE)

cases[is.na(cases)] <- 0

cases <- cases[order(cases$Date,
                     decreasing = FALSE), ]

# More updated dataset
cases2 <- read.csv(text = getURL(url = "https://raw.githubusercontent.com/daenuprobst/covid19-cases-switzerland/master/covid19_cases_switzerland.csv"),
                   header = TRUE,
                   stringsAsFactors = FALSE,
                   na.strings = c("", "NA"),
                   encoding = "UTF-8")

# Remove total daily cases for Switzerland
cases2 <- subset(x = cases2,
                 select = -c(CH))

# rbind between two cases datasets
cases_tot <- bind_rows(cases[1:7, ],
                       cases2)

rownames(cases_tot) <- seq(from = 1,
                           to = nrow(cases_tot),
                           by = 1)

write.csv(x = cases_tot,
          file = file.path(getwd(),
                           "cases_tot.csv"),
          row.names = FALSE,
          quote = FALSE)

When I run the script manually, everything works and the resulting .csv is fine. When I schedule it through the cronR package (in the RStudio IDE: Addins -> Schedule R scripts on Linux/Unix), the saved .csv differs only in the "Date" column: the dates from the first dataset are in the first column, but the dates from the second dataset (bound to the first with bind_rows()) end up in a separate last column, whose header has a strange new name (as you can see from this image).

Moreover, if I use rbind() instead, the script launched through cronR fails outright with an error from rbind().

Do you have any idea of what could be the problem? Thanks a lot!

P.S.: I work on a late-2016 MacBook Pro with 8 GB of RAM, running macOS Catalina.

jwijffels commented 4 years ago

This is not an error of this package. If you want to inspect where your own code is failing, save an RData file and inspect your objects.
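
The suggestion above can be sketched roughly as follows. This is only an illustration: the toy data frame, the snapshot path, and the name `debug_env` are assumptions made for the example, not part of the original script.

```r
# Toy object standing in for a data frame built by the scheduled script
# (values invented for illustration).
cases_tot <- data.frame(Date = c("2020-03-05", NA),
                        AG = c(0, 10),
                        stringsAsFactors = FALSE)

# Hypothetical snapshot path; a real script would use a fixed, writable location.
snapshot_file <- file.path(tempdir(), "cron_debug.RData")

# At the end of the scheduled script: save the objects of interest.
save(cases_tot, file = snapshot_file)

# Later, in an interactive session: load the snapshot into its own
# environment so it does not overwrite objects in the workspace.
debug_env <- new.env()
load(snapshot_file, envir = debug_env)

# Inspect structure and column names exactly as the scheduled run saw them.
str(debug_env$cases_tot)
names(debug_env$cases_tot)
```

Comparing `names()` of the snapshot with the same object built interactively shows at which step the two runs start to differ.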

nicocriscuolo commented 4 years ago

I'm sorry, but have you tried running my script? If you do, you will see that the code does not fail and there is no error at all: the .csv produced is correct, with the "Date" column in the right format. Only when I schedule the script does the resulting dataset have the wrong format.

jwijffels commented 4 years ago

Yes. It 'fails' under your definition of failure.

> # 1. #####==== DATASETS =====#####
> 
> # 1.1 ###= Cases #####
> 
> # These files are updated on GitHub every day.
> cases <- read.csv(text = getURL(url = "https://raw.githubusercontent.com/openZH/covid_19/master/COVID19_Cases_Cantons_CH_total.csv"),
+                   header = TRUE,
+                   stringsAsFactors = FALSE,
+                   na.strings = c("", "NA"),
+                   encoding = "UTF-8")
> 
> # Remove data for the whole of Switzerland and Liechtenstein
> cases <- subset(x = cases,
+                 !is.element(el = canton,
+                             set = c("CH", "FL")),
+                 select = c("date",
+                            "canton",
+                            "tested_pos"))
> 
> names(cases)[1] <- "Date"
> 
> # Dataset restructured according to the cases dataset format
> cases <- reshape(data = cases,
+                  idvar = "Date",
+                  timevar = "canton",
+                  v.names = "tested_pos",
+                  direction = "wide",
+ )
> 
> names(cases) <- gsub(pattern = "tested_pos.",
+                      replacement = "",
+                      x = names(cases))
> 
> cases[is.na(cases)] <- 0
> 
> cases <- cases[order(cases$Date,
+                      decreasing = FALSE), ]
> 
> # More updated dataset
> cases2 <- read.csv(text = getURL(url = "https://raw.githubusercontent.com/daenuprobst/covid19-cases-switzerland/master/covid19_cases_switzerland.csv"),
+                    header = TRUE,
+                    stringsAsFactors = FALSE,
+                    na.strings = c("", "NA"),
+                    encoding = "UTF-8")
> 
> # Remove total daily cases for Switzerland
> cases2 <- subset(x = cases2,
+                  select = -c(CH))
> 
> # rbind between two cases datasets
> cases_tot <- bind_rows(cases[1:7, ],
+                        cases2)
> 
> rownames(cases_tot) <- seq(from = 1,
+                            to = nrow(cases_tot),
+                            by = 1)
> cases_tot
         Date  AG AI AR  BE  BL  BS  FR   GE GL  GR JU  LU  NE NW OW  SG SH SO SZ TG   TI UR   VD  VS ZG   ZH X.U.FEFF.Date
1  2020-02-28   0  0  0   0   1   0   0    0  0   0  0   0   0  0  0   0  0  0  0  0    0  0    0   0  0    0          <NA>
2  2020-02-29   0  0  0   0   2   0   0    0  0   0  0   0   0  0  0   0  0  0  0  0    0  0    0   0  0    0          <NA>
3  2020-03-01   0  0  0   0   2   0   0    0  0   0  0   0   0  0  0   0  0  0  0  0    0  0    0   0  0    0          <NA>
4  2020-03-02   0  0  0   0   2   0   0    0  0   0  0   0   0  0  0   0  0  0  0  0    0  0    0   0  0    0          <NA>
5  2020-03-03   0  0  0   0   2   0   0    0  0   0  0   0   0  0  0   0  0  0  0  0    0  0    0   0  0    0          <NA>
6  2020-03-04   0  0  0   0   2   0   0    0  0   0  0   0   0  0  0   0  0  0  0  0    0  0    0   0  0    0          <NA>
7  2020-03-05   0  0  0   0   6   0   0    0  0   0  0   0   0  0  0   0  0  0  0  0    0  0    0   0  0    0          <NA>
8        <NA>  10  0  1  20   9  17   7   15  0  13  1   2   8  0  0   1  0  1  7  1   37  0   23   4  5   24    2020-03-06
9        <NA>  14  0  1  25  13  22   7   24  0  15  3   4  11  0  0   1  0  1  7  1   43  0   30   5  6   28    2020-03-07
10       <NA>  14  0  1  31  19  25   8   32  0  17  3   6  13  0  0   3  0  1  7  1   58  0   40   5  7   34    2020-03-08
11       <NA>  14  0  2  34  20  29  10   33  0  17  3   5  17  0  0   3  0  1  7  2   67  0   51   7  7   36    2020-03-09
12       <NA>  15  0  2  39  22  39  12   59  0  18  4   6  27  0  0   7  0  1  7  3   91  0   77  15  7   45    2020-03-10
13       <NA>  17  0  2  41  23  49  16   66  2  23  5   7  27  3  0  10  1  3  8  4  131  0  108  18  7   55    2020-03-11
14       <NA>  22  0  2  51  29  80  26   80  2  28  7   8  33  4  0  14  1  7  9  5  170  0  156  23  7   91    2020-03-12
15       <NA>  28  0  5  62  42 111  31  104  4  39  7  13  37  5  7  19  1  8 10  5  218  2  222  29  7  140    2020-03-13
16       <NA>  31  2  5  78  48 119  38  196  5  47 10  19  39  5  8  26  1 10 13  5  262  2  273  47  9  148    2020-03-14
17       <NA>  NA NA NA  NA  NA  NA  NA  281 NA  NA NA  NA  NA NA NA  NA NA NA 13 NA   NA NA  406  NA NA   NA    2020-03-15
18       <NA>  52 NA NA 131  NA 144  NA  373 NA  NA NA  NA  74 10 NA  35 NA NA NA 17  330 NA  508  NA NA  270    2020-03-16
19       <NA>  67 NA NA  NA  NA 165  NA  495 NA  NA 23  50  77 12 NA  47 NA NA NA 23  422 NA  608  95 NA  294    2020-03-17
20       <NA> 101  3 10 193 116 182  67  629 10 116 25  48  99 18 16  61  6 31 29 32  511  5  796 129 17  424    2020-03-18
21       <NA> 118  3 16 282 139 222  82  826 16 145 27  63 132 25 17  85  8 39 35 36  638  7 1212 232 23  526    2020-03-19
22       <NA> 168  3 18 377 184 272 109  994 18 159 29  92 159 28 17  98 14 46 39 49  834  7 1432 282 48  773    2020-03-20
23       <NA> 175  3 22 418 282 299 136 1128 21 239 49 109 177 33 19 137 17 58 56 56  918 12 1676 359 NA   NA    2020-03-21
24       <NA> 232  3 26  NA 289 358 162 1203 24 257 51 131 188 36 20 163 28 68 70 75  939 12 1782 432 NA  891    2020-03-22
25       <NA> 241  4 30 470 302 376 189 1231 29 266 57 156 204 39 25 185 30 95 73 81 1165 22 1880 492 62 1068    2020-03-23
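
The last column name, X.U.FEFF.Date, is consistent with a "Date" header preceded by a byte-order mark (U+FEFF) that R then sanitised into a syntactic name, so the two data frames no longer share a "Date" column. Why the mark would survive only in some runs is not established in this thread. The sketch below uses invented values to show how such a name mismatch would produce both the extra column from bind_rows() and an rbind() error, together with one possible workaround: renaming the first column, as the script already does for the first dataset.

```r
library(dplyr)

# Toy stand-ins for the two datasets; values are invented for illustration.
first  <- data.frame(Date = "2020-03-05", AG = 0,
                     stringsAsFactors = FALSE)
second <- data.frame(X.U.FEFF.Date = "2020-03-06", AG = 10,
                     stringsAsFactors = FALSE)

# bind_rows() matches columns by name, so the mismatched date column
# becomes an extra column padded with NA instead of raising an error.
combined <- bind_rows(first, second)
names(combined)

# rbind() requires matching column names and stops with an error;
# the message is captured here instead of aborting the script.
rbind_error <- tryCatch(rbind(first, second),
                        error = function(e) conditionMessage(e))
rbind_error

# Possible workaround (an assumption, not confirmed in the thread):
# force the name of the first column before binding.
names(second)[1] <- "Date"
fixed <- bind_rows(first, second)
fixed
```

Renaming by position only makes sense if the date is known to be the first column of the second file, which is the case in the output shown above.
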

nicocriscuolo commented 4 years ago

[Screenshot 2020-03-24 at 08 55 06]

I'm sorry, but this doesn't happen in my RStudio console. Any idea on how to figure out why this happens? Thanks
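
One way to investigate, sketched under the assumption that the two sessions differ in locale or environment (the thread does not confirm this), is to have the script write a few diagnostics to a log file and compare the logs from each run. The helper `run_diagnostics()`, the toy data frame, and the log path are hypothetical names introduced for this example.

```r
# Hypothetical helper: collect session details worth comparing between runs.
run_diagnostics <- function(df) {
  c(locale       = Sys.getlocale(),
    lang         = Sys.getenv("LANG", unset = "<unset>"),
    r_version    = R.version.string,
    first_column = names(df)[1])
}

# Toy data frame standing in for the second dataset right after read.csv().
example <- data.frame(Date = "2020-03-06", AG = 10,
                      stringsAsFactors = FALSE)

# Hypothetical log location; a real script would use a fixed, writable path.
log_file <- file.path(tempdir(), "run_diagnostics.log")

# Write one "name: value" line per diagnostic.
diag <- run_diagnostics(example)
writeLines(paste(names(diag), diag, sep = ": "), con = log_file)

# Print the log so it can be compared with the one from the other session.
cat(readLines(log_file), sep = "\n")
```

If the `locale` or `first_column` lines differ between the two logs, that narrows the problem down to how the session reads the file rather than to the binding step.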

jwijffels commented 4 years ago

Please ask questions about your code issues on other platforms. This is not related to the cronR package.