ATFutures / calendar

R interface to iCal (.ics files)
https://atfutures.github.io/calendar/

Eurostat issue #55

Closed serkor1 closed 2 months ago

serkor1 commented 2 months ago

Hi,

I have a slight issue with parsing calendar data from Eurostat using ic_dataframe and/or ic_read. This issue only arises when I use {calendar}; there is no issue when using {ical}.

The issue is as follows: the dates are not parsed correctly when using ic_dataframe or ic_read (every row ends up with the first event's date), but this is not an issue when using ical::ical_parse_df. The issue can be mitigated, however, by using a mix of ic_list, lapply and do.call. See the MWE below:

rm(list = ls()); gc();
#>           used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells  629306 33.7    1434458 76.7   727117 38.9
#> Vcells 1108619  8.5    8388608 64.0  1972973 15.1

ical <- readLines(
  "https://ec.europa.eu/eurostat/o/calendars/eventsIcal?theme=2&category=2"
)
head(
  DT <- calendar::ic_dataframe(
    ical
  )$`DTSTART;VALUE=DATE`
)
#> Warning in `[<-.data.frame`(`*tmp*`, date_cols, value = list(structure(19363,
#> class = "Date"), : provided 125 variables to replace 1 variables
#> [1] "2023-01-06" "2023-01-06" "2023-01-06" "2023-01-06" "2023-01-06"
#> [6] "2023-01-06"
ical_list <- calendar::ic_list(
  x = ical
)

head(
  DT <- do.call(
    rbind,
    lapply(
      ical_list,
      function(element){
        # 1) remove the last
        # element as it is a bunch
        # of html codes
        element <- element[-length(element)]

        # 2) split element
        split_element <- strsplit(
          element,
          split = ":"
        )

        do.call(
          cbind,
          lapply(
            split_element,
            function(x){

              DT <- data.frame(
                value = x[2]
              )

              names(DT) <- x[1]

              DT

            }
          )
        )

      }
    )
  )$`DTSTART;VALUE=DATE`
)
#> [1] "20230106" "20230106" "20230110" "20230111" "20230111" "20230118"
# {ical} package, for comparison
head(
  DT <- ical::ical_parse_df(
    text = ical
  )$start
)
#> [1] "1970-01-01 01:00:00 CET" "2023-01-06 01:00:00 CET"
#> [3] "2023-01-06 01:00:00 CET" "2023-01-10 01:00:00 CET"
#> [5] "2023-01-11 01:00:00 CET" "2023-01-11 01:00:00 CET"

Created on 2024-08-07 with reprex v2.1.0

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.1 (2024-06-14)
#>  os       Zorin OS 17.1
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_US:en
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Copenhagen
#>  date     2024-08-07
#>  pandoc   3.1.11 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  calendar      0.1.0   2024-04-28 [1] CRAN (R 4.4.1)
#>  cli           3.6.3   2024-06-21 [1] CRAN (R 4.4.1)
#>  curl          5.2.1   2024-03-01 [1] CRAN (R 4.4.0)
#>  digest        0.6.36  2024-06-23 [1] CRAN (R 4.4.1)
#>  evaluate      0.24.0  2024-06-10 [1] CRAN (R 4.4.0)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>  fs            1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
#>  glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>  ical          0.1.6   2019-01-21 [1] CRAN (R 4.4.1)
#>  jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.4.0)
#>  knitr         1.47    2024-05-29 [1] CRAN (R 4.4.0)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.4.0)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.4.0)
#>  R.oo          1.26.0  2024-01-24 [1] CRAN (R 4.4.0)
#>  R.utils       2.12.3  2023-11-18 [1] CRAN (R 4.4.0)
#>  Rcpp          1.0.12  2024-01-09 [1] CRAN (R 4.4.0)
#>  reprex        2.1.0   2024-01-11 [3] CRAN (R 4.4.0)
#>  rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.0)
#>  rmarkdown     2.27    2024-05-17 [1] CRAN (R 4.4.0)
#>  rstudioapi    0.16.0  2024-03-24 [3] CRAN (R 4.4.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>  styler        1.10.3  2024-04-07 [1] CRAN (R 4.4.0)
#>  V8            4.4.2   2024-02-15 [2] CRAN (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>  withr         3.0.0   2024-01-16 [1] CRAN (R 4.4.0)
#>  xfun          0.45    2024-06-16 [1] CRAN (R 4.4.0)
#>  yaml          2.3.8   2023-12-11 [1] CRAN (R 4.4.0)
#>
#>  [1] /home/serkan/R/x86_64-pc-linux-gnu-library/4.4
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

Robinlovelace commented 2 months ago

Many thanks for the reproducible example @serkor1. I don't have any time to look at this right now. Do you have any ideas for a fix?

serkor1 commented 2 months ago

Hi @Robinlovelace - I actually don't as I am new to the package.

But I would be happy to browse and explore and possibly post a fix during the weekend to assist the development, if you want!

Robinlovelace commented 2 months ago

> But I would be happy to browse and explore and possibly post a fix during the weekend to assist the development, if you want!

That would be amazing! If you have any questions, just let me know. Also cc'ing package co-author @layik 🙏

serkor1 commented 2 months ago

Actually, it's a trivial fix it seems. See the code below,

https://github.com/ATFutures/calendar/blob/bfd95e0cced2a9977b6c7b9b38502e9ec6557006/R/ic_dataframe.R#L38C3-L44C4

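You are assigning x_df[date_cols] <- lapply(x_df[, date_cols], ic_date). When only one column matches, x_df[, date_cols] drops to a plain vector, so lapply() returns a list with one element per value rather than one per column (hence the "provided 125 variables to replace 1 variables" warning). The fix is to extract the columns as a data.frame, using the following code: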

x_df[date_cols] <- lapply(x_df[date_cols], ic_date)
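
To illustrate the difference between the two indexing forms, here is a minimal base-R sketch on a toy data frame. The data are made up, and `to_date()` is a hypothetical stand-in for `ic_date()`, assumed here to parse "YYYYMMDD" strings; it is not the package's implementation.

```r
# Toy data frame with a single date-like column (made-up data)
x_df <- data.frame(
  SUMMARY = c("a", "b", "c"),
  `DTSTART;VALUE=DATE` = c("20230106", "20230110", "20230111"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)
date_cols <- grepl("VALUE=DATE", names(x_df))

# Hypothetical stand-in for ic_date(): parse "YYYYMMDD" strings
to_date <- function(x) as.Date(x, format = "%Y%m%d")

# `[, j]` with one selected column drops to a character vector,
# so lapply() iterates over the values: one list element per row
dropped <- lapply(x_df[, date_cols], to_date)
length(dropped)

# `[j]` always returns a data frame,
# so lapply() iterates over columns: one list element per column
kept <- lapply(x_df[date_cols], to_date)
length(kept)

# Assigning the per-column list back gives a proper Date column
x_df[date_cols] <- kept
x_df$`DTSTART;VALUE=DATE`
```

Writing `x_df[, date_cols, drop = FALSE]` would also keep the data frame structure; the list-style `x_df[date_cols]` is simply the shorter spelling.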

Full solution:

ic_dataframe <- function(x) {

  if(methods::is(object = x, class2 = "data.frame")) {
    return(x)
  }

  stopifnot(methods::is(object = x, class2 = "character") | methods::is(object = x, class2 = "list"))

  if(methods::is(object = x, class2 = "character")) {
    x_list <- ic_list(x)
  } else if(methods::is(object = x, class2 = "list")) {
    x_list <- x
  }

  x_list_named <- lapply(x_list, function(x) {
    ic_vector(x)
  })

  x_df <- ic_bind_list(x_list_named)

  date_cols <- grepl(pattern = "VALUE=DATE", x = names(x_df))

  if(any(date_cols)) {
    x_df[date_cols] <- lapply(x_df[date_cols], ic_date)
  }
  datetime_cols <- names(x_df) %in% c("DTSTART", "DTEND")
  if(any(datetime_cols)) {
    x_df[datetime_cols] <- lapply(x_df[datetime_cols], ic_datetime)
  }

  # names(x_df) <- gsub(pattern = ".VALUE.DATE", replacement = "", names(x_df))

  x_df
}

Showcase the solution

# library
rm(list = ls()); gc(); devtools::load_all()

# read ical
ical <- readLines(
  "https://ec.europa.eu/eurostat/o/calendars/eventsIcal?theme=2&category=2"
)

# convert to data.frame
DT <- ic_dataframe(
  ical
)

# check dates
head(DT$`DTSTART;VALUE=DATE`)

# > "2023-01-06" "2023-01-06" "2023-01-10" "2023-01-11" "2023-01-11" "2023-01-18"

Which is what we want. I have run devtools::check() which runs without errors! The fix should be safe and trivial to implement!

Edit: Assuming that the helper functions work without any issues, I believe this solution is robust. I have tested this by adding a few more date variables after locating the offending lines of code!
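
A check along those lines could look roughly like the sketch below. It does not call {calendar}: `to_date()` and `convert_dates()` are hypothetical helpers that only mirror the date-column step of ic_dataframe, and the input data are made up.

```r
# Hypothetical stand-in for ic_date(): parse "YYYYMMDD" strings
to_date <- function(x) as.Date(x, format = "%Y%m%d")

# Mirrors the fixed date-column step: convert every "VALUE=DATE" column
convert_dates <- function(df) {
  date_cols <- grepl("VALUE=DATE", names(df), fixed = TRUE)
  if (any(date_cols)) {
    df[date_cols] <- lapply(df[date_cols], to_date)
  }
  df
}

# One date column (the case that triggered the warning) ...
one_col <- data.frame(
  `DTSTART;VALUE=DATE` = c("20230106", "20230110"),
  check.names = FALSE
)
# ... and two date columns
two_cols <- data.frame(
  `DTSTART;VALUE=DATE` = c("20230106", "20230110"),
  `DTEND;VALUE=DATE`   = c("20230107", "20230111"),
  check.names = FALSE
)

res_one <- convert_dates(one_col)
res_two <- convert_dates(two_cols)

# Every date column should be of class Date, with one value per row
stopifnot(
  inherits(res_one[[1]], "Date"),
  nrow(res_one) == 2,
  all(vapply(res_two, inherits, logical(1), what = "Date"))
)
```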

Robinlovelace commented 2 months ago

Great work, super simple fix, thank you so much! So this line and perhaps one other need to change?

https://github.com/ATFutures/calendar/blob/bfd95e0cced2a9977b6c7b9b38502e9ec6557006/R/ic_dataframe.R#L39

You should be able to edit the file and put in a PR here: ...

Robinlovelace commented 2 months ago

https://github.com/ATFutures/calendar/edit/master/R/ic_dataframe.R#L39

serkor1 commented 2 months ago

Yes, both need to be changed. I didn't want to do a PR for such a trivial fix!

But I'll do one later this evening, unless you want to fix it right now 😃

Great package by the way, really love it!

Robinlovelace commented 2 months ago

Awesome! Yes, will await your input, as a learning experience. It's your idea, so it will be good to have your name on the fix 👍