EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/
Other
132 stars 12 forks source link

thicken slow because of character conversion? #52

Closed DavisVaughan closed 6 years ago

DavisVaughan commented 6 years ago

Hey @EdwinTh, I'm a big fan of the package. I had a quick question. It looks like thicken() might be taking longer than it needs to for objects that inherit from POSIXt. After doing some profiling, I noticed that the function posix_to_date() in round_thicken() attempts to convert a POSIXt object to a Date object (if it can) after the thickening. This is great, but it does so by first converting to character, then to date, see this line.

Is the conversion to character necessary? I removed it and ran some speed tests. See below for the potential gains without that one conversion. A ~10x speed gain isn't bad for the 100k row dataset.

With the conversion

library(microbenchmark)
library(padr)
suppressPackageStartupMessages(library(dplyr))

microbenchmark::microbenchmark(
  thicken(emergency, "day"),
  times = 5
)
#> Unit: milliseconds
#>                       expr      min       lq     mean   median       uq
#>  thicken(emergency, "day") 988.6313 1037.317 1078.123 1066.518 1081.611
#>       max neval
#>  1216.539     5

Without the as.character() conversion.

library(microbenchmark)
library(padr)
suppressPackageStartupMessages(library(dplyr))

microbenchmark::microbenchmark(
  thicken(emergency, "day"),
  times = 5
)
#> Unit: milliseconds
#>                       expr      min      lq     mean   median       uq
#>  thicken(emergency, "day") 131.0412 133.329 189.0237 133.4806 148.5445
#>       max neval
#>  398.7233     5
EdwinTh commented 6 years ago

Thanks @DavisVaughan, I worked hard on improving thicken’s performance for v0.4, but this I definitely overlooked. Its still in time for the release, will adjust it as soon as I find the time.

EdwinTh commented 6 years ago

I retrieved why I added this (should have made a comment). Strangely enough, as.Date will set the output of thickening to the wrong date. (See the first unit tests of test_thicken_integration). This might be due to timezones, I still need to study this. Apparently, I did a quick and dirty workaround, but this thus gives performance loss. We should dig around more to find a better performing solution.

DavisVaughan commented 6 years ago

Which tests in particular? I'm pretty sure I just ran all the test_thicken_integration tests with no issues after removing the as.character(). I wouldn't doubt that some work around might be needed, but I'm not seeing it so far.

EdwinTh commented 6 years ago

Thats what I thought, you have US locale, I have CET. This is so tricky to unit test properly. My guess is that it takes a central time (like UTC) and determines the day relative to this time zone for your own time zone. I will expand the unit tests soon, so it includes multiple time zones. We can than find a better performing solution.

DavisVaughan commented 6 years ago

as.Date() has an S3 method specifically for POSIXct conversion. It supports a tz argument.

x <- as.Date(x, tz = tz(x))

This works! I changed my Sys.setenv(TZ = "CET") so that it would match yours. I confirmed the error with just using as.Date(x). With the above change, I think it works as expected.

Time zones are the bane of my existence.

EdwinTh commented 6 years ago

Beautiful, thanks for the search.

And yes time zones should be abandoned!