EdwinTh / padr

Padding of missing records in time series
https://edwinth.github.io/padr/

Does padr have a maximum limit on year? #51

Closed dareneiri closed 3 years ago

dareneiri commented 6 years ago

It seems that padr has an upper limit on the years it can process. Some datasets, such as the MIMIC-III database, have datetimes randomly shifted into the future, so some years are set in 2100, for example. If the year is more than about 20 years past the current year, padr cannot thicken the data.

For now, it seems I can work around this by subtracting years from the datetimes, since the year is not relevant to my analysis.
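
The year-shifting workaround described above could look roughly like the following. This is a minimal base R sketch, not padr code: the offset of `365 * 64` days is an arbitrary choice made for illustration, and `trunc()` stands in for the hour-level rounding that `thicken(interval = 'hour')` would add as a new column.

```r
# Sketch of the workaround: shift datetimes to before 2038, round, shift back.
# The offset is a whole number of days, so hour and day boundaries stay
# aligned in UTC. It is NOT a whole number of calendar years, so the shifted
# calendar dates differ from the originals until the offset is added back.
offset <- as.difftime(365 * 64, units = "days")  # illustrative assumption

x <- as.POSIXct(c("2038-11-04 18:30:00", "2038-11-04 19:12:00"), tz = "UTC")

shifted <- x - offset                                      # now in the 1970s
floored_shifted <- as.POSIXct(trunc(shifted, units = "hours"))
floored <- floored_shifted + offset                        # back in 2038

format(floored, "%Y-%m-%d %H:%M:%S")
# "2038-11-04 18:00:00" "2038-11-04 19:00:00"
```

With padr, the same idea would be to call `thicken()` on the shifted column and add the offset back to both datetime columns afterwards. For intervals coarser than a day (week, month, year), a plain day offset may not preserve the boundaries, so the offset would need more care.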

If I try to thicken the data without changing the year, I get an error. Here is some sample data:

> packageVersion("tidyverse")
[1] ‘1.1.1’
> packageVersion("lubridate")
[1] ‘1.6.0’
> packageVersion("padr")
[1] ‘0.3.0’
> library(tidyverse)
> library(lubridate)
> library(padr)
> 
> df <- read.csv("padr_data.csv")
> df <- mutate_at(df, vars(ends_with("time")), funs(ymd_hms(., tz = "UTC", locale = Sys.getlocale("LC_TIME"))- dyears(63)))
> 
> df$sbp <- as.numeric(df$sbp)
> 
> summary(df)
   charttime                        sbp       
 Min.   :2038-11-04 18:30:00   Min.   : 62.0  
 1st Qu.:2038-11-04 19:33:45   1st Qu.: 84.5  
 Median :2038-11-04 20:52:30   Median : 95.0  
 Mean   :2038-11-04 21:08:22   Mean   :100.9  
 3rd Qu.:2038-11-04 22:26:15   3rd Qu.:102.0  
 Max.   :2038-11-05 00:42:00   Max.   :217.0  
                               NA's   :12     
> lapply(df, class)
$charttime
[1] "POSIXct" "POSIXt" 

$sbp
[1] "numeric"

> df$charttime %>% get_interval
[1] "min"
> 
> # this does not work
> df[!is.na(df$charttime),] %>%
+   thicken(interval = 'hour')
Error in if (to_date) x <- as.Date(x, tz = attr(x, "tzone")) : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In round_down_core(a, b) : NAs introduced by coercion to integer range
2: In round_down_core(a, b) : NAs introduced by coercion to integer range

Changing dyears(63) to dyears(64) makes it work:

> df <- mutate_at(df, vars(ends_with("time")), funs(ymd_hms(., tz = "UTC", locale = Sys.getlocale("LC_TIME"))- dyears(64)))
> 
> df$sbp <- as.numeric(df$sbp)
> 
> summary(df)
   charttime                        sbp       
 Min.   :2037-11-04 18:30:00   Min.   : 62.0  
 1st Qu.:2037-11-04 19:33:45   1st Qu.: 84.5  
 Median :2037-11-04 20:52:30   Median : 95.0  
 Mean   :2037-11-04 21:08:22   Mean   :100.9  
 3rd Qu.:2037-11-04 22:26:15   3rd Qu.:102.0  
 Max.   :2037-11-05 00:42:00   Max.   :217.0  
                               NA's   :12     
> lapply(df, class)
$charttime
[1] "POSIXct" "POSIXt" 

$sbp
[1] "numeric"

> df$charttime %>% get_interval
[1] "min"
> 
> # this does work
> df[!is.na(df$charttime),] %>%
+   thicken(interval = 'hour')
             charttime sbp      charttime_hour
1  2037-11-04 18:30:00  NA 2037-11-04 18:00:00
2  2037-11-04 18:45:00  62 2037-11-04 18:00:00
3  2037-11-04 19:00:00  66 2037-11-04 19:00:00
4  2037-11-04 19:12:00  NA 2037-11-04 19:00:00
5  2037-11-04 19:14:00  NA 2037-11-04 19:00:00
6  2037-11-04 19:15:00 217 2037-11-04 19:00:00
7  2037-11-04 19:26:00  NA 2037-11-04 19:00:00
8  2037-11-04 19:30:00 102 2037-11-04 19:00:00
9  2037-11-04 19:45:00  94 2037-11-04 19:00:00
10 2037-11-04 19:59:00  NA 2037-11-04 19:00:00
11 2037-11-04 20:00:00  80 2037-11-04 20:00:00
12 2037-11-04 20:04:00  NA 2037-11-04 20:00:00
13 2037-11-04 20:15:00  91 2037-11-04 20:00:00
14 2037-11-04 20:30:00  86 2037-11-04 20:00:00
15 2037-11-04 20:45:00  96 2037-11-04 20:00:00
16 2037-11-04 21:00:00  73 2037-11-04 21:00:00
17 2037-11-04 21:15:00  84 2037-11-04 21:00:00
18 2037-11-04 21:30:00  96 2037-11-04 21:00:00
19 2037-11-04 21:45:00 100 2037-11-04 21:00:00
20 2037-11-04 21:51:00  NA 2037-11-04 21:00:00
21 2037-11-04 22:00:00  NA 2037-11-04 22:00:00
22 2037-11-04 22:15:00 123 2037-11-04 22:00:00
23 2037-11-04 22:30:00 125 2037-11-04 22:00:00
24 2037-11-04 22:45:00 132 2037-11-04 22:00:00
25 2037-11-04 23:00:00  88 2037-11-04 23:00:00
26 2037-11-04 23:15:00  NA 2037-11-04 23:00:00
27 2037-11-04 23:45:00  NA 2037-11-04 23:00:00
28 2037-11-05 00:00:00 102 2037-11-05 00:00:00
29 2037-11-05 00:28:00  NA 2037-11-05 00:00:00
30 2037-11-05 00:42:00  NA 2037-11-05 00:00:00

EdwinTh commented 6 years ago

Thank you for informing me. It honestly never occurred to me to check so far into the future. I am not sure whether this is a padr issue or whether it is due to the underlying R POSIX mechanism. I will dig into it as soon as my schedule allows.

Blundys commented 5 years ago

I was also having this problem. I chased it down to round_down_core.cpp or round_up_core.cpp, so it looks like a C++ problem. It works for datetimes before 19 Jan 2038 and not after, so it looks like it is related to the Year 2038 problem.
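
The cutoff can be checked in plain R. The sketch below uses only base R; it does not call padr, and the two sample datetimes are simply taken from the summaries in the original report. It shows that the largest 32-bit signed integer, read as seconds since 1970-01-01 UTC, is 2038-01-19 03:14:07, and that coercing a later datetime to integer gives `NA`, consistent with the "NAs introduced by coercion to integer range" warning above.

```r
# Largest value a 32-bit signed integer can hold, in seconds since the epoch
limit <- .Machine$integer.max  # 2147483647

format(as.POSIXct(limit, origin = "1970-01-01", tz = "UTC"),
       "%Y-%m-%d %H:%M:%S")
# "2038-01-19 03:14:07"

# POSIXct stores seconds as a double, so the values themselves are fine
before <- as.numeric(as.POSIXct("2037-11-04 18:30:00", tz = "UTC"))
after  <- as.numeric(as.POSIXct("2038-11-04 18:30:00", tz = "UTC"))

before <= limit  # TRUE
after  <= limit  # FALSE

# Coercion to integer is where the value is lost
as_int_before <- as.integer(before)
as_int_after  <- suppressWarnings(as.integer(after))  # NA
```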

EdwinTh commented 5 years ago

Thanks for digging. I hope to schedule some time for maintenance soon to look into it further.

EdwinTh commented 5 years ago

I looked into it, and it is indeed the Year 2038 problem: POSIXt values past a certain moment in that year, expressed as seconds since the epoch, overflow a 32-bit integer. Alas, my research suggests there is no universal fix for 32-bit machines. Switching to int64 would make the package work only on 64-bit machines, which I am reluctant to do. For the moment I lean towards raising an informative error and leaving the workaround to the user.
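
An informative error of this kind might be sketched as follows. `check_posix_range()` is a hypothetical helper written for illustration; it is not part of padr's API, and the wording of the message is an assumption.

```r
# Hypothetical guard: stop early when a datetime cannot be represented as a
# 32-bit integer number of seconds since the epoch.
check_posix_range <- function(x) {
  stopifnot(inherits(x, "POSIXt"))
  secs <- as.numeric(as.POSIXct(x))
  if (any(abs(secs) > .Machine$integer.max, na.rm = TRUE)) {
    stop("Datetimes after 2038-01-19 03:14:07 UTC are not supported ",
         "(Year 2038 problem). Consider shifting the datetimes to an ",
         "earlier period before processing.", call. = FALSE)
  }
  invisible(x)
}

ok  <- as.POSIXct("2037-11-04 18:30:00", tz = "UTC")
bad <- as.POSIXct("2038-11-04 18:30:00", tz = "UTC")

check_posix_range(ok)  # passes silently
msg <- tryCatch(check_posix_range(bad), error = function(e) conditionMessage(e))
```

The check compares doubles, so it does not itself overflow, and it could run before any value is handed to integer-based code.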

EdwinTh commented 5 years ago

These are the unit tests that should pass once the problem is resolved.

a <- as.numeric(ymd_h(c("20380601 00", "20390601 00")))
b <- as.numeric(ymd_h(c("20380101 00", "20390101 00", "20400101 00")))

test_that("round_down_core works after 2038 in posix", {
  expect_equal(round_down_core(a, b), b[1:2])
})

test_that("round_down_core_prev works after 2038 in posix", {
  expect_equal(round_down_core_prev(a, b), b[1:2])
})

test_that("round_up_core works after 2038 in posix", {
  expect_equal(round_up_core(a, b), b[2:3])
})

test_that("round_up_core_prev works after 2038 in posix", {
  expect_equal(round_up_core_prev(a, b), b[2:3])
})
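
The expectations in these tests can be read as: round down to the latest breakpoint at or before each value, and round up to the next breakpoint after it. A base R sketch of that reading, done entirely in doubles so nothing overflows, is below. `round_down_ref()` and `round_up_ref()` are hypothetical names, not padr functions, and the exact boundary behavior of padr's C++ routines (including the `_prev` variants) is an assumption that this sketch does not cover.

```r
# Same inputs as the tests, built with base R instead of lubridate
a <- as.numeric(as.POSIXct(c("2038-06-01", "2039-06-01"), tz = "UTC"))
b <- as.numeric(as.POSIXct(c("2038-01-01", "2039-01-01", "2040-01-01"),
                           tz = "UTC"))

# findInterval() returns, for each a, the index of the last b that is <= a.
# Assumes b is sorted and every a lies in [b[1], b[length(b)]).
round_down_ref <- function(a, b) b[findInterval(a, b)]
round_up_ref   <- function(a, b) b[findInterval(a, b) + 1L]

down <- round_down_ref(a, b)  # equals b[1:2]
up   <- round_up_ref(a, b)    # equals b[2:3]
```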

Blundys commented 5 years ago

Thanks for looking into it. Yes, with an informative error people would at least understand what has gone wrong, so it is a good idea, at least in the short term.