Closed DataStrategist closed 7 years ago
In general it's already possible to join by timestamp using the difference_inner_join
function (see the below examples- this works because subtracting datetimes and treating it as numeric gets a difference in seconds).
However, what that question was asking for specifically was to find joins where one column was between two others, which is trickier and isn't something fuzzyjoin is currently designed for (it would require somewhat different syntax). It's not specific to timestamps so I'll close this issue, but may return to the problem!
library(readr)
library(fuzzyjoin)
d1 <- read_csv("row1,time1
1,2016-11-14 20:00:00
2,2016-11-14 21:00:00
3,2016-11-14 22:00:00")
d2 <- read_csv("row2,time2
1,2016-11-14 20:00:05
2,2016-11-14 22:00:10
3,2016-11-14 21:00:05")
# at most five seconds apart
difference_inner_join(d1, d2, by = c(time1 = "time2"), max_dist = 5)
#> # A tibble: 2 × 4
#> row1 time1 row2 time2
#> <int> <dttm> <int> <dttm>
#> 1 1 2016-11-14 20:00:00 1 2016-11-14 20:00:05
#> 2 2 2016-11-14 21:00:00 3 2016-11-14 21:00:05
# at most ten seconds apart
difference_inner_join(d1, d2, by = c(time1 = "time2"), max_dist = 10)
#> # A tibble: 3 × 4
#> row1 time1 row2 time2
#> <int> <dttm> <int> <dttm>
#> 1 1 2016-11-14 20:00:00 1 2016-11-14 20:00:05
#> 2 2 2016-11-14 21:00:00 3 2016-11-14 21:00:05
#> 3 3 2016-11-14 22:00:00 2 2016-11-14 22:00:10
# at most one second apart
difference_inner_join(d1, d2, by = c(time1 = "time2"), max_dist = 1)
#> # A tibble: 0 × 4
#> # ... with 4 variables: row1 <int>, time1 <dttm>, row2 <int>, time2 <dttm>
# at most an hour apart
difference_inner_join(d1, d2, by = c(time1 = "time2"), max_dist = 60 * 60)
#> # A tibble: 5 × 4
#> row1 time1 row2 time2
#> <int> <dttm> <int> <dttm>
#> 1 1 2016-11-14 20:00:00 1 2016-11-14 20:00:05
#> 2 2 2016-11-14 21:00:00 1 2016-11-14 20:00:05
#> 3 2 2016-11-14 21:00:00 3 2016-11-14 21:00:05
#> 4 3 2016-11-14 22:00:00 2 2016-11-14 22:00:10
#> 5 3 2016-11-14 22:00:00 3 2016-11-14 21:00:05
It's true, and difference_inner_join is close... but in fact, it would be enough if we could just specify to always round-down. THat way, it would pick the nearest timestamp (as long as it's greater). If that condition were applied AND if the timestamps are arranged chronologically, it would be sufficient, no?
I saw this on SO: http://stackoverflow.com/questions/40632295/how-do-i-select-values-from-a-lookup-table-based-on-their-time-stamp, where the asker would like to match based on ranges... but it gave me the idea... your fuzzyjoin could also match based on closest timestamp. Could be a good feature for this package, no?