dgrtwo / fuzzyjoin

Join tables together on inexact matching

Feature: fuzzyjoin based on approximate timestamps #19

Closed · DataStrategist closed this issue 7 years ago

DataStrategist commented 7 years ago

I saw this on SO: http://stackoverflow.com/questions/40632295/how-do-i-select-values-from-a-lookup-table-based-on-their-time-stamp, where the asker would like to match based on ranges. It gave me the idea that fuzzyjoin could also match based on the closest timestamp. Could be a good feature for this package, no?

dgrtwo commented 7 years ago

In general it's already possible to join by timestamp using the difference_inner_join function (see the examples below; this works because subtracting datetimes and treating the result as numeric gives a difference in seconds).

However, what that question was asking for specifically was to find joins where one column falls between two others, which is trickier and isn't something fuzzyjoin is currently designed for (it would require somewhat different syntax). It's not specific to timestamps, so I'll close this issue, but I may return to the problem!

library(readr)
library(fuzzyjoin)

d1 <- read_csv("row1,time1
1,2016-11-14 20:00:00
2,2016-11-14 21:00:00
3,2016-11-14 22:00:00")

d2 <- read_csv("row2,time2
1,2016-11-14 20:00:05
2,2016-11-14 22:00:10
3,2016-11-14 21:00:05")

# at most five seconds apart
difference_inner_join(d1, d2, by = c(time1 = "time2"), max_dist = 5)
#> # A tibble: 2 × 4
#>    row1               time1  row2               time2
#>   <int>              <dttm> <int>              <dttm>
#> 1     1 2016-11-14 20:00:00     1 2016-11-14 20:00:05
#> 2     2 2016-11-14 21:00:00     3 2016-11-14 21:00:05

# at most ten seconds apart
difference_inner_join(d1, d2, by = c(time1 = "time2"), max_dist = 10)
#> # A tibble: 3 × 4
#>    row1               time1  row2               time2
#>   <int>              <dttm> <int>              <dttm>
#> 1     1 2016-11-14 20:00:00     1 2016-11-14 20:00:05
#> 2     2 2016-11-14 21:00:00     3 2016-11-14 21:00:05
#> 3     3 2016-11-14 22:00:00     2 2016-11-14 22:00:10

# at most one second apart
difference_inner_join(d1, d2, by = c(time1 = "time2"), max_dist = 1)
#> # A tibble: 0 × 4
#> # ... with 4 variables: row1 <int>, time1 <dttm>, row2 <int>, time2 <dttm>

# at most an hour apart
difference_inner_join(d1, d2, by = c(time1 = "time2"), max_dist = 60 * 60)
#> # A tibble: 5 × 4
#>    row1               time1  row2               time2
#>   <int>              <dttm> <int>              <dttm>
#> 1     1 2016-11-14 20:00:00     1 2016-11-14 20:00:05
#> 2     2 2016-11-14 21:00:00     1 2016-11-14 20:00:05
#> 3     2 2016-11-14 21:00:00     3 2016-11-14 21:00:05
#> 4     3 2016-11-14 22:00:00     2 2016-11-14 22:00:10
#> 5     3 2016-11-14 22:00:00     3 2016-11-14 21:00:05
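As an aside, that "somewhat different syntax" can be approximated today by giving fuzzy_inner_join one match_fun per column pair. A minimal sketch, assuming a hypothetical lookup table with start and end columns (this is a workaround, not a documented interval-join interface):

library(readr)
library(fuzzyjoin)

# hypothetical lookup table: each row covers a one-hour window
lookup <- read_csv("label,start,end
a,2016-11-14 20:00:00,2016-11-14 20:59:59
b,2016-11-14 21:00:00,2016-11-14 21:59:59
c,2016-11-14 22:00:00,2016-11-14 22:59:59")

# keep rows of d2 whose time2 falls inside a lookup window;
# the match_fun list is applied pairwise to the columns in `by`
fuzzy_inner_join(d2, lookup,
                 by = c("time2" = "start", "time2" = "end"),
                 match_fun = list(`>=`, `<=`))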
DataStrategist commented 7 years ago

It's true, and difference_inner_join is close... but in fact, it would be enough if we could just specify to always round down. That way, it would pick the nearest timestamp (as long as it's greater). If that condition were applied AND the timestamps are arranged chronologically, it would be sufficient, no?
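For reference, here is one way that "round down" behaviour could be approximated with the current functions. A sketch, not a built-in option, and it assumes "round down" means matching each time2 to the most recent time1 that does not exceed it; it first pairs every earlier timestamp and then keeps the nearest one, so it can be wasteful on large tables:

library(dplyr)
library(fuzzyjoin)

# for each time2, keep only the latest time1 that is not after it
# (the nearest earlier timestamp)
fuzzy_inner_join(d2, d1,
                 by = c("time2" = "time1"),
                 match_fun = `>=`) %>%   # keep pairs where time2 >= time1
  group_by(row2) %>%
  filter(time1 == max(time1)) %>%
  ungroup()

data.table's rolling joins (roll = TRUE) express the same idea more directly, though that steps outside fuzzyjoin.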