dgrtwo / fuzzyjoin

Join tables together on inexact matching
Other
664 stars 62 forks source link

Using interval join methods with more than 2 columns (start and end) #50

Closed Carenser closed 5 years ago

Carenser commented 5 years ago

It would be great if the interval join methods in fuzzyjoin would also run with more than 2 columns start and end, like a group for example.

MichaelSzczepaniak commented 5 years ago

Exactly what I need as well. Just tried this:

x = fuzzy_left_join(x, y, by = c("FullName" = "Name", "x_event_date"= "StartDate", "x_event_date" = "EndDate"), match_fun = list('==', '>=', '<'))

and got this error:

Error in mf(rep(u_x, n_y), rep(u_y, each = n_x), ...) : could not find function "mf"

Carenser commented 5 years ago

I think it's because you're using simple quote in your match_fun list. You should use this character to bord your operators -> `

I'm talking about something like this : x = interval_left_join(x, y, by = c("FullName" = "Name", "startEventDate"= "StartDate", "endEventDate" = "EndDate")) But it's actually not possible to join intervals with more than start and end columns.

MichaelSzczepaniak commented 5 years ago

Good catch on the match_fun list. When I changed the single quotes to the back tick `, my example worked!

As for the -> vs. = , I don't think it should matter. Our groups coding style decided on the = which is why I use it. I get the same results using either assignment operator.

ahcyip commented 5 years ago

@MichaelSzczepaniak thanks for pointing me here on stackoverflow. Are you saying that changing the single quotes to backticks resolved all issues? Like this? x = fuzzy_left_join(x, y, by = c("FullName" = "Name", "x_event_date"= "StartDate", "x_event_date" = "EndDate"), matchfun = list((backticks here)_==, >=, <))

(https://stackoverflow.com/questions/44383510/r-passing-arguments-into-the-match-fun-function-in-fuzzyjoinfuzzy-join/44872674#comment94283973_44872674)

MichaelSzczepaniak commented 5 years ago

Short answer: Yes. Long answer: Can't say it resolved all issue, but can say it resolved my issue. My example does a fuzzy join on three fields and it gave me the expected results after I changed my single quotes to back ticks.

dgrtwo commented 5 years ago

FYI, if what you're looking for is a group that has to match exactly plus an interval that has to overlap, you can do this with genome_join. This is designed for the case of chromosomes matching along with intervals overlapping, but it should work even if the group wasn't a chromosome, e.g.:

genome_left_join(x, y, by = c("FullName" = "Name", "x_event_date"= "StartDate", "x_event_date" = "EndDate")

(I haven't tested this with dates so you may need to turn it into an integer first). Like interval_join, this gets the speed advantages from the IRanges package, so it may be a lot faster than the custom matching described above.

If it needs to be more complicated, you can use the custom matching logic described in this issue.

HaydenMacDonald commented 5 years ago

I can confirm that this method works after using as.numeric on the start and end date variables and that it is much faster than methods I previously used, such as

fuzzy_left_join(x, y, by = c("id" = "id", "end_date" = "interval_date"), match_fun = list(`==`, `%within%`))

OR

fuzzy_left_join(x, y, by = c("id" = "id", "interval_date1" = "interval_date2"), match_fun = list(`==`, `int_overlaps`))

which in my experience took up to 20 minutes and did not successfully join all cases.

Armandjg commented 5 years ago

Short answer: Yes. Long answer: Can't say it resolved all issue, but can say it resolved my issue. My example does a fuzzy join on three fields and it gave me the expected results after I changed my single quotes to back ticks.

Same here!

peranti commented 3 years ago

Hello everyone, thanks for sharing your answers here.

fuzzy_left_join(x, y, by = c("FullName" = "Name", 
                             "x_event_date"= "StartDate", "x_event_date" = "EndDate"), 
                match_fun = list(`==`, `>=`, `<`))

Did anyone look at extending the interval?

from "x_event_date" >= "StartDate" to "x_event_date" >= "StartDate" - 1 from "x_event_date" >= "EndDate" to "x_event_date" < "EndDate" + 3

Is there any argument already available or need to write a custom match_fun?

Edit: Something like this as a custom function?

match_fun_left <- function(v1, v2) {
  ret <- data.frame(include = (v1 >= v2 - 1))
  ret
}

match_fun_right <- function(v1, v2) {
  ret <- data.frame(include = (v1 < v2 + 3))
  ret
}

fuzzy_left_join(x, y, by = c("FullName" = "Name", 
                             "x_event_date"= "StartDate", "x_event_date" = "EndDate"), 
                match_fun = list(`==`, match_fun_left, match_fun_right))