Closed Carenser closed 5 years ago
Exactly what I need as well. Just tried this:
x = fuzzy_left_join(x, y, by = c("FullName" = "Name", "x_event_date"= "StartDate", "x_event_date" = "EndDate"), match_fun = list('==', '>=', '<'))
and got this error:
Error in mf(rep(u_x, n_y), rep(u_y, each = n_x), ...) : could not find function "mf"
I think it's because you're using simple quote in your match_fun list. You should use this character to bord your operators -> `
I'm talking about something like this : x = interval_left_join(x, y, by = c("FullName" = "Name", "startEventDate"= "StartDate", "endEventDate" = "EndDate")) But it's actually not possible to join intervals with more than start and end columns.
Good catch on the match_fun list. When I changed the single quotes to the back tick `, my example worked!
As for the -> vs. = , I don't think it should matter. Our groups coding style decided on the = which is why I use it. I get the same results using either assignment operator.
@MichaelSzczepaniak thanks for pointing me here on stackoverflow.
Are you saying that changing the single quotes to backticks resolved all issues?
Like this?
x = fuzzy_left_join(x, y, by = c("FullName" = "Name", "x_event_date"= "StartDate", "x_event_date" = "EndDate"), matchfun = list((backticks here)_==
, >=
, <
))
Short answer: Yes. Long answer: Can't say it resolved all issue, but can say it resolved my issue. My example does a fuzzy join on three fields and it gave me the expected results after I changed my single quotes to back ticks.
FYI, if what you're looking for is a group that has to match exactly plus an interval that has to overlap, you can do this with genome_join
. This is designed for the case of chromosomes matching along with intervals overlapping, but it should work even if the group wasn't a chromosome, e.g.:
genome_left_join(x, y, by = c("FullName" = "Name", "x_event_date"= "StartDate", "x_event_date" = "EndDate")
(I haven't tested this with dates so you may need to turn it into an integer first). Like interval_join
, this gets the speed advantages from the IRanges package, so it may be a lot faster than the custom matching described above.
If it needs to be more complicated, you can use the custom matching logic described in this issue.
I can confirm that this method works after using as.numeric
on the start and end date variables and that it is much faster than methods I previously used, such as
fuzzy_left_join(x, y, by = c("id" = "id", "end_date" = "interval_date"), match_fun = list(`==`, `%within%`))
OR
fuzzy_left_join(x, y, by = c("id" = "id", "interval_date1" = "interval_date2"), match_fun = list(`==`, `int_overlaps`))
which in my experience took up to 20 minutes and did not successfully join all cases.
Short answer: Yes. Long answer: Can't say it resolved all issue, but can say it resolved my issue. My example does a fuzzy join on three fields and it gave me the expected results after I changed my single quotes to back ticks.
Same here!
Hello everyone, thanks for sharing your answers here.
fuzzy_left_join(x, y, by = c("FullName" = "Name",
"x_event_date"= "StartDate", "x_event_date" = "EndDate"),
match_fun = list(`==`, `>=`, `<`))
Did anyone look at extending the interval?
from "x_event_date" >= "StartDate"
to "x_event_date" >= "StartDate" - 1
from "x_event_date" >= "EndDate"
to "x_event_date" < "EndDate" + 3
Is there any argument already available or need to write a custom match_fun
?
Edit: Something like this as a custom function?
match_fun_left <- function(v1, v2) {
ret <- data.frame(include = (v1 >= v2 - 1))
ret
}
match_fun_right <- function(v1, v2) {
ret <- data.frame(include = (v1 < v2 + 3))
ret
}
fuzzy_left_join(x, y, by = c("FullName" = "Name",
"x_event_date"= "StartDate", "x_event_date" = "EndDate"),
match_fun = list(`==`, match_fun_left, match_fun_right))
It would be great if the interval join methods in fuzzyjoin would also run with more than 2 columns start and end, like a group for example.