dgrtwo / fuzzyjoin

Join tables together on inexact matching

distance_join and geo_join examples #5

Closed maelle closed 8 years ago

maelle commented 8 years ago

Hi,

I cannot get the examples to work.

First I had to write data("state") instead of data("state.center").

Then I get the following output when running the code below:

Joining by: c("longitude", "latitude")
Error: No variables selected

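library(dplyr)      # data_frame(), rename(), filter(), %>%
library(fuzzyjoin)  # geo_inner_join()

data("state")  # provides state.name and state.center

# One row per US state with the coordinates of its center
states <- data_frame(state = state.name,
                     longitude = state.center$x,
                     latitude = state.center$y)

s1 <- rename(states, state1 = state)
s2 <- rename(states, state2 = state)

# Pair up distinct states whose centers lie within max_dist of each other
pairs <- s1 %>%
  geo_inner_join(s2, max_dist = 200) %>%
  filter(state1 != state2)

dgrtwo commented 8 years ago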

That's odd, I can't reproduce the error. Could you share the results of sessionInfo() and also of traceback() right after the error?

maelle commented 8 years ago

I hope I'm not bothering you with a stupid mistake on my side. ;-)

Traceback

33: stop("No variables selected", call. = FALSE)
32: distinct_vars(.data, ..., .dots = .dots, .keep_all = .keep_all)
31: distinct_.data.frame(dplyr::select_(data, .dots = group_cols))
30: NextMethod()
29: as_data_frame(data)
28: tbl_df(NextMethod())
27: distinct_.tbl_df(dplyr::select_(data, .dots = group_cols))
26: dplyr::distinct_(dplyr::select_(data, .dots = group_cols))
25: nest_impl(data, key_col, group_cols, nest_cols)
24: nest_.data.frame(data, key_col, nest_cols)
23: NextMethod()
22: as_data_frame(data)
21: dplyr::tbl_df(NextMethod())
20: nest_.tbl_df(data, key_col, nest_cols)
19: nest_(data, key_col, nest_cols)
18: tidyr::nest(., indices)
17: function_list[[i]](value)
16: freduce(value, `_function_list`)
15: `_fseq`(`_lhs`)
14: eval(expr, envir, enclos)
13: eval(quote(`_fseq`(`_lhs`)), env, env)
12: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
11: x %>% dplyr::select_(.dots = by$x) %>% dplyr::mutate(indices = seq_len(nrow(x))) %>% 
        tidyr::nest(indices) %>% dplyr::mutate(indices = purrr::map(data, 
        "indices"))
10: fuzzy_join(x, y, multi_by = by, multi_match_fun = match_fun, 
        mode = mode)
9: geo_join(x, y, by, max_dist = max_dist, method = method, mode = "inner")
8: geo_inner_join(., s2, max_dist = 200)
7: function_list[[i]](value)
6: freduce(value, `_function_list`)
5: `_fseq`(`_lhs`)
4: eval(expr, envir, enclos)
3: eval(quote(`_fseq`(`_lhs`)), env, env)
2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1: s1 %>% geo_inner_join(s2, max_dist = 200) %>% filter(state1 != 
       state2)

Session Info

R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
[5] LC_TIME=Spanish_Spain.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] dplyr_0.4.3.9001 fuzzyjoin_0.1

loaded via a namespace (and not attached):
[1] lazyeval_0.1.10.9000 magrittr_1.5 R6_2.1.2 assertthat_0.1 tools_3.2.2 DBI_0.3.1.9008 tibble_1.0-1
[8] Rcpp_0.12.4 stringi_1.0-1 stringr_1.0.0 tidyr_0.4.1

maelle commented 8 years ago

I must be doing something wrong. This is what I get for another type of join ("normal" dplyr joins work):

> # but we can regex match them
> diamonds %>%
+  regex_inner_join(d, by = c(cut = "regex_name"))
Error: No variables selected
> traceback()
38: stop("No variables selected", call. = FALSE)
37: distinct_vars(.data, ..., .dots = .dots, .keep_all = .keep_all)
36: distinct_.data.frame(dplyr::select_(data, .dots = group_cols))
35: NextMethod()
34: as_data_frame(data)
33: tbl_df(NextMethod())
32: distinct_.tbl_df(dplyr::select_(data, .dots = group_cols))
31: dplyr::distinct_(dplyr::select_(data, .dots = group_cols))
30: nest_impl(data, key_col, group_cols, nest_cols)
29: nest_.data.frame(data, key_col, nest_cols)
28: NextMethod()
27: as_data_frame(data)
26: dplyr::tbl_df(NextMethod())
25: nest_.tbl_df(data, key_col, nest_cols)
24: nest_(data, key_col, nest_cols)
23: tidyr::nest(., indices)
22: function_list[[i]](value)
21: freduce(value, `_function_list`)
20: `_fseq`(`_lhs`)
19: eval(expr, envir, enclos)
18: eval(quote(`_fseq`(`_lhs`)), env, env)
17: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
16: dplyr::data_frame(col = col_x, indices = seq_along(col_x)) %>% 
        tidyr::nest(indices) %>% dplyr::mutate(indices = purrr::map(data, 
        "indices"))
15: FUN(X[[i]], ...)
14: lapply(seq_along(by$x), function(i) {
        col_x <- x[[by$x[i]]]
        col_y <- y[[by$y[i]]]
        indices_x <- dplyr::data_frame(col = col_x, indices = seq_along(col_x)) %>% 
            tidyr::nest(indices) %>% dplyr::mutate(indices = purrr::map(data, 
            "indices"))
        indices_y <- dplyr::data_frame(col = col_y, indices = seq_along(col_y)) %>% 
            tidyr::nest(indices) %>% dplyr::mutate(indices = purrr::map(data, 
            "indices"))
        u_x <- indices_x$col
        u_y <- indices_y$col
        if (!is.null(names(match_fun))) {
            mf <- match_fun[[by$x[[i]]]]
        }
        else {
            mf <- match_fun[[i]]
        }
        m <- outer(u_x, u_y, mf, ...)
        w <- which(m) - 1
        if (length(w) == 0) {
            ret <- dplyr::data_frame(i = numeric(0), x = numeric(0), 
                y = numeric(0))
            return(ret)
        }
        n_x <- length(u_x)
        x_indices_l <- indices_x$indices[w%%n_x + 1]
        y_indices_l <- indices_y$indices[w%/%n_x + 1]
        xls <- purrr::map_dbl(x_indices_l, length)
        yls <- purrr::map_dbl(y_indices_l, length)
        x_rep <- unlist(purrr::map2(x_indices_l, yls, function(x, 
            y) rep(x, each = y)))
        y_rep <- unlist(purrr::map2(y_indices_l, xls, function(y, 
            x) rep(y, x)))
        dplyr::data_frame(i = i, x = x_rep, y = y_rep)
    })
13: list_or_dots(...)
12: dplyr::bind_rows(lapply(seq_along(by$x), function(i) {
        col_x <- x[[by$x[i]]]
        col_y <- y[[by$y[i]]]
        indices_x <- dplyr::data_frame(col = col_x, indices = seq_along(col_x)) %>% 
            tidyr::nest(indices) %>% dplyr::mutate(indices = purrr::map(data, 
            "indices"))
        indices_y <- dplyr::data_frame(col = col_y, indices = seq_along(col_y)) %>% 
            tidyr::nest(indices) %>% dplyr::mutate(indices = purrr::map(data, 
            "indices"))
        u_x <- indices_x$col
        u_y <- indices_y$col
        if (!is.null(names(match_fun))) {
            mf <- match_fun[[by$x[[i]]]]
        }
        else {
            mf <- match_fun[[i]]
        }
        m <- outer(u_x, u_y, mf, ...)
        w <- which(m) - 1
        if (length(w) == 0) {
            ret <- dplyr::data_frame(i = numeric(0), x = numeric(0), 
                y = numeric(0))
            return(ret)
        }
        n_x <- length(u_x)
        x_indices_l <- indices_x$indices[w%%n_x + 1]
        y_indices_l <- indices_y$indices[w%/%n_x + 1]
        xls <- purrr::map_dbl(x_indices_l, length)
        yls <- purrr::map_dbl(y_indices_l, length)
        x_rep <- unlist(purrr::map2(x_indices_l, yls, function(x, 
            y) rep(x, each = y)))
        y_rep <- unlist(purrr::map2(y_indices_l, xls, function(y, 
            x) rep(y, x)))
        dplyr::data_frame(i = i, x = x_rep, y = y_rep)
    }))
11: fuzzy_join(x, y, by = by, match_fun = match_fun, mode = mode)
10: regex_join(x, y, by, mode = "inner")
9: regex_inner_join(., d, by = c(cut = "regex_name"))
8: function_list[[k]](value)
7: withVisible(function_list[[k]](value))
6: freduce(value, `_function_list`)
5: `_fseq`(`_lhs`)
4: eval(expr, envir, enclos)
3: eval(quote(`_fseq`(`_lhs`)), env, env)
2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1: diamonds %>% regex_inner_join(d, by = c(cut = "regex_name"))

maelle commented 8 years ago

Just saw this: https://gist.github.com/mdsumner/13ebc4240fba6aedd9a2186ec476ca25 so maybe it is a dplyr problem (I had first written tidyr, which was wrong).

maelle commented 8 years ago

and this http://stackoverflow.com/questions/36191236/r-dplyr-distinct-accross-all-columns

maelle commented 8 years ago

Seems like this https://github.com/hadley/dplyr/commit/dbc811fcaac2d583836dd02b43934e7b265c1c06 is the problem. The changed dplyr::distinct_ is called by tidyr::nest, as the tracebacks above show.

maelle commented 8 years ago

Well, my minimal reproducible example is actually expand(mtcars, nesting(vs, cyl)), so it's not a fuzzyjoin issue, sorry!
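
For reference, that reduced case can be run on its own. This is a minimal sketch, assuming a CRAN release of tidyr is installed; the names combos and base_combos and the base R cross-check are illustrative additions, not part of the original report.

```r
library(tidyr)

# The reduced reproduction from this thread: no fuzzyjoin involved.
# nesting() restricts expand() to the (vs, cyl) pairs present in the data.
combos <- expand(mtcars, nesting(vs, cyl))
print(combos)

# Illustrative cross-check in base R: the same distinct pairs
base_combos <- unique(mtcars[, c("vs", "cyl")])
stopifnot(nrow(combos) == nrow(base_combos))
```

According to the report above, this same expand() call is where "No variables selected" appeared with the development dplyr listed in the session info.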

maelle commented 8 years ago

https://github.com/hadley/tidyr/issues/178

maelle commented 8 years ago

Next time I'll google a little bit more before opening an issue ;-)

dgrtwo commented 8 years ago

No, this was very helpful!

It looks like this only affects the development version of dplyr and has been fixed before dplyr 0.5 is submitted to CRAN, so happily it won't be an issue (fuzzyjoin is in the queue at CRAN now!)
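
To check whether an installed package is a development build like the dplyr_0.4.3.9001 in the session info above, something along these lines may help. The helper is_dev_version() is hypothetical, and it assumes the common convention that development versions carry a fourth version component of 9000 or higher.

```r
# Hypothetical helper: TRUE when a version string looks like a development
# build, assuming the ".9000 and up" fourth-component convention.
is_dev_version <- function(version) {
  parts <- unclass(package_version(version))[[1]]
  length(parts) >= 4 && parts[4] >= 9000
}

is_dev_version("0.4.3.9001")  # TRUE
is_dev_version("0.5.0")       # FALSE

# Apply it to the installed dplyr, if there is one
if (requireNamespace("dplyr", quietly = TRUE)) {
  print(is_dev_version(as.character(packageVersion("dplyr"))))
}
```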

maelle commented 8 years ago

Of course it was useful for you to know about the bug, but I could have avoided doing the whole search for the cause here. :laughing:

Good luck with the CRAN submission! :boom: