dgrtwo / fuzzyjoin

Join tables together on inexact matching

support for dplyr-style `by` arguments #24

Closed (holgerbrandl closed this issue 7 years ago)

holgerbrandl commented 7 years ago

dplyr allows joining on columns whose names differ between the left-hand and right-hand tables by passing a named vector as the `by` argument. It would be great if genome_join/interval_join (and their variants) could support the same syntax.

Example:

require(dplyr)
require(tidyr)
require(fuzzyjoin)

arrayRes = structure(list(
    phopho_position_start = c(862L, 426L, 518L, 556L, 519L, 127L),
    phopho_position_end = c(863L, 427L, 519L, 557L, 520L, 128L),
    ensembl_peptide_id = c("ENSP00000363708", "ENSP00000376679", "ENSP00000321606", "ENSP00000264229", "ENSP00000307093", "ENSP00000358142" ),
    external_gene_name = c("BMPR2", "ABLIM1", "CRMP1", "KIAA1211", "MAP6", "SV2A")
), .Names = c("phopho_position_start", "phopho_position_end", "ensembl_peptide_id", "external_gene_name")) %>% as_tibble()  # tbl_df() is deprecated in current dplyr

hmmerSearchRes = structure(list(
    target_name = c("ENSP00000385014", "ENSP00000385014", "ENSP00000385014", "ENSP00000233057", "ENSP00000233057", "ENSP00000233057" ),
    ali_from = c(248L, 311L, 399L, 248L, 352L, 440L), ali_to = c(262L, 318L, 412L, 262L, 359L, 453L),
    hmm_coverage = c(0.933333333333333, 0.466666666666667, 0.866666666666667, 0.933333333333333, 0.466666666666667, 0.866666666666667)),
.Names = c("target_name", "ali_from", "ali_to", "hmm_coverage")) %>% as_tibble()  # tbl_df() is deprecated in current dplyr

genome_join(arrayRes, hmmerSearchRes, by=c("ensembl_peptide_id"="target_name", "phopho_position_start"="ali_from", "phopho_position_end"="ali_to"))

tsibley commented 7 years ago

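@holgerbrandl It's not documented, but you can already do this by passing a list with `x` and `y` components to `by`, where `x` holds the column names from the left table and `y` the matching column names from the right table: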

genome_join(
  arrayRes,
  hmmerSearchRes,
  by = list(
    x = c("ensembl_peptide_id", "phopho_position_start", "phopho_position_end"),
    y = c("target_name", "ali_from", "ali_to")
  )
)
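
For anyone who prefers to keep writing the dplyr-style named vector, the conversion to the list form above can be sketched in base R. `as_by_list` is a hypothetical helper name, not a fuzzyjoin function, and it assumes dplyr's convention that the names refer to the left table and the values to the right table:

```r
# Hypothetical helper (not part of fuzzyjoin): convert a dplyr-style `by`
# vector into the list(x = ..., y = ...) form used in the call above.
as_by_list <- function(by) {
  y <- unname(by)   # values: column names in the right-hand table
  x <- names(by)    # names: column names in the left-hand table
  if (is.null(x)) {
    # Fully unnamed vector: same column names on both sides.
    x <- y
  } else {
    # Partially named vector: unnamed entries use the same name on both sides.
    x[x == ""] <- y[x == ""]
  }
  list(x = x, y = y)
}

by_list <- as_by_list(c(
  "ensembl_peptide_id"    = "target_name",
  "phopho_position_start" = "ali_from",
  "phopho_position_end"   = "ali_to"
))
str(by_list)
```

The result could then be passed as genome_join(arrayRes, hmmerSearchRes, by = by_list), relying on the same undocumented behaviour described above.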
holgerbrandl commented 7 years ago

Cool, thanks for the info and the great package!

tsibley commented 7 years ago

@dgrtwo is the author, not me! :-)