djvanderlaan / lvec

Handling larger than memory vectors in R - core package

Using lvec with stringdist #16

Open loukesio opened 2 years ago

loukesio commented 2 years ago

Dear djvanderlaan,

Congratulations on lvec. I only learned about it today, and I have a question: how can I combine lvec efficiently with stringdist?

I saw this neat snippet in another comment:

library(lvec)
library(stringdist)

a <- sample(c("jan", "pier", "tjorres", "korneel"), 1E3, replace = TRUE)
b <- sample(c("jan", "pier", "joris", "corneel"), 1E2, replace = TRUE)

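# Split the indices of a into (start, end) chunks of roughly 10 elements each
chunks <- lvec::chunk(a, chunk_size = 1E1)

# For each chunk, compare its elements of a with all of b and keep close pairs
dist <- lapply(chunks, function(chunk, a, b, threshold, ...) {
  i <- seq(chunk[1], chunk[2])   # indices into a covered by this chunk
  j <- seq_along(b)              # all indices into b
  res <- expand.grid(i=i, j=j)   # every (i, j) combination for this chunk
  res$dist <- stringdist(a[res$i], b[res$j])
  res <- res[res$dist <= threshold, ]
  res
}, a=a, b=b, threshold = 2)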

dist <- do.call(rbind, dist)

This is pretty neat, @djvanderlaan. How would your approach work if I have only a single vector, for example:

library(tidyverse)
library(stringdist)
#> 
#> Attaching package: 'stringdist'
#> The following object is masked from 'package:tidyr':
#> 
#>     extract

vec <- c("apple","aple","banan","bananan")
stringdistmatrix(vec, useNames = "strings")
#>         apple aple banan
#> aple        1           
#> banan       5    4      
#> bananan     6    6     2

Created on 2022-03-01 by the reprex package (v2.0.1)

I want to compare all elements of this vector with each other, pairwise, the way stringdistmatrix does.
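
To make the question concrete, here is a rough sketch of how the chunked snippet might be adapted to a single vector. It is not a confirmed lvec recipe: it assumes `lvec::chunk` returns (start, end) index pairs for `vec` just as it does for `a` in the snippet above. The names `pairs` and `x`, and the values `chunk_size = 2` and `threshold = 2`, are illustrative choices only. The idea is to compare each chunk against the elements from the chunk start onward and keep only pairs with `i < j`, so every unordered pair is computed once and self-comparisons are dropped.

```r
library(lvec)
library(stringdist)

vec <- c("apple", "aple", "banan", "bananan")

# Split the indices of vec into (start, end) chunks, as in the snippet above.
# chunk_size = 2 is only for illustration; a real run would use a larger value.
chunks <- lvec::chunk(vec, chunk_size = 2)

pairs <- lapply(chunks, function(chunk, x, threshold) {
  i <- seq(chunk[1], chunk[2])     # indices covered by this chunk
  j <- seq(chunk[1], length(x))    # only elements at or after the chunk start
  res <- expand.grid(i = i, j = j)
  # Keep each unordered pair once and drop comparisons of an element with itself.
  res <- res[res$i < res$j, ]
  res$dist <- stringdist(x[res$i], x[res$j])
  res[res$dist <= threshold, ]
}, x = vec, threshold = 2)

# Combine the per-chunk results into one data frame with columns i, j and dist.
pairs <- do.call(rbind, pairs)
```

With the four strings above and a threshold of 2, this should keep the apple/aple and banan/bananan pairs, matching the distances of 1 and 2 in the stringdistmatrix output. Dropping the threshold filter would return all pairs, but the result then grows quadratically with the length of the vector, which is presumably why the original snippet filters inside each chunk.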