beniaminogreen / zoomerjoin

Superlatively-fast fuzzy-joins in R
https://beniamino.org/zoomerjoin/
GNU General Public License v3.0

Accuracy of memory usage? #84

Open · etiennebacher opened this issue 10 months ago

etiennebacher commented 10 months ago

Hi, I just found this package, it looks cool and super useful!

One thing I noticed in the benchmarks is how low the memory usage is. I know that Rust is generally more efficient in both speed and memory, but I also think the reported memory numbers might be inaccurate. From ?profmem:

[...] nearly all memory allocations done in R are logged. Neither memory deallocations nor garbage collection events are logged. Furthermore, allocations done by non-R native libraries or R packages that use native code Calloc() / Free() for internal objects are also not logged.

I suspect that a lot of the memory allocations are done not in R but in Rust, and that the memory usage is actually higher than reported. I run into the same thing when I benchmark polars and tidypolars, so I'd be interested to hear if you find a workaround 😉

Just to give you an example: in polars, when I take the mean of a column in a data frame with 100_000, 1_000_000, 10_000_000, or 100_000_000 rows, R reports the same (tiny) memory allocation, but I clearly see a spike in the Windows Task Manager:

library(polars)

# Benchmark the same polars operation at increasing row counts;
# note that mem_alloc only reflects allocations made by R itself
bench::press(
  rows = c(1e5, 1e6, 1e7, 1e8),
  {
    dat <- pl$DataFrame(
      a = rnorm(rows),
      b = rnorm(rows),
      c = rnorm(rows)
    )
    bench::mark(
      dat$with_columns(y = pl$col("a")$mean())
    )
  }
)
#> # A tibble: 4 × 7
#>   expression                  rows     min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                 <dbl> <bch:t> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "dat$with_columns(y = pl$…   1e5 557.8µs  739.5µs   1218.      3.81KB        0
#> 2 "dat$with_columns(y = pl$…   1e6   3.4ms   3.95ms    123.      3.81KB        0
#> 3 "dat$with_columns(y = pl$…   1e7    38ms  42.06ms     14.7     3.81KB        0
#> 4 "dat$with_columns(y = pl$…   1e8   526ms 526.01ms      1.90    3.81KB        0

I was told there's a Linux tool that gives more accurate benchmarks when calling other languages from R, but I don't remember its name; I'll update this post if I find it.
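In the meantime, here's a rough cross-check that works on Linux (just a sketch, and not the tool I was thinking of): read the R process's peak resident set size (VmHWM) from /proc/self/status. Unlike profmem/bench, that number includes memory allocated by native code such as polars' Rust engine.

library(polars)

# Sketch (Linux only): peak resident set size of this R process, read
# from /proc/self/status. VmHWM includes allocations made by native
# code (e.g. polars' Rust engine), which profmem/bench never see.
peak_rss_mb <- function() {
  line <- grep("^VmHWM:", readLines("/proc/self/status"), value = TRUE)
  as.numeric(gsub("[^0-9]", "", line)) / 1024  # kB -> MB
}

peak_rss_mb()
dat <- pl$DataFrame(a = rnorm(1e7), b = rnorm(1e7), c = rnorm(1e7))
invisible(dat$with_columns(y = pl$col("a")$mean()))
peak_rss_mb()  # the increase reflects R- and Rust-side allocations alike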

beniaminogreen commented 10 months ago

Hi there, thanks for flagging this. I was worried about the same problem when I was writing the benchmarks for the package and thought I was measuring both R + Rust memory allocations, but I will look at this more closely given the example you provide.

In my benchmarks, the memory usage does increase linearly with the size of the input, but the graphs make it look like it uses almost no memory at all because the other package has quadratic memory scaling. The linear scaling helped convince me that the benchmarks were accurate, but I will look into this over the weekend to make sure that the memory usage I see isn't just the data being copied / subsetted in R before it's sent to Rust.
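One way I might check that last point (a sketch using profmem directly rather than bench, on a DIME sample like the one in the package benchmarks): profmem records a call trace for every allocation, so if most of the logged bytes trace back to the data being subsetted or copied in R, that would suggest the reported numbers are mostly the R-side copy rather than the Rust-side work.

library(zoomerjoin)
library(profmem)

# Sketch: profile a single join and inspect where the R-side
# allocations come from.
a <- as.data.frame(dplyr::sample_n(dime_data, 2000))
b <- as.data.frame(dplyr::sample_n(dime_data, 2000))
names(a) <- c("id_1", "name")
names(b) <- c("id_2", "name")

p <- profmem(
  jaccard_inner_join(a, b, by = "name", band_width = 11,
                     n_bands = 350, threshold = .7, n_gram_width = 4)
)

total(p)     # total bytes logged by R's allocation profiler
head(p, 20)  # each row carries the R call stack that allocated it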

If the memory results aren't accurate, I'll let you know if I find a workaround, or I'll remove them if I can't.

Best, Ben

etiennebacher commented 10 months ago

Indeed, the memory usage increases linearly, thanks for the clarification. I'll leave this issue open in case you want to investigate further.

library(zoomerjoin)
library(tidyverse)
library(profmem)

# Sample a million rows from the DIME dataset
data_1 <- as.data.frame(sample_n(dime_data, 10^6))
names(data_1) <- c("id_1", "name")
data_2 <- as.data.frame(sample_n(dime_data, 10^6))
names(data_2) <- c("id_2", "name")

# Benchmark the join on increasing subsets of the sampled rows
benches <- bench::press(
  n = seq(500, 4000, 250),
  bench::mark(
    jaccard_inner_join(data_1[1:n, ], data_2[1:n, ],
                       by = "name", band_width = 11,
                       n_bands = 350, threshold = .7,
                       n_gram_width = 4
    )
  )
)

# Plot R-reported memory allocations against input size
benches |> 
  select(n, mem_alloc) |> 
  ggplot(aes(n, mem_alloc)) +
  geom_point() +
  geom_line()

[Plot of mem_alloc against n from the code above, showing linear growth]
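As a quick numeric sanity check of the linear scaling (a sketch reusing the benches object above), the R-side bytes allocated per input row should be roughly constant across sizes:

# Bytes allocated (as seen by R) per input row; roughly constant
# values across n are consistent with linear scaling.
benches |>
  mutate(bytes_per_row = as.numeric(mem_alloc) / n) |>
  select(n, mem_alloc, bytes_per_row)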