fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

High memory allocation on subsetting fst object #231

Open renkun-ken opened 4 years ago

renkun-ken commented 4 years ago

Consider the following example: I create a data.table of 1.22 GB and save it to disk. Reading it back with fst::read_fst() does not allocate more memory than the data itself. However, subsetting the fst object causes memory allocations of twice or even triple the size of the original data, as in the following example.

library(data.table)
n <- 5000000
dt <- data.table(group = sample(1:3, n, replace = TRUE), key = "group")
for (i in 1:30) {
  dt[, paste0("x", i) := rnorm(.N)]
}
pryr::object_size(dt)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> 1.22 GB
file <- tempfile(fileext = ".fst")
fst::write_fst(dt, file, compress = 100)
d1 <- fst::read_fst(file, as.data.table = TRUE)
pryr::object_size(d1)
#> 1.22 GB
d2 <- fst::fst(file)
d2 <- d2[d2$group %in% 1:3, ]
pryr::object_size(d2)
#> 1.22 GB
d3a <- dt[group == 1]
pryr::object_size(d3a)
#> 407 MB
d3b <- fst::fst(file)
d3b <- d3b[d3b$group == 1, ]
pryr::object_size(d3b)
#> 407 MB
setDT(d3b, key = "group")
all.equal(d3a, d3b)
#> [1] TRUE

[memory profiling screenshot]

Here's a more extreme version, where the memory allocated when subsetting 1% of the data can be 5x - 10x the size of the data to be selected.

library(data.table)
n <- 5000000
dt <- data.table(group = sample(1:100, n, replace = TRUE), key = "group")
for (i in 1:30) {
  dt[, paste0("x", i) := rnorm(.N)]
}
pryr::object_size(dt)
#> Registered S3 method overwritten by 'pryr':
#>   method      from
#>   print.bytes Rcpp
#> 1.22 GB
file <- tempfile(fileext = ".fst")
fst::write_fst(dt, file, compress = 0)
d1 <- fst::read_fst(file, as.data.table = TRUE)
pryr::object_size(d1)
#> 1.22 GB
d2 <- fst::fst(file)
pryr::object_size(d2$group)
#> 20 MB
d2 <- d2[d2$group %in% 1:50, ]
pryr::object_size(d2)
#> 610 MB
d3a <- dt[group == 1]
pryr::object_size(d3a)
#> 12.2 MB
d3b <- fst::fst(file)
d3b <- d3b[d3b$group == 50, ]
pryr::object_size(d3b)
#> 12.3 MB

[memory profiling screenshot]

In my practical use case, I'm dealing with 200-300 GB fst files from which I need to select a portion of the data, and the memory peak looks high.

I thought it was caused by compression but I tried compress = 0 and the results are very similar.

MarcusKlik commented 4 years ago

Hi @renkun-ken, thanks for your question!

Yes, at the moment the implementation of fst[i, j] uses read_fst() to retrieve the correct selection. In your example, you use a logical selector:

# select 1 percent of rows
i <- d2$group == 1

The current implementation basically does:

v <- which(i)

# this reads the entire table in the worst case scenario
res <- fst::read_fst(file, from = min(v), to = max(v))

So although the length of vector v might correspond to only 1 percent of the rows, it's really min(v) and max(v) that determine the actual memory needed for the selection operation.
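
As a quick illustration (using the fst object and keyed file from the second example; span() is just a throwaway helper, not part of the fst API), you can compare how many rows a selector picks with the row span that read_fst() would have to cover:

# compare the number of selected rows with the from/to span read_fst() would need
ft <- fst::fst(file)
span <- function(i) {
  v <- which(i)
  c(rows_selected = length(v), rows_spanned = max(v) - min(v) + 1L)
}
# keyed data: matching rows are contiguous, so the span ~ number of selected rows
span(ft$group == 50)
# first and last groups: only ~2% of rows selected, but the span covers ~the whole table
span(ft$group %in% c(1, 100))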

I fully agree that this is far from ideal. What we really need is the option to provide a logical vector to the underlying fstlib library and select the compressed chunks on the fly, so during loading of a column you would have:

  1. decompress a 16 kB chunk (4000 integers for example)
  2. use logical selector to make a selection of the data
  3. store only the selected data
  4. repeat for next 16 kB chunk

With such a setup, fst[i, j] will only use the memory required for the resulting table.
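
Until that exists in fstlib, the behaviour can be roughly approximated from R by reading the file in fixed-size row chunks and filtering each chunk before keeping it. This is only a sketch (read_fst_filtered, pred and chunk_rows are made-up names for illustration; the real gain would come from doing this on the compressed 16 kB blocks inside fstlib):

# rough R-level approximation of chunk-wise filtered reading
read_fst_filtered <- function(path, pred, chunk_rows = 100000L) {
  n <- nrow(fst::fst(path))
  starts <- seq(1L, n, by = chunk_rows)
  parts <- lapply(starts, function(from) {
    to <- min(from + chunk_rows - 1L, n)
    chunk <- fst::read_fst(path, from = from, to = to, as.data.table = TRUE)
    chunk[pred(chunk)]   # keep only the selected rows of this chunk
  })
  data.table::rbindlist(parts)
}
# peak memory ~ result size + one chunk, e.g.:
# res <- read_fst_filtered(file, function(ch) ch$group == 50)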

Going further, the logical selector (your i) could be stored in the offline object generated with fst(). That way, we could pipe selection operations on that object without actually loading the data. It would save a lot of memory if we defined a specialized object (logical_vec?) for that: like bit, but with the option to store NA's. Since R uses 32 bits per logical value, a 2-bit encoding would be 16 times smaller than a logical vector while carrying identical information.

(that would also be very useful to data.table and dplyr interface implementations...)
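
For a rough sense of the sizes involved with the 5,000,000-row example (the bit package is the one referred to above; the 2-bit figure is just arithmetic for the hypothetical logical_vec):

n <- 5000000
object.size(logical(n))   # ~20 MB: R stores each logical in 32 bits
object.size(bit::bit(n))  # ~625 kB: 1 bit per value, but no NA support
n * 2 / 8                 # bytes for a 2-bit encoding with NA: ~1.25 MB, 16x smaller than logical(n)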

thanks!

renkun-ken commented 4 years ago

@MarcusKlik Thanks for your detailed explanation.

In my example, dt is created with key = "group", which means group is stored in sorted (monotonic) order.

dt <- data.table(group = sample(1:100, n, replace = TRUE), key = "group")

In this case, selecting group == 50 should still read only a ~1% chunk, since all rows that satisfy the condition lie consecutively, so min(v) and max(v) should span little more than the necessary row indices. I'm not sure why, in such keyed cases, the memory allocation is still much larger than I expect?

MarcusKlik commented 4 years ago

Yes, you're very right, large memory allocations should not occur for ordered selections!

If I profile the 1 percent selection script, I get:

[memory profiling screenshot]

So, strangely enough, I'm not seeing the large allocations that you see (your second image)!

(I'm using a Win 10 system at the moment)

MarcusKlik commented 4 years ago

If I take the which() out of the call, the memory overhead is smaller still:

[memory profiling screenshot]

So the question is: why these extreme differences between the two systems?
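
One profiler-independent way to compare the peak on both systems is R's own gc statistics (gc(reset = TRUE) resets the "max used" counters); a minimal sketch using the 1 percent selection from above:

# reset the "max used" counters, run the selection, then read the peak back
gc(reset = TRUE)
ft <- fst::fst(file)
res <- ft[ft$group == 50, ]
gc()   # the "max used" column reports the peak Ncells/Vcells reached during the selection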

m-swirski commented 4 years ago

Hi, is resolving the issue with a logical vector for fstlib planned? It would be extremely useful to have an option of full random access instead of the fst::read_fst(file, from = min(v), to = max(v)) I found that up to several hundred intervals it's usually faster to load them with seperate read_fst instances, rather than getting them all at once, provided they're far apart.