fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Bug Report: In R: fst crashes in both saving and reading very large files (500M+ rows and 50+ columns, 100+ GB) #46

Closed wei-wu-nyc closed 6 years ago

wei-wu-nyc commented 7 years ago

I am having problems saving and reading very large files with the fst R package. I have used both the CRAN version and the development version. The problem occurs intermittently: every few tries on saving gets me a successful save, but most of the reads so far have failed. However, if I read a subset of the file, the reads are mostly successful. The data is a big data.table. I don't know how to provide more info. The following is the error from read.fst():

system.time({a=read.fst('/dev/shm/AllHorizonDT_00.fst',as.data.table=T)})
caught segfault address 0x7f531143e038, cause 'memory not mapped'

caught segfault address 0x7f71c25db020, cause 'memory not mapped'

Traceback:
1: .Call("fst_fstRead", PACKAGE = "fst", fileName, columnSelection, startRow, endRow)
2: fstRead(fileName, columns, from, to)
3: read.fst("/dev/shm/AllHorizonDT_00.fst", as.data.table = T)
4: system.time({ a = read.fst("/dev/shm/AllHorizonDT_00.fst", as.data.table = T) })

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: caught segfault
Selection: address 0x7f531143e038, cause 'memory not mapped'

This is the error from write.fst():

Saving All data for all horizons ...
caught segfault address 0x7f201085103a, cause 'memory not mapped'

Traceback:
1: .Call("fst_fstStore", PACKAGE = "fst", fileName, table, compression)
2: fstStore(normalizePath(path, mustWork = FALSE), x, as.integer(compress))
3: write.fst(AllHorizonDT, path = filename, compress = fst_compress_level)
4: save_horizon_data(AllHorizonDT, formatsub = "Formatted/", maturedsub = "matured/", agesub = "Ages/", horizsub = "AllHorizon", savefst = T)
5: eval(expr, envir, enclos)
6: eval(ei, envir)
7: withVisible(eval(ei, envir))
8: source("Product_DataPrep.R")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

Neither is very informative. Basically it looks like some kind of memory problem.

Let me know what I can do to help debugging this. Thanks.

MarcusKlik commented 7 years ago

Hi @wei-wu-nyc , thanks for the bug report. In the last few weeks I fixed a lot of so-called ASAN and UBSAN warnings from CRAN, which also solved a few small memory leaks. I just pushed much of that code to the development branch of the fst package and would be much obliged if you could rerun your code with that latest version to see if you still get the crashes?

devtools::install_github("fstpackage/fst", ref = "develop")

thanks for the time invested!

MarcusKlik commented 7 years ago

The current version of fst is limited to 2 billion rows (INT_MAX). I will upgrade the row counter to a 64-bit uint64 to lift that restriction. The same is true for the number of columns, although I don't expect many users to have more than 2 billion columns :-). Do you have more than 2 billion rows in your data set?

wei-wu-nyc commented 7 years ago

@MarcusKlik, I don't have more than 2 billion rows; it is about 500 million rows. So that was not the problem.

I have installed the new develop version. (BTW, the test is on a Linux AWS instance.) So far the only test I did was to use the new version to read the .fst file that was generated with the previous version of the fst package. It still crashed the couple of times I tried.

I will try to test:

  1. write.fst() using the new version
  2. read.fst() reading the file generated by the new version. Since the data set is pretty large, it takes time to do this. (Every time R crashes, the data has to be reloaded.) Anyway, I will report the results back later.
wei-wu-nyc commented 7 years ago

Even after installing the current development version, both read.fst() and write.fst() still crash. It seems to be related to the size of the data. The data set I am working on has about 510 million rows, and each row takes about 400 bytes, so 500M rows comes to about 200 GB of data.

Both read.fst() and write.fst() crash on the full dataset. But if I write a subset of the data, or read a subset of the .fst file (from a .fst file containing the full dataset that was successfully saved after many tries), it succeeds for up to about 250 million rows, which is about 100 GB of data.
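
For reference, this is roughly how I read the row subsets (same file path as in the traceback above; the exact row count varies per test):

# read only a row range instead of the whole table
a_part <- read.fst('/dev/shm/AllHorizonDT_00.fst', from = 1, to = 250000000,
                   as.data.table = TRUE)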

MarcusKlik commented 7 years ago

Hi @wei-wu-nyc , it seems strange that you can load a subset but only up to about 250 million rows. Are you having the same problem when you load more than 250 million rows but select only a single column?

like so, for example:

fst_file <- "/dev/shm/AllHorizonDT_00.fst"  # fst file
column_names <- colnames(read.fst(fst_file, from = 1, to = 1))  # read column names

# Read columns one at a time and discard the result to conserve memory
lapply(column_names, function(col) { read.fst(fst_file, col); gc(); return(TRUE) } )

thanks for testing, it's hard to determine the cause of the problem without the actual data, but I will try to reproduce your error by generating a data set with more than 250 million rows...

wei-wu-nyc commented 7 years ago

@MarcusKlik It may take some time for me to test this. I have deleted the big .fst file, and now I am unable to write.fst() the big file anymore. (It took me a few tries to finish the saving without crashing.) I also don't have much time in the next couple of days; I will post here when I get more testing done. Thanks a lot for your package and your work. The load and save speed of the fst package is really impressive, as is the ability to load a subset of the data from a binary file. I am really looking forward to using this package on a more regular basis, since it can save me a lot of time. However, due to the crashes with large files, I have to work around the problem by splitting the data into separate files and then reading and merging the data. Unfortunately, the rbindlist() call takes a long time and negates the time saved by loading the data with read.fst() instead of load().
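
For concreteness, the workaround looks roughly like this (two chunks shown for brevity; the part file names are placeholders, and AllHorizonDT and fst_compress_level are the same objects as in the traceback above):

# split the table in halves and write each half to its own fst file
n <- nrow(AllHorizonDT)
write.fst(AllHorizonDT[1:(n %/% 2), ], "AllHorizonDT_part1.fst", compress = fst_compress_level)
write.fst(AllHorizonDT[(n %/% 2 + 1):n, ], "AllHorizonDT_part2.fst", compress = fst_compress_level)

# read the parts back and re-combine them; rbindlist() is the slow, memory-hungry step
parts <- lapply(c("AllHorizonDT_part1.fst", "AllHorizonDT_part2.fst"),
                read.fst, as.data.table = TRUE)
AllHorizonDT <- data.table::rbindlist(parts)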

MarcusKlik commented 7 years ago

Thanks @wei-wu-nyc ! I will try to reproduce your problem myself with random data sets larger than 250 million rows. If you could pinpoint for which column type (factor, character, real, logical or integer) the problem occurs, that would be very helpful. Perhaps I can ask you to rerun your code later with updated versions of the fst develop branch?

wei-wu-nyc commented 7 years ago

@MarcusKlik A quick report back. My data has columns of character, integer and numeric types. I am not able to save the whole dataset anymore.

  1. I can save the character fields with all rows repeatedly. I can also read back all rows of the file with the character fields with no problem.
  2. I can also save the integer fields with all rows, but it core dumped once and succeeded twice. Reading all the rows back from the file with the integer fields is fine.
  3. Saving the numerical fields always failed. I didn't get a chance to test reading the numerical fields back in, as I wasn't able to save these fields for all rows. That's the testing so far. Thanks.
MarcusKlik commented 7 years ago

Hi @wei-wu-nyc, that offers some additional information. Apparently, the fst package has problems allocating more than 2 GB of memory for a single numerical column (250 million rows times 8 bytes per numerical value). That's right at INT_MAX (signed), so I'm guessing there is some integer rounding being performed where a 64-bit size is required. Thank you for reporting, I will let you know when I find the problem.
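
For reference, the boundary in plain R (these are just the numbers from this thread, not the actual fst internals):

.Machine$integer.max     # 2147483647, the largest signed 32-bit value
250e6 * 8                # 2e9 bytes for a 250 million row numerical column, right at that limit
as.integer(2^31)         # NA with a warning: a size like this no longer fits in a 32-bit int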

wei-wu-nyc commented 7 years ago

@MarcusKlik, this sounds like a reasonable hypothesis. Hopefully you will be able to find and fix the problem; I am glad I can help. I am using fst now and looking forward to using it more and more. What platform are you developing the fst package on: Linux, Mac, Windows? Also, I don't see a makefile. What compiler and environment do you use for the development of fst? I want to see if I can clone the project and take a look on my machine. (I haven't done anything in C/C++ for 15+ years :-)) Thanks.

MarcusKlik commented 7 years ago

Hi @wei-wu-nyc , I believe I have located the problem (indeed related to a downcast to an integer). The latest version from the develop branch should have a fix for your problem, which you can install with:

devtools::install_github("fstpackage/fst", ref = "develop")

I hope that solves your problem! On my machine I can now write 1 billion numericals (8 GB of data) without problems. You might have to re-write the fst file first to be sure...

MarcusKlik commented 7 years ago

I am developing the fst package on both Ubuntu and Windows, and using Travis the package is also built on Mac (automatically after each commit). The makefile is a special R package makefile (this one) which compiles the third-party LZ4 and ZSTD compressors and the C++ code of the fst package. Soon, the C++ core of the fst package will be refactored into its own library. From that moment on, it will be much easier to develop it using, for example, Visual Studio (you can use the newest version to develop cross-platform code using the Clang compiler). If you want to build the project out of the box, it's easiest to clone the code, create an RStudio project in the package directory and then use the build button in RStudio (but make sure you have RTools installed)!

wei-wu-nyc commented 7 years ago

@MarcusKlik After reinstalling the latest develop version, I can confirm this release fixed the core dumps on saving and loading large data sets and files. I have not done many tests for more complete results (each run takes some time), but for the same big data.table I had before, I can write.fst() the whole file in one shot, and I can reload the big file into memory with read.fst(). Both of these tasks caused problems before. Thanks a lot for the quick fix.

MarcusKlik commented 7 years ago

Hi @wei-wu-nyc , that's great, thanks for reporting! You should be fine now at least up to 2 billion rows. Support for long vectors (>2^31 elements) is present in R from version 3.0.0, and those are on the feature list as well (but not at the top though :-)). Additionally, when the multi-chunk feature is implemented (fst.rbind), larger data sets could also be stored as a series of large (2 billion) row chunks. Much obliged for all the effort you put in!

wei-wu-nyc commented 7 years ago

Thanks @MarcusKlik for all the impressive work on this package. fst.rbind was actually a feature I almost requested here. To work around the crashes on big data sets reported here, I split and saved the data in multiple files, read in the smaller files, and then have to rbindlist() them in memory. This basically doubles the in-memory footprint; an fst.rbind on a set of files without using 2x memory would be really nice. Any plans for multi-thread / multi-core support? I see that some of the R I/O packages utilize multiple cores to speed things up. The fst package seems to use only one CPU, although it is even faster than packages using many CPUs; if you can utilize multiple CPUs, it will make things even faster.

MarcusKlik commented 7 years ago

Hi @wei-wu-nyc , there are definitely plans for multi-core support! First multi-threaded compression and decompression and multi-threaded IO, and later also multi-threaded sorting and grouping. I've added a list of milestones currently planned for fst with some features in #48. Any feedback you have is very welcome! As you can see, many plans and much work to do :-). Also, I believe it would be a good idea to take a thorough look at what 3D XPoint technology will mean for fast serialization. Apart from the speed, random seeks will be much cheaper (single-byte access), and that means a serialization format can be more flexible ('random access') than ever before without losing performance. That will be very important for serialization formats like fst in the future!

wei-wu-nyc commented 7 years ago

@MarcusKlik Thanks for the heads-up. I took a look at issue #48; one feature I am looking forward to is the ability to import Spark Parquet files efficiently. Currently, serialization of Spark Parquet files from R is not just painfully slow, it is mostly broken for my purposes (accessing or importing large amounts of data into R data.frames or data.tables). When trying to transfer large amounts of data between R and Spark, it is painfully slow and frequently crashes. If you provide an efficient way to serialize data between R and Parquet files, that would be very useful for us.

MarcusKlik commented 7 years ago

@wei-wu-nyc Indeed, being able to read from Parquet files would be very valuable for interoperability, as many languages already have an interface to the Parquet file format. Parquet files have a layout similar to the fst format and they also have a limited set of (basic) types (but there is no random-access compression in Parquet). The same arguments hold for the Apache Arrow in-memory format, and it would be interesting to have the C++ code bases of fst and Arrow sharing more features in the long run. Recently, @wesm opened a feature request in Arrow to implement a compressed buffer stream as a series of compressed blocks, like in fst. That would be an extremely useful feature to have in Apache Arrow!

wei-wu-nyc commented 7 years ago

@MarcusKlik The use case for Parquet files is this: nowadays, many of the big data sets are on the Spark platform. When working within the Spark environment, things work OK. However, when one needs to load a relatively big dataset into R, the bottleneck is the serialization of data between Spark and R.

I forgot to add this in my previous post. The other set of very useful features is the ability to work on the .fst file(s) offline without needing to load the whole data file into memory: merging two .fst files into one, merging an in-memory data.frame or data.table into an existing .fst file, adding column(s) without loading the whole data set into memory, selectively loading a subset of the data file into memory, etc. The use case for this is: for many of the very big datasets, we do load the whole data into memory for some big jobs. However, this is costly in resources and needs planning. For some simple data prep and testing steps, we may just want to work on the data without the big machines, using a modest-size machine to manipulate the data off-memory.
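
To be clear, the "selectively loading a subset" part is already partly possible with the current read.fst() arguments; for example (column names here are just placeholders):

# load only the columns needed for a small data-prep step
cols_needed <- c("id", "balance")   # placeholder column names
small_view <- read.fst('/dev/shm/AllHorizonDT_00.fst', columns = cols_needed,
                       as.data.table = TRUE)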

I am a data scientist, not a programmer, so this is just feedback from my perspective. Thanks.

MarcusKlik commented 7 years ago

@wei-wu-nyc , thanks for the valuable feedback! Working off-memory with large data sets is exactly the long-term goal of the fst package. With (NVMe) storage getting faster and 3D XPoint memory devices upcoming, processing very large data sets with relatively small machines is suddenly in scope, and I believe there is room for frameworks that sit between large clusters and limited single-machine solutions (but use very fast storage devices).

Just to elaborate on the features that you describe:

fst_table <- fst(c("1.fst", "2.fst", "3.fst"))  # single logical object referring to multiple files
fst_table[Year > 2015, sum(Amount)]  # some operation on the set of files
wei-wu-nyc commented 7 years ago

@MarcusKlik my requests are simpler than what you describe, but your description adds a much richer feature set than I imagined:

  1. Merging two fst files together: I was thinking about the case where I have two tables or two .fst files that have the same format but were generated on different subsets of the data or for different time periods, etc. I just wanted a way to combine the data without loading it all into memory. I remember one of your earlier versions of the fst package had a fst.merge() function. I did not test it, but I was very happy to see it.
  2. However, your spec of a single logical object referring to multiple fst files is a much more powerful concept than what I asked for. It would achieve what I wanted without even merging the files. (The option to merge into a single file is still useful in some cases, for example when the need to access all the data as a whole data set is permanent; saving into one big file may help performance-wise. Also, the ability to represent a big object as separate files may help read and write performance once multi-thread support is implemented.)
  3. Your second line of hypothetical code opens up much more imagination for future flexibility and possibilities. If there were a way to have a logical object point to a .fst file or a set of .fst files, and allow the user to manipulate the table (modify a column, add a column, set values based on the values of other columns) as if it were a data.frame or data.table, then the user could save the modified data.table back to .fst files. That would be much more powerful, but I imagine it would be a much more complicated programming task than the narrower scope I had in mind. I can imagine that read.fst() could provide a flag, read.fst(..., offmemory=T), so that we can treat the returned object as if it were a data.frame or data.table.
  4. When I said merging in-memory data with an fst file, that was a very simple request. I was thinking of a task closely related to point 1, very similar to write.csv(, append=T): basically, I am creating a new set of data that has the same format as a big data set on disk in an fst file, and I just want the ability to append or rbind() the in-memory data onto the existing fst file.
  5. However, what you described is merging (I guess I used the wrong word in my post; I meant appending, and merging has a more specific meaning, but I am glad I (mis)used the word), which will merge or join the in-memory data with the external data based on key column(s). That would be powerful, almost like a database.

I was first impressed with fst's blazing-fast speed, which saves run time even though your package is such a new project. However, built on top of this exceptional performance there are so many possibilities. I hope you can gradually add the various features to the package. Some of the features look relatively easy to implement (combining different fst files, appending to an fst file, etc.), while some look very complicated. I do think the main reason users are looking at your package is that IO serialization to and from R has become a bottleneck with no very good solutions available. The edge your package has is serialization performance, so please keep that in mind with whatever new features you add and when you prioritize them. If you add the serialization performance enhancements soon, I think you will gain users faster. Performance first, complicated features later.

MarcusKlik commented 7 years ago

@wei-wu-nyc Thanks a lot for that. And you bring a very compelling argument: performance first, complicated features and 'syntactic sugar' later! Some of the performance features should definitely be moved up, as they will make fst increasingly useful for working on large data sets (I will keep the data.frame interface though, as it is relatively easy to implement :-)). Multi-threaded compression and IO, for example, will give a large boost to read and write performance.

On your specific remarks:

  1. Being able to rbind an in-memory data set as well as a second fst file is very useful for large data sets, so I will have fst.rbind accept an in-memory data set as well as the logical object pointing to a fst file. The latter option will take only a very small amount of memory to execute and compression can be maintained without re-compressing the data.

  2. There will only be a small performance hit for working with a logical object pointing to several files instead of one, at least with modern SSDs, as the access time for opening a file is very short. The biggest advantage of a setup with multiple files is that you can write several files simultaneously, which is very useful in multi-threaded scenarios. Reading from a single file can be done with multiple threads anyway because of the random access fst provides, so there is less of an advantage there.

  3. Your description pretty much sums up what I would like to have in the advanced operations milestone (#48). Working with a 'proxy object' that can be used as a data.table but without the memory overhead, so only loading data from disk as needed and appending new columns and rows to the existing on-disk data set. If we have sub-setting, sorting, merging, grouping and appending (rows and columns), we pretty much have a full (dplyr-like) interface to the on-disk data. But as you say, first things first :-)

  4. As in 1, appending will be available in fst soon.

  5. The merging would be an important feature to have for combining large data sets. In-memory merges normally take double the memory of the original data sets but with sorted on-disk data sets it could be done using only a small amount of memory for buffering data.

Thanks a lot @wei-wu-nyc for all the time you spent on reviewing the milestones and planned features!

wei-wu-nyc commented 7 years ago

@MarcusKlik The bug of crashing on reading 250+ million rows is back. I am using the latest development version installed with devtools::install_github("fstPackage/fst"). Writing the data set was OK with no crash (previously there were problems with both reading and writing), but reading the large .fst file into memory crashed R. Ten days ago, the bugs in writing and reading the same data set were both fixed. Thanks.

MarcusKlik commented 7 years ago

Hi @wei-wu-nyc . That's strange, I see that the commit that solved your problem 10 days ago was the latest commit on fst on the official repo, so perhaps the bug wasn't solved after all. On my system, the following code runs without problems, could you check if this works for you too?

# Install latest develop
require(devtools)
devtools::install_github("fstpackage/fst", ref = "develop")

# vector byte size larger than INT_MAX
x <- data.frame(A = 1:1000000000)
write.fst(x, "testBig.fst")

rm("x")  # save some memory
y <- read.fst("testBig.fst", as.data.table = TRUE)

typeof(y$A)
format(object.size(y), units = "GB")

[1] "integer"
[1] "3.7 Gb"

and for doubles:

# vector byte size larger than INT_MAX
x <- data.frame(A = as.double(1:500000000))
write.fst(x, "testBig.fst")

rm("x")  # save some memory
y <- read.fst("testBig.fst", as.data.table = TRUE)

typeof(y$A)
format(object.size(y), units = "GB")

[1] "double"
[1] "3.7 Gb"

thanks for your time!

MarcusKlik commented 7 years ago

Hi @wei-wu-nyc , perhaps the issue here is that in your report you write that you used

devtools::install_github("fstPackage/fst")

to install fst. However, the ref argument of devtools::install_github defaults to the 'master' branch. I hope that when you use

devtools::install_github("fstpackage/fst", ref = "develop")

your problem is solved. I noticed that most package repos on GitHub use the 'master' branch for the latest commits, effectively making it the 'develop' branch. Perhaps it would be clearer to users to just have a master branch and feature branches, instead of master, develop and feature branches!

wei-wu-nyc commented 7 years ago

@MarcusKlik You beat me to it.

  1. For your examples, when I run them with my installed package, they either crash or give me an error message about an unknown column type, like the following:

    y=read.fst('testBig.fst') Error in fstRead(fileName, columns, from, to) : Unknown type found in column.

  2. I noticed the same thing as you did: if I install the develop version using ref="develop", your examples run without crashing.
  3. However, for my data, I still get a core dump when I read back the previously generated .fst file. I will need to test again once I re-generate the .fst file using the new development version of the package. I will report back here.
  4. If this is the problem, then you should update the home page or README of the fst package; they currently have the following lines:

    You can also use the development version from GitHub:

install.packages("devtools")

devtools::install_github("fstPackage/fst")

wei-wu-nyc commented 7 years ago
  1. In this case, you can see a use case for converting from an RData or RDS file to fst with limited memory usage :-)
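
For example, right now the conversion would look roughly like this (file names are placeholders) and needs the whole table in memory at once:

x <- readRDS("AllHorizonDT.rds")                      # loads the full table into memory
fst::write.fst(x, "AllHorizonDT.fst", compress = 50)  # then writes it back out as fst
rm(x)
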
MarcusKlik commented 7 years ago

Ha, quite right! I will update the README for now but probably switch to using the 'master' branch soon, thanks!

wei-wu-nyc commented 7 years ago

@MarcusKlik Reporting status back: using the ref="develop" version of fst for both writing and reading my large data set does not cause a core dump.

MarcusKlik commented 7 years ago

Great, thanks! The README and website now have the correct install.packages options specified, thanks for pointing that out!

renkun-ken commented 6 years ago

I'm trying to write a big data.table (50M rows, 60 columns, a few int, mostly double). write.fst writes it to disk very quickly without any error, but I can't read.fst it back (Unknown type found in column).

I tested writing all rows of 5 columns and it works.

MarcusKlik commented 6 years ago

Hi @renkun-ken, thanks for reporting, are you using the dev version of fst or the version from CRAN? (with the CRAN version, that error has been reported when the disk ran out of space and the header was incorrectly written to the file)

MarcusKlik commented 6 years ago

Hi @renkun-ken , when you only write the first few rows, does the error occur as well, e.g.:

fst::fstwrite(dt[1:10, ], "test.fst")
fst::fstread("test.fst")
MarcusKlik commented 6 years ago

Hi @renkun-ken, I have done a small test with the largest table that fits in the memory of my laptop and has 50 columns with integers and doubles (60 / 40 ratio):


library(data.table)

create_column <- function(nr_of_rows)
{
  if (sample(c(TRUE, FALSE), 1, prob = c(0.4, 0.6)))
  {
    return(sample(1L:100L, nr_of_rows, replace = TRUE))
  }
  runif(nr_of_rows, -10, 10)
}

create_table <- function(nr_of_rows, nr_of_columns)
{
  dt <- data.table(Col1 = create_column(nr_of_rows))

  lapply(2:nr_of_columns, function(x){ dt[, paste0("Col", x) := create_column(nr_of_rows)] })

  dt
}

# Create table and measure size
dt <- create_table(17000000, 50)
dt_size <- as.numeric(object.size(dt))

# Writing benchmark
bench_write <- microbenchmark::microbenchmark(
  fst::fstwrite(dt, "large_table.fst", compress = 60), times = 1)

# Writing speed
dt_size / bench_write$time
#> [1] 0.5262713

# clear memory for reading and garbage collect (output removed)
rm("dt")
gc()

# Reading benchmark
bench_read <- microbenchmark::microbenchmark(
  fst::fstread("large_table.fst"), times = 1)

# Reading speed
dt_size / bench_read$time
#> [1] 1.183014

# Compression ratio
file.size("large_table.fst") / dt_size
#> [1] 0.7512043

Everything looks normal; at this moment I can't reproduce your error with this (smaller) table. As soon as I have access to a larger system, I will rerun with a 60M row / 60 column table, so stay tuned :-)

renkun-ken commented 6 years ago

I'm using the CRAN version. When I try devtools::install_github("fstpackage/fst@develop"), it says that I don't have the necessary build tools to build from source, but I can't find any documentation about this.

renkun-ken commented 6 years ago

@MarcusKlik I tried saving the top 100 rows of the big data.table; it works without any problem.

MarcusKlik commented 6 years ago

@renkun-ken, if you install the latest RTools, you should have all the tools you need, you can check with:

devtools::find_rtools()
#> [1] TRUE

hope that works for you!

renkun-ken commented 6 years ago

I'm working on Ubuntu 16.04; the toolchain should work out of the box.

MarcusKlik commented 6 years ago

That's very strange. Indeed, on Ubuntu 16.04 you shouldn't have to install anything. I just downloaded a fresh Ubuntu 16.04 image and installed it in VMPlayer. Then, after installing r-base, libssl-dev, and libcurl4-openssl-dev in the VM, I could install devtools without problems in R.

After installing devtools, your command

devtools::install_github("fstpackage/fst@develop")

works without problems on that fresh install. I didn't try with RStudio; are you using that to run your package installation code? If so, could you try to run the code from the command line?

MarcusKlik commented 6 years ago

Hi @renkun-ken, hopefully you can get the dev version working. I checked the benchmark above on a larger machine using your 50M rows and 60 columns and got no errors. Perhaps installing the dev version will solve your problems; if not, please let me know!


library(data.table)

fst::fstsetthreads(10)  # lower to 10 threads
#> [1] 40

create_column <- function(nr_of_rows)
{
  if (sample(c(TRUE, FALSE), 1, prob = c(0.4, 0.6)))
  {
    return(sample(1L:100L, nr_of_rows, replace = TRUE))
  }

  runif(nr_of_rows, -10, 10)
}

create_table <- function(nr_of_rows, nr_of_columns)
{
  dt <- data.table(Col1 = create_column(nr_of_rows))

  lapply(2:nr_of_columns, function(x){ dt[, paste0("Col", x) := create_column(nr_of_rows)] })

  dt
}

# Create table and measure size
dt <- create_table(50000000, 65)
dt_size <- as.numeric(object.size(dt))

1e-9 * dt_size  # show object size in GB
#> [1] 20.00001

# Writing benchmark
bench_write <- microbenchmark::microbenchmark(
  fst::fstwrite(dt, "large_table.fst", compress = 75), times = 1)

# Writing speed
dt_size / bench_write$time
#> [1] 0.7539288

# clear memory for reading and garbage collect
rm("dt")
gc()
#>           used (Mb) gc trigger    (Mb)   max used    (Mb)
#> Ncells  584821 31.3     940480    50.3     940480    50.3
#> Vcells 1010141  7.8 2870259362 21898.4 2701435254 20610.4

# Reading benchmark
bench_read <- microbenchmark::microbenchmark(
  fst::fstread("large_table.fst"), times = 1)

# Reading speed
dt_size / bench_read$time
#> [1] 2.608564

# Compression ratio
file.size("large_table.fst") / dt_size
#> [1] 0.6476882
wei-wu-nyc commented 6 years ago

@MarcusKlik @renkun-ken Just want to chime in a little bit. I have been using fst to read and write fst files several GB in size for the last few months with minimal problems. I use them across platforms, both on Ubuntu 16.04 and on Mac OSX. I sometimes get read errors when the versions of fst are out of sync; that is solved after installing the newer versions. I am using the develop version of fst.

MarcusKlik commented 6 years ago

Thanks @wei-wu-nyc and good to read that fst works for your workflow! Changing the format between commits is far from ideal. My goal is to make those changes now so that after the next CRAN release I won't have to. After that release fst will be backwards compatible, so any changes after that would lead to 'version dependent code' inside the C++ core.

Additionally, now that most data-types are in the format, changes to the format will be far less frequent!

renkun-ken commented 6 years ago

@MarcusKlik I installed from GitHub in the R terminal. When I save the big data.table with compress = 0, it works without a problem and the file is readable. When I save it with compress = 100, all 40 threads start computing in parallel, the file reaches 89 GB and then stops; it hangs with no CPU working and no further increase in file size. Not sure what happened.

MarcusKlik commented 6 years ago

Hi @renkun-ken, so the Ubuntu 16.04 installation problem only occurred when using devtools from the RStudio IDE ?

Thanks for testing fst at the limits (I guess :-)) of your system, that's exactly the use case for which it is built. Increasing the table size to 200 million rows and 60 columns (in the code sample above), I get an in-memory table of 79.2 GB. During the 'benchmark' I see the following performance graphs (processor and memory usage):

(screenshot: processor and memory usage graphs during the benchmark)

Everything works as expected. Are you hitting any boundaries on your system, e.g. a memory limit or disk size limit? Perhaps you could try with a lower number of cores; maybe there are specifics of the OpenMP implementation on your system that make the threads lock?

The uncompressed write and read currently use a single thread only, so no OpenMP problems are expected there...
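
For example, something like this would force a low thread count for the compressed write (using the dev-branch fstsetthreads / fstwrite functions from the benchmark above; dt stands for your big data.table):

fst::fstsetthreads(4)      # limit the number of OpenMP threads for this test
fst::fstwrite(dt, "large_table.fst", compress = 100)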

renkun-ken commented 6 years ago

@MarcusKlik There are 40 CPUs and 500 GB of RAM on the server. I notice that during the hang, the free space on my home disk goes from 65 GB to 20 KB, but the target file is on a disk with much more spare space (more than 3 TB). I'm not sure why the home disk is exhausted. When I kill the whole session, the space is freed soon.

wei-wu-nyc commented 6 years ago

@MarcusKlik , the bug of a core dump when reading or writing a .fst file with more than 256MM rows seems to be back. I am still running a similar data set, with several hundred million rows.

  1. When writing, sometimes I was able to save data with slightly more than 256 * 1024 * 1024 rows to fst files. But when I try to read the file back, fst seems to crash past 256MM rows.
  2. When the data is several hundred million rows, write.fst() crashes.
  3. I tried with different compression levels, 100 and 75; both have problems. I ran the develop version I installed in April with no problem for more than 6 months until recently. Due to another problem, I had to re-install fst, so I re-installed the develop version (the CRAN version still does not support large files, I think) a week or two ago. The above problems all happened in the environment described above: Ubuntu Linux, R 3.3.1, fst develop version. Thanks. I hope you can have a quick fix for this. Currently, I am breaking the files into multiple chunks, but I am not sure whether there is any data corruption in the saved fst files, even when the row counts are below 256MM and there are no crashes in reading and writing them.
MarcusKlik commented 6 years ago

Hi @wei-wu-nyc, thanks for reporting, I will test the latest develop version with a data-set with more than 256 M rows and will get back to you!

MarcusKlik commented 6 years ago

Hi @wei-wu-nyc, I can confirm your problem using a data frame with columns that have in-memory sizes larger than 2^31 bytes. To resolve, I have upgraded all size-related pointers in the fst core code to 64-bit values. With the latest develop version, all row limits should be lifted, including for tables with more than 2^31 rows (so 8 GB columns of integers for example). The following script shows an example:

test_large_table <- function(dt) {
  # object size in GB
  print(object.size(dt), units = "Gb")

  # write with moderate compression
  fst::fstwrite(dt, "1.fst", compress = 50)

  dt2 <- fst::fstread("1.fst")

  sum(dt[[1]] != dt2[[1]]) == 0
}

# double column size > 2^31 bytes (INT_MAX in C++)
gc()
test_large_table(data.table::data.table(Double = as.numeric(1:500e6)))
#> 3.7 Gb
#> [1] TRUE

# integer column size > 2^31 bytes (INT_MAX in C++)
gc()
test_large_table(data.table::data.table(Int = as.integer(1:1e9)))
#> 3.7 Gb
#> [1] TRUE

# character column size > 2^31 bytes (INT_MAX in C++)
gc()
test_large_table(data.table::data.table(Char = rep('A', 500e6)))
#> 3.7 Gb
#> [1] TRUE

All three tables pass the fstwrite / fstread test now. The test sets above have number of rows smaller than 2^31 but column sizes larger than 2^31. When I have access to a computer with more memory I will also test the cases with nr_of_rows > 2^31.

For now, I hope this will solve the problems you encountered! And thanks a lot for reporting them!

wei-wu-nyc commented 6 years ago

@MarcusKlik The new develop version seems to fix the issues I had in writing and reading large data sets. Thanks. (I haven't done very extensive testing yet.)

MarcusKlik commented 6 years ago

Hi @wei-wu-nyc, that's great! Please let me know if you encounter any more issues. Tables with more than 2e9 rows require some additional thinking, because the row selection cannot be done with integer arguments anymore (I'm thinking of allowing numerical and int64 values to make a row selection...)
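
For illustration, this is why plain integer arguments stop working at that point (bit64 is just one common way to represent 64-bit integers in R, not necessarily what fst will end up using):

as.integer(2^31)             # NA with a warning: larger than .Machine$integer.max
2^31                         # fine as a numerical (double) value, exact up to 2^53
bit64::as.integer64(2^31)    # fine as a 64-bit integer (requires the bit64 package)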