fstpackage / synthetic

R package for dataset generation and benchmarking
GNU Affero General Public License v3.0

Additional Streamers to Support Random Access Reads and Appends #41

Open phillc73 opened 4 years ago

phillc73 commented 4 years ago

One of the biggest selling points of fst for me is the ability to randomly access (and append in v0.9.2!) disk-stored data. That is, to load specific data into an R session without needing to restore an entire file. This obviously supports larger-than-memory datasets.

In this sense, fst functionality is more like an SQL database than like other binary file formats such as RDS, qs and Feather.

It would be interesting to first add streamers to this package for stand-alone, serverless SQL databases which are already supported in R. For example:

Perhaps also consider other larger-than-memory dataset initiatives in R to benchmark the random read functionality. For example:

There are probably others I'm not aware of.

This enhancement request also presupposes that the existing streamer_fst() is updated to support random reads (and appends when available). Maybe that's a separate issue.

MarcusKlik commented 4 years ago

Hi @phillc73, thanks for your request!

Yes, adding more streamers for various file formats and databases is certainly one of the main goals for this package. As you say, we will have to add random reads to the streamers, and think about a default scheme to measure the performance of random reads.

For example, we could add a parameter to bench_rows() to allow for reading from random (or predefined) offsets, e.g. offset = c("none", "random", some_percentage). During benchmarking, these settings can be used to define the starting row for reads.

The selected offset can be added to the benchmark results (as a percentage). That way, the user can determine the actual effect of a selected offset on read speed (if the streamer allows for random reads). For some databases or formats, reading from the top of a dataset might be cheaper than reading from the bottom (e.g. random access csv files).
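To make the proposal concrete, here is a minimal sketch of how an offset setting could be turned into a starting row, and how the offset actually used could be reported back as a percentage. The helper name `resolve_offset()` and its arguments `nr_of_rows` and `read_size` are hypothetical and not part of the package; the mapping from percentage to row is just one possible convention.

```r
# Hypothetical helper (not part of synthetic): turn an `offset` setting into
# the row range of a single read.
# `offset` is "none", "random", or a percentage between 0 and 100.
resolve_offset <- function(offset, nr_of_rows, read_size) {
  stopifnot(read_size >= 1, read_size <= nr_of_rows)

  # last starting row at which a read of `read_size` rows still fits
  max_start <- nr_of_rows - read_size + 1

  if (is.numeric(offset)) {
    stopifnot(length(offset) == 1, offset >= 0, offset <= 100)
    start <- 1 + floor((max_start - 1) * offset / 100)
  } else if (identical(offset, "random")) {
    start <- sample.int(max_start, 1)
  } else if (identical(offset, "none")) {
    start <- 1
  } else {
    stop("offset must be \"none\", \"random\" or a percentage")
  }

  # offset actually used, as a percentage, for the benchmark results
  offset_pct <- if (max_start > 1) 100 * (start - 1) / (max_start - 1) else 0

  list(from = start, to = start + read_size - 1, offset_pct = offset_pct)
}

top    <- resolve_offset("none",   nr_of_rows = 1000, read_size = 100)
middle <- resolve_offset(50,       nr_of_rows = 1000, read_size = 100)
bottom <- resolve_offset(100,      nr_of_rows = 1000, read_size = 100)
rnd    <- resolve_offset("random", nr_of_rows = 1000, read_size = 100)
```

A streamer that supports row ranges could then pass `from` and `to` straight to its reader (for fst, `read_fst()` accepts `from` and `to` arguments), while a streamer without random access would simply ignore the setting.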

What do you think, would that be a good solution for adding random reads to the benchmarks?

thanks!

phillc73 commented 4 years ago

I hadn't really looked at bench_rows() but your suggestion seems to make sense, and I for sure hadn't thought about random reads from the beginning, middle and end.

The only thing I would add is that not all SQL queries return a contiguous block of rows (assuming SQL dbs are added to the streamers). The case where a conditional select is applied should also be considered, e.g.

dbGetQuery(con, "SELECT mpg, cyl FROM mtcars WHERE disp >= 200")

mtcars[disp >= 200, c("mpg", "cyl")] # when fsttable supports this form
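To illustrate why this case differs from an offset-based read, the sketch below uses only base R and the built-in mtcars data frame (no database or fsttable involved). It shows that the rows matching the condition are scattered through the table, and groups them into runs of consecutive rows. The variable names are illustrative only.

```r
# Row positions matched by the condition from the example above
idx <- which(mtcars$disp >= 200)

# Base R equivalent of the conditional select
selected <- mtcars[idx, c("mpg", "cyl")]

# TRUE only when the matched rows form one uninterrupted block
is_contiguous <- all(diff(idx) == 1)

# Group matched rows into runs of consecutive positions;
# each run could be served by a single range read
runs <- split(idx, cumsum(c(1, diff(idx) != 1)))
ranges <- data.frame(
  from = vapply(runs, min, integer(1)),
  to   = vapply(runs, max, integer(1))
)

print(is_contiguous)
print(ranges)
```

A benchmark for conditional selects would therefore probably need to measure several smaller range reads (or a full scan with filtering) rather than one read from a single offset.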