Planned milestones for future releases

MarcusKlik commented 6 years ago

The features currently planned for fst:

version 0.8.4:

Intermediate release to fix the Clang 6.0 build errors (#118)

version 0.8.6:

Multi-threaded serialization of character columns
Basic data.table interface:

ft <- fst("1.fst")

ft[1:1000, .(ColA, ColB)]  # on-disk row subsetting + column selection

ft[ColA > 50, .(ColA)]  # on-disk subsetting using simple expression + column selection

ft[ColA == median(ColB), .(ColB)]  # subsetting using custom expression + column selection

ft[ColA == ColB, .(ColSum = ColA + ColB)]  # subsetting + compute on column selection

Note that there is no grouping functionality in this basic interface (yet), but there will be:

on-disk (random) logical row sub-setting (requiring only memory for selected rows) (argument i)
on-disk row sub-setting using an expression (requiring only memory for columns used in the expression) (argument i)
on-disk column selection (argument j)
computations on column selection (compute on j)
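
For comparison, here is a minimal sketch of the column selection and contiguous row ranges that `read_fst()` already offers through its `columns`, `from` and `to` arguments. It assumes a version of fst that provides `write_fst()` and `read_fst()`. The file and the column names are made up for illustration, and the expression filter still runs in memory here rather than on disk.

```r
library(fst)

# Hypothetical example data; the temporary file and column names are made up
path <- tempfile(fileext = ".fst")
df <- data.frame(ColA = 1:2000, ColB = rev(1:2000))
write_fst(df, path)

# On-disk column selection plus a contiguous row range:
# only rows 1 to 1000 of ColA and ColB are read into memory
first_rows <- read_fst(path, columns = c("ColA", "ColB"), from = 1, to = 1000)

# An expression filter is applied in memory, after reading
# only the column that the expression needs
col_a <- read_fst(path, columns = "ColA")
filtered <- col_a[col_a$ColA > 50, , drop = FALSE]
```

The planned `ft[i, j]` interface would move the row filter in the last step to the file itself, so that only the selected rows need memory.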

Basic dplyr interface:
- filter
- select
- slice
- collect
- sample_n and sample_frac (only needs memory for data in the returned sample)
Hashing of column data

version 0.8.8:

For the data.table interface:
- operator := for column binding to an existing fst file
- rbindlist to row bind multiple fst files into a single file
For the dplyr interface:
- add_row
- add_column
- mutate

version 0.8.8 and later

Later features (in random order):

Add on-disk grouping functionality. That requires on-disk sorting, which can be done using a merge sort algorithm.
lapply-like functionality creating a fst file using a list of inputs (CSVs, custom methods, etc.)
interoperability:
- import data from Apache Parquet files
- Python interface
- C++ interface library
- Julia interface, ...
advanced operations:
- parallel grouping for specific methods (like +, -, *, /, sum, mean, etc.; these methods need a C++ implementation for parallel operations)
- binary search on table key columns (extremely fast sub-setting of a key range)
- merge operations on multiple fst files (right join to start with, like in data.table)
- multiple fst files representing a single data set
- a set of fst files can be sorted in parallel into a new set of fst files. This avoids the slow end-phase of sorting algorithms like merge sort.
- user-defined map-reduce operations that can be used on the fst file(s) in parallel. Simple example: a custom mean method using 1) sum and count for each chunk 2) the results from 1) to calculate the mean.
- fill a data set range with specific rows from a fst file, overwriting data in-memory (#29)
performance and security enhancements:
- encryption
- SIMD upgrades to the bit-shifters and pre-serialization filters used in fst
- a plug-in system (C++) for custom compressors to allow users to come up with faster or better compressors
- better character column compression
- high compression mode for slow IO (network) speeds (#23)
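
The user-defined map-reduce idea in the list above can be sketched in plain R. This is only an illustration under stated assumptions, not the planned fst API. A numeric vector stands in for an on-disk column, and all names (`values`, `chunk_size`, `partials`) are hypothetical. The sketch computes a mean, which can be reconstructed exactly from per-chunk sums and counts. An exact median cannot be derived from those two statistics alone.

```r
# Hypothetical stand-in for a large on-disk column: a plain numeric vector
# that is processed in fixed-size chunks
values <- c(4, 8, 15, 16, 23, 42, 7, 9, 11, 5)
chunk_size <- 3

# Split the values into consecutive chunks (the last chunk may be shorter)
chunk_ids <- ceiling(seq_along(values) / chunk_size)
chunks <- split(values, chunk_ids)

# Map step: each chunk independently yields a partial sum and a count,
# so chunks could be processed in parallel
partials <- lapply(chunks, function(x) c(sum = sum(x), count = length(x)))

# Reduce step: add up the partial results and derive the mean
totals <- Reduce(`+`, partials)
chunked_mean <- unname(totals["sum"] / totals["count"])
```

Because the map step only needs one chunk at a time, memory use is bounded by the chunk size rather than by the size of the column.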

This list is subject to a lot of change depending on features and issues requested/reported by users of the fst package :-)

krlmlr commented 6 years ago

dplyr also implements some operations internally ("hybrid evaluation"), maybe there's potential for code reuse?

shrektan commented 6 years ago

Is it possible to add the ability to write an empty data frame to your milestones? It’s a very useful feature, at least for myself, thanks.

MarcusKlik commented 6 years ago

Hi @krlmlr, that's a very nice feature! And yes, I think that it will greatly speed up processing to be able to use the fst internal representation for common methods (a logical vector takes only 2 bits per element in the internal format for example).

The interface will have to be built around an offline object representing a fst file. All operations should be performed on that abstract object so that they can be parallelized automatically. For summarise, that could be done by processing groups in parallel. But you can't call custom R methods from any thread other than the master thread. That means that data could be loaded from disk in the background, but processing would still be single-threaded. Perhaps we could use the hybrid evaluation to replace known methods with multi-threaded ones; that would be great.

Thanks for the idea!

MarcusKlik commented 6 years ago

Hi @shrektan, thanks for your question! Writing empty tables is on the list for release v0.8.6. A discussion on that can also be found in #99.

Indeed, it would be nice to be able to use the empty table to store metadata. Also, when row binding is supported, you could have a situation where an empty table would be the first result of an apply call for example.

shrektan commented 6 years ago

That’s great. Also, I’m so excited to see the idea of the data.table interface feature. If that’s possible, I can only say a big WOW. Looking forward to that.

xiaodaigh commented 6 years ago

github.com/xiaodaigh/disk.frame

I am experimenting with a data.table-like on-disk data manipulation package backed by fst. Early signs are promising.

MarcusKlik commented 6 years ago

Hi @xiaodaigh, your package is very impressive! Would it be possible to run the disk.frame interface against a parallel cluster (with each node processing some chunks)?

One of the bottlenecks if you have a lot of data is that you can use many threads to read data from file (for example with fst), but running (custom) R methods always has to be done on the (single) main thread. Combining a cluster with a fst back-end would give you the option to tune where you need the most CPU power 👍
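
A rough sketch of that combination, assuming a local cluster created with the base `parallel` package and a fst file that every worker can reach. The file, the chunk layout and the use of `sum()` as a stand-in for a custom R method are all made up for illustration. This is not how disk.frame or fst implement it.

```r
library(fst)
library(parallel)

# Hypothetical example file with a single column x
path <- tempfile(fileext = ".fst")
n_rows <- 10000
write_fst(data.frame(x = seq_len(n_rows)), path)

# Split the row range into one contiguous chunk per worker
n_workers <- 2
starts <- seq(1, n_rows, by = ceiling(n_rows / n_workers))
ends <- c(starts[-1] - 1, n_rows)

# Each worker is a separate R process with its own main thread, so the
# custom R code runs in parallel across workers
cl <- makeCluster(n_workers)
partial_sums <- parLapply(cl, seq_along(starts), function(i, path, starts, ends) {
  # Read only this worker's row range from the fst file
  chunk <- fst::read_fst(path, from = starts[i], to = ends[i])
  sum(chunk$x)  # stand-in for a custom R method
}, path = path, starts = starts, ends = ends)
stopCluster(cl)

# Combine the per-chunk results on the master process
total <- sum(unlist(partial_sums))
```

The number of workers and the number of fst reading threads per worker are then two separate knobs, which is the tuning option described above.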

xiaodaigh commented 6 years ago

Firstly it's a great compliment from the creator of fst! I wouldn't call my package impressive if the bar is fst. Anyway it's still early stages but thanks!

@MarcusKlik > Would it be possible to run the disk.frame interface against a parallel cluster (with each node processing some chunks)?

I think it's possible and probably will be done at some point. One thing I am not certain about is what working with clusters in R will be like. I know Julia was designed to work well with clusters, so I was thinking about building something in Julia and making it call R (hopefully Julia's native DataFrame will catch up in performance and it can get a native fst reader).

Crazy idea: we can start a company making a fst-backed distributed data manipulation tool that maximizes single machine performance as well.

MarcusKlik commented 6 years ago

That's already a nice sales pitch you have there :-)

fstpackage / fst

Planned milestones for future releases #117