fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Planned milestones for future releases #117

Open MarcusKlik opened 6 years ago

MarcusKlik commented 6 years ago

The features currently planned for fst:

version 0.8.4:

Intermediate release to fix the Clang 6.0 build errors (#118)

version 0.8.6:

ft <- fst("1.fst")

ft[1:1000, .(ColA, ColB)]  # on-disk row subsetting + column selection

ft[ColA > 50, .(ColA)]  # on-disk subsetting using simple expression + column selection

ft[ColA == median(ColB), .(ColB)]  # subsetting using custom expression + column selection

ft[ColA == ColB, .(ColSum = ColA + ColB)]  # subsetting + compute on column selection

Note that there is no grouping functionality in this basic interface (yet), but there will be:

version 0.8.8:

version 0.8.8 and later

Later features (in random order):

This list is subject to a lot of change depending on features and issues requested/reported by users of the fst package :-)
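Until the on-disk interface planned for version 0.8.6 is available, similar results can be approximated with `read_fst()`. The sketch below is only an illustration: the sample data and the temporary file are made up, and the filtering happens in memory with data.table after loading the selected columns, so it is not the on-disk subsetting described above.

```r
library(fst)
library(data.table)

# Hypothetical sample data; the column names mirror the examples above.
path <- tempfile(fileext = ".fst")
set.seed(1)
df <- data.frame(
  ColA = sample(1:100, 5000, replace = TRUE),
  ColB = sample(1:100, 5000, replace = TRUE)
)
write_fst(df, path)

# Row range + column selection: only the requested part is read from disk.
first_rows <- read_fst(path, columns = c("ColA", "ColB"), from = 1, to = 1000)

# Expression-based subsetting: load the needed columns as a data.table,
# then filter and compute in memory (not on disk).
dt <- read_fst(path, columns = c("ColA", "ColB"), as.data.table = TRUE)
filtered <- dt[ColA > 50, .(ColA)]
sums <- dt[ColA == ColB, .(ColSum = ColA + ColB)]
```

The `columns`, `from` and `to` arguments limit what is read from the file; everything after that is ordinary data.table code.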

krlmlr commented 6 years ago

dplyr also implements some operations internally ("hybrid evaluation"), maybe there's potential for code reuse?

shrektan commented 6 years ago

Is it possible to add the ability to write an empty data frame to your milestones? It’s a very useful feature, at least for me. Thanks.

MarcusKlik commented 6 years ago

Hi @krlmlr, that's a very nice feature! And yes, I think that it will greatly speed up processing to be able to use the fst internal representation for common methods (a logical vector takes only 2 bits per element in the internal format for example).

The interface will have to be built around an offline object representing a fst file. All operations should be performed on that abstract object so that they can be parallelized automatically. For summarise, that could be done by processing groups in parallel. But you can't call custom R methods from any thread other than the master thread. That means that data could be loaded from disk in the background, but processing would still be single-threaded. Perhaps we could use the hybrid evaluation to replace known methods with multi-threaded ones. That would be great.

Thanks for the idea!

MarcusKlik commented 6 years ago

Hi @shrektan, thanks for your question! Writing empty tables is on the list for release v0.8.6. A discussion on that can also be found in #99.

Indeed, it would be nice to be able to use the empty table to store metadata. Also, when row binding is supported, you could have a situation where an empty table would be the first result of an apply call for example.

shrektan commented 6 years ago

That’s great. Also, I’m so excited to see the idea of the data.table interface feature. If that’s possible, I can only say a big WOW. Looking forward to that.

xiaodaigh commented 6 years ago

github.com/xiaodaigh/disk.frame

I am experimenting with a data.table-like on-disk data manipulation package backed by fst. Early signs are promising.


MarcusKlik commented 6 years ago

Hi @xiaodaigh, your package is very impressive! Would it be possible to run the disk.frame interface against a parallel cluster (with each node processing some chunks)?

One of the bottlenecks if you have a lot of data is that you can use many threads to read data from file (for example with fst), but running (custom) R methods always has to be done on the (single) main thread. Combining a cluster with a fst back-end would give you the option to tune where you need the most CPU power 👍
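As a rough sketch of that idea (not how disk.frame or fst implement it), each worker of a local `parallel` cluster could read its own row range with `read_fst()` and run R code on it. The data, chunk size and two-worker cluster below are assumptions made for illustration; a real multi-node setup would also need the file on storage that every node can reach.

```r
library(parallel)
library(fst)

# Hypothetical sample file.
path <- tempfile(fileext = ".fst")
df <- data.frame(x = 1:10000, y = rep(1:4, 2500))
write_fst(df, path)

# Split the rows into chunks of (at most) chunk_size rows.
n_rows <- nrow(df)
chunk_size <- 2500
starts <- seq(1, n_rows, by = chunk_size)
chunks <- lapply(starts, function(s) {
  c(from = s, to = min(s + chunk_size - 1, n_rows))
})

# Each worker reads only its own chunk and runs the R code on it.
cl <- makeCluster(2)
partial <- parLapply(cl, chunks, function(rng, path) {
  d <- fst::read_fst(path, from = rng[["from"]], to = rng[["to"]])
  sum(d$x)
}, path = path)
stopCluster(cl)

# Combine the per-chunk results on the main process.
total <- Reduce(`+`, partial)
```

Because every worker is a separate R process, the custom R code runs in parallel across processes even though each process is still limited to its own main thread.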

xiaodaigh commented 6 years ago

Firstly it's a great compliment from the creator of fst! I wouldn't call my package impressive if the bar is fst. Anyway it's still early stages but thanks!

@MarcusKlik > Would it be possible to run the disk.frame interface against a parallel cluster (with each node processing some chunks)?

I think it's possible and it will probably be done at some point. One thing I am not sure about is what working with clusters in R will be like. I know Julia was designed to work well with clusters, so I was thinking about building something in Julia and making it call R (hopefully Julia's native DataFrame will catch up in performance and it can get a native fst reader).

Crazy idea: we can start a company making a fst-backed distributed data manipulation tool that maximizes single machine performance as well.

MarcusKlik commented 6 years ago

That's already a nice sales pitch you have there :-)