Open MarcusKlik opened 6 years ago
dplyr also implements some operations internally ("hybrid evaluation"), maybe there's potential for code reuse?
Is it possible to add the ability to write an empty data frame to your milestones? It’s a very useful feature, at least for me, thanks.
Hi @krlmlr, that's a very nice feature! And yes, I think that it will greatly speed up processing to be able to use the fst internal representation for common methods (a logical vector takes only 2 bits per element in the internal format, for example).
The interface will have to be built around an offline object representing a fst file. All operations should be performed on that abstract object so that they can be parallelized automatically. For summarise, that could be done by processing groups in parallel. But you can't call custom R methods from any thread other than the master thread. That means that data could be loaded from disk in the background, but processing would still be single-threaded. Perhaps we could use the hybrid evaluation to replace known methods with multi-threaded ones; that would be great.
Thanks for the idea!
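To illustrate the remark above about a logical vector needing only 2 bits per element: a logical has three states (TRUE, FALSE, NA), so two bits are enough, whereas R itself stores each logical as a 32-bit integer. The sketch below is in Python and uses a made-up bit layout; the codes and function names are assumptions for illustration, not fst's actual on-disk format.

```python
from typing import List, Optional

# Hypothetical 2-bit codes (an assumption; fst's real layout may differ).
FALSE_CODE, TRUE_CODE, NA_CODE = 0b00, 0b01, 0b10


def pack_logicals(values: List[Optional[bool]]) -> bytes:
    """Pack True/False/None (None stands in for NA) into 2 bits per element."""
    out = bytearray((len(values) + 3) // 4)  # 4 elements fit in one byte
    for i, v in enumerate(values):
        code = NA_CODE if v is None else (TRUE_CODE if v else FALSE_CODE)
        out[i // 4] |= code << (2 * (i % 4))
    return bytes(out)


def unpack_logicals(packed: bytes, n: int) -> List[Optional[bool]]:
    """Recover the first n elements from a buffer written by pack_logicals."""
    result: List[Optional[bool]] = []
    for i in range(n):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        result.append(None if code == NA_CODE else code == TRUE_CODE)
    return result
```

With this layout five logicals fit in two bytes instead of twenty, and `unpack_logicals(pack_logicals(v), len(v))` returns the original values.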
Hi @shrektan, thanks for your question! Writing empty tables is on the list for release v0.8.6. A discussion on that can also be found in #99.
Indeed, it would be nice to be able to use the empty table to store metadata. Also, when row binding is supported, you could have a situation where an empty table would be the first result of an apply call, for example.
That’s great. Also, I’m so excited to see the idea of the data.table interface feature. If that’s possible, I can only say a big WOW. Looking forward to that.
github.com/xiaodaigh/disk.frame
I am experimenting with a data.table-like on-disk data manipulation package backed by fst. Early signs are promising.
Hi @xiaodaigh, your package is very impressive! Would it be possible to run the disk.frame interface against a parallel cluster (with each node processing some chunks)?
One of the bottlenecks if you have a lot of data is that you can use many threads to read data from file (for example with fst) but running (custom) R methods always has to be done on the (single) main thread. Combining a cluster with a fst back-end would give you the option to tune where you need the most CPU power 👍
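The split described above (many threads reading, one thread running the user's methods) can be sketched roughly as follows. This is Python rather than R, and `load_chunk` is a hypothetical stand-in for reading one chunk of a file; nothing here is fst or disk.frame API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def load_chunk(chunk_id: int) -> List[int]:
    """Hypothetical stand-in for reading one chunk of rows from disk."""
    return list(range(chunk_id * 4, chunk_id * 4 + 4))


def process_chunks(
    chunk_ids: Iterable[int],
    process: Callable[[List[int]], T],
    n_readers: int = 4,
) -> List[T]:
    """Read chunks on background threads, run `process` on the calling thread."""
    with ThreadPoolExecutor(max_workers=n_readers) as pool:
        # Reads are scheduled up front and overlap with each other.
        futures = [pool.submit(load_chunk, cid) for cid in chunk_ids]
        # The user-supplied method only ever runs here, on the calling thread,
        # one chunk at a time and in the original chunk order.
        return [process(f.result()) for f in futures]
```

For example, `process_chunks(range(3), sum)` reads three chunks in the background and sums each one on the main thread. A real implementation would bound how many chunks are held in memory at once; that is left out to keep the sketch short.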
Firstly it's a great compliment from the creator of fst! I wouldn't call my package impressive if the bar is fst. Anyway it's still early stages but thanks!
@MarcusKlik > Would it be possible to run the disk.frame interface against a parallel cluster (with each node processing some chunks)?
I think it's possible and probably will be done at some point. One thing I am not certain about is what working with clusters in R will be like. I know Julia was designed to work well with clusters, so I was thinking about building something in Julia and making it call R (hopefully Julia's native DataFrame will catch up in performance and it can get a native fst reader).
Crazy idea: we can start a company making a fst-backed distributed data manipulation tool that maximizes single machine performance as well.
That's already a nice sales pitch you have there :-)
The features currently planned for fst:
version 0.8.4:
- Intermediate release to fix the Clang 6.0 build errors (#118)
version 0.8.6:
- character columns
- data.table interface: Note that there is no grouping functionality in this basic interface (yet), but there will be.
- Basic dplyr interface:
  - filter
  - select
  - slice
  - collect
  - sample_n and sample_frac (only needs memory for data in the returned sample)
- Hashing of column data
version 0.8.8:
- For the data.table interface:
  - := for column binding to an existing fst file
  - rbindlist to row bind multiple fst files into a single file
- For the dplyr interface:
  - add_row
  - add_column
  - mutate
version 0.8.8 and later
Later features (in random order):
- Add on-disk grouping functionality. That requires on-disk sorting, which can be done using a merge sort algorithm.
- lapply-like functionality creating a fst file using a list of inputs (CSV files, custom methods, etc.)
- interoperability: a) import data from Apache Parquet files b) Python interface c) C++ interface library d) Julia interface, ...
- advanced operations: a) Parallel grouping for specific methods (like +, -, *, /, sum, mean, etc.; these methods need a C++ implementation for parallel operations) b) binary search on table key columns (extremely fast sub-setting of a key range) c) Merge operations on multiple fst files (right join to start with, like in data.table) d) multiple fst files represent a single data set e) a set of fst files can be sorted in parallel into a new set of fst files. This avoids the slow end-phase of sorting algorithms like merge sort. f) user-defined map-reduce operations that can be used on the fst file(s) in parallel. Simple example: a custom mean method using 1) sum and count each chunk 2) take results from 1) to calculate the mean. g) fill a data set range with specific rows from a fst file, overwriting data in-memory (#29).
- performance and security enhancements: a) encryption b) SIMD upgrades to the bit-shifters and pre-serialization filters used in fst c) a plug-in system (C++) for custom compressors to allow users to come up with faster or better compressors d) better character columns compression e) high compression mode for slow IO (network) speeds (#23).
This list is subject to a lot of change depending on features and issues requested/reported by users of the fst package :-)
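As a rough illustration of the user-defined map-reduce idea in item f) above: each chunk is reduced to a (sum, count) pair, and the pairs are then combined into one overall result. Per-chunk sums and counts combine into a mean; a median cannot be recovered from them and would need a different reduction. The sketch is in Python, and all function names are made up for the example.

```python
from functools import reduce
from typing import Iterable, List, Tuple

Partial = Tuple[float, int]  # (sum, count) for one chunk


def map_chunk(chunk: List[float]) -> Partial:
    """Map step: summarise one chunk. Chunks are independent, so this
    step is the part that could run in parallel."""
    return (sum(chunk), len(chunk))


def combine(a: Partial, b: Partial) -> Partial:
    """Reduce step: merge two partial results."""
    return (a[0] + b[0], a[1] + b[1])


def chunked_mean(chunks: Iterable[List[float]]) -> float:
    """Mean over all chunks without holding more than one chunk at a time."""
    total, count = reduce(combine, map(map_chunk, chunks), (0.0, 0))
    if count == 0:
        raise ValueError("cannot take the mean of zero rows")
    return total / count
```

Because `combine` is associative, the partial results can be merged in any grouping, which is what makes the reduction safe to run across chunks or files.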