fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format
GNU Affero General Public License v3.0

Parallel methods and operators #2

Open MarcusKlik opened 6 years ago

MarcusKlik commented 6 years ago

To keep the memory footprint very low, fsttable could provide a set of operators and methods that are parallel implementations of their standard counterparts. For example, p_mult would be the parallel implementation of *. When the user specifies these parallel methods in a call to the fsttable interface, processing is done on multiple threads while the data is being loaded.

This would increase speed significantly, and for interactive use only a small amount of data needs to be loaded from file. For example, because fsttable understands the method p_mult and knows that it works element by element, it only has to read the first few rows of an fst file to display the result of:

print(ft[, .(X, Y = p_mult(A, B))])

So printing this result interactively requires almost no memory. Also, when the full column Y has to be calculated, multiple threads can perform the operation.
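A minimal sketch of why an element-wise operator makes the preview cheap. Everything here is hypothetical and written in base R: `make_source` is a stand-in for an on-disk table that counts the rows it hands out, `preview` is an illustrative helper, and `p_mult` is a plain wrapper around `*`. None of this is the fsttable implementation.

```r
# Stand-in for an on-disk table (assumption): a closure over a data.frame
# that records how many rows were actually read.
make_source <- function(df) {
  rows_read <- 0L
  list(
    read = function(columns, from, to) {
      rows_read <<- rows_read + (to - from + 1L)
      df[from:to, columns, drop = FALSE]
    },
    rows_read = function() rows_read,
    nrow = nrow(df)
  )
}

# Hypothetical element-wise operator: the result for row i depends only on
# row i of its inputs, so it can be evaluated on any subset of rows.
p_mult <- function(a, b) a * b

# Preview Y = p_mult(A, B) by reading only the first n rows.
preview <- function(src, n = 5L) {
  n <- min(n, src$nrow)
  chunk <- src$read(c("X", "A", "B"), 1L, n)
  data.frame(X = chunk$X, Y = p_mult(chunk$A, chunk$B))
}

src <- make_source(data.frame(X = 1:1000, A = 1:1000, B = rep(2, 1000)))
head_result <- preview(src)
```

After the call, `src$rows_read()` reports 5 rows read out of 1000, which is the property the interactive print relies on.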

For custom methods this can't be done, because fsttable cannot know whether they calculate per element or need the whole vector, so it always has to read all columns fully before calling the method. However, aggregate methods built from parallel methods could still be used for parallel calculations.
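As a rough illustration of an aggregate built from a parallel method, the sketch below sums an element-wise product chunk by chunk. The name `p_sum_mult` and the `chunk_size` parameter are assumptions made for this example. The chunks are processed sequentially here to keep the code in base R; because they are independent, each one could in principle be handed to a separate thread.

```r
# Hypothetical element-wise operator (plain wrapper around `*`).
p_mult <- function(a, b) a * b

# Hypothetical aggregate: sum(a * b) computed from per-chunk partial sums,
# so the full product vector is never materialised at once.
p_sum_mult <- function(a, b, chunk_size = 1000L) {
  stopifnot(length(a) == length(b))
  if (length(a) == 0L) return(0)
  starts <- seq(1L, length(a), by = chunk_size)
  partials <- vapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1L, length(a))
    as.numeric(sum(p_mult(a[idx], b[idx])))
  }, numeric(1))
  # Combining step: partial sums add up to the total.
  sum(partials)
}

a <- 1:10
b <- rep(2, 10)
total <- p_sum_mult(a, b, chunk_size = 3L)
```

The result matches `sum(a * b)`; only the combining step (adding the partial sums) needs to see more than one chunk.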

MarcusKlik commented 4 years ago

The tableproxy object should allow for these parallel operations, perhaps by just specifying a grouping mask and then letting the proxy decide whether it can process that in parallel.
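One possible reading of the grouping-mask idea, sketched in base R under stated assumptions: the mask is taken to be a vector of group ids aligned with the rows, and the aggregate is a sum, which can be computed per chunk and then combined. `grouped_sum_chunked` is a hypothetical helper, not part of tableproxy, and the chunks are processed sequentially here although they are independent of each other.

```r
# Hypothetical helper: grouped sum computed from per-chunk partial results.
# `group` plays the role of the grouping mask (assumption: one id per row).
grouped_sum_chunked <- function(values, group, chunk_size = 1000L) {
  stopifnot(length(values) == length(group), length(values) > 0L)
  starts <- seq(1L, length(values), by = chunk_size)
  partials <- lapply(starts, function(s) {
    idx <- s:min(s + chunk_size - 1L, length(values))
    part <- tapply(values[idx], group[idx], sum)
    data.frame(group = names(part), total = as.numeric(part))
  })
  # Combine: add up the partial sums that belong to the same group.
  all_parts <- do.call(rbind, partials)
  tapply(all_parts$total, all_parts$group, sum)
}

values <- c(1, 2, 3, 4, 5, 6)
group  <- c("a", "b", "a", "b", "a", "b")
res <- grouped_sum_chunked(values, group, chunk_size = 4L)
```

A proxy could apply this scheme whenever the requested aggregate is known to be decomposable in this way, and fall back to reading the full columns otherwise.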