waynelapierre opened 1 year ago
`disk.frame` (diskframe.com) and `arrow` can handle large datasets. I haven't looked at `arrow` recently, though.
Hi @waynelapierre, thanks for asking! Yes, with `disk.frame` you can handle larger-than-memory datasets if that's what you need. To include this functionality in the `fstlib` library (the C++ backend of `fst`), and to make `fst` work more like a real database, I had a couple of ideas for this package that might be worth exploring:

- store a dataset as multiple `fst` files in a separate folder
- make sure that the use of a folder (instead of a single `fst` file) is transparent to the user
- provide `dplyr` and / or `data.table` interfaces

The big advantage of using folders instead of single files for general operation is that on-disk sorting and merging requires storage of temporary (`fst`) files.
Also, operations like row binding or adding columns can be done on multiple files without the need to physically copy data. And with multiple files we can have more threads working on IO, which would speed up read and write times (and this should work even if one of the arguments of a merge, for example, is an in-memory table).
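As a rough illustration of the folder idea, here is a minimal sketch in R. It uses the existing `fst::write_fst()` / `fst::read_fst()` API; the helper names, the chunked folder layout, and the chunk count are hypothetical, not part of `fst` itself:

```r
library(fst)

# Hypothetical helper: store a data frame as multiple fst files in a folder,
# one chunk per file, so later operations can work per chunk (and in parallel).
write_fst_folder <- function(df, folder, chunks = 4) {
  dir.create(folder, showWarnings = FALSE)
  idx <- cut(seq_len(nrow(df)), breaks = chunks, labels = FALSE)
  for (i in seq_len(chunks)) {
    write_fst(df[idx == i, , drop = FALSE],
              file.path(folder, sprintf("chunk_%03d.fst", i)))
  }
}

# Hypothetical helper: read every chunk back and row-bind the results.
# Note that "row binding" two such datasets on disk would only mean adding
# more files to the folder -- no data needs to be physically copied.
read_fst_folder <- function(folder) {
  files <- list.files(folder, pattern = "\\.fst$", full.names = TRUE)
  do.call(rbind, lapply(files, read_fst))
}
```

A transparent interface would hide these helpers behind a class whose `dplyr` / `data.table` methods dispatch per file, which is where the multi-threaded IO gains would come from.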
These are just some ideas which could speed up `fst` when faster PCIe 5.0 SSDs hit the market later this year, and could solve some feature requests on `fst` that cannot really be solved effectively with single-file datasets 😸
Seems like a great package for handling large datasets.