fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format
GNU Affero General Public License v3.0

Is this package still being maintained? #44

Open waynelapierre opened 1 year ago

waynelapierre commented 1 year ago

Seems like a great package for handling large datasets.

xiaodaigh commented 1 year ago

diskframe.com and arrow can handle large datasets. I haven't looked at arrow recently though.

MarcusKlik commented 1 year ago

Hi @waynelapierre, thanks for asking! Yes, with diskframe you can handle larger-than-memory datasets if that's what you need. To include this functionality in the fstlib library (the C++ backend of fst), and to make fst work more like a real database, I had a couple of ideas for this package that might be worth exploring:

The big advantage of using folders instead of single files for general operation is that on-disk sorting and merging require storing temporary (fst) files, which a folder can hold. Also, operations like row binding or adding columns can be done on multiple files without physically copying data. And with multiple files we can have more threads working on IO, which would speed up read and write times. This should work even if one of the arguments (of a merge, for example) is an in-memory table.

These are just some ideas that could speed up fst when faster PCIe 5.0 SSDs hit the market later this year, and that could address some fst feature requests that cannot really be solved effectively with single-file datasets 😸
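
To make the multi-file idea a bit more concrete, here is a minimal Python sketch. It is not fst or fstlib code: `FolderTable`, `write_chunk`, `read_chunk` and `demo` are hypothetical names, and CSV files stand in for fst files. It only illustrates two of the points above, assuming a table is an ordered list of chunk files: row binding can be done by joining file lists without copying data, and reads can be spread over several threads.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_chunk(path, rows):
    """Write one chunk file (CSV stands in for an fst file here)."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def read_chunk(path):
    """Read one chunk file back into a list of rows."""
    with open(path, newline="") as f:
        return [row for row in csv.reader(f)]


class FolderTable:
    """Hypothetical table that is just an ordered list of chunk files."""

    def __init__(self, chunk_paths):
        self.chunk_paths = list(chunk_paths)

    def row_bind(self, other):
        # Row binding only concatenates the file lists; no data is copied.
        return FolderTable(self.chunk_paths + other.chunk_paths)

    def read(self, max_workers=4):
        # One task per chunk file, so several threads can do IO at once.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            chunks = list(pool.map(read_chunk, self.chunk_paths))
        # pool.map preserves input order, so rows come back in table order.
        return [row for chunk in chunks for row in chunk]


def demo():
    """Build two single-chunk tables, row bind them, and read the result."""
    with tempfile.TemporaryDirectory() as folder:
        a_path = Path(folder) / "a.csv"
        b_path = Path(folder) / "b.csv"
        write_chunk(a_path, [["1", "x"], ["2", "y"]])
        write_chunk(b_path, [["3", "z"]])
        combined = FolderTable([a_path]).row_bind(FolderTable([b_path]))
        return combined.read()
```

Calling `demo()` row binds the two tables and reads all three rows back in order. Whether threads actually speed things up in a real implementation would depend on the storage device and on how much of the work is decompression rather than IO.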