DiskFrame / disk.frame

Fast Disk-Based Parallelized Data Manipulation Framework for Larger-than-RAM Data
https://diskframe.com
Other
594 stars, 40 forks

Indexing - e.g. bloomfilter #240

Open xiaodaigh opened 4 years ago

xiaodaigh commented 4 years ago

Both issues #211 and #200 seem related to this enhancement. I have a similar problem, in that it would be nice to have index columns that can effectively spread across multiple chunks. In my case, I have some large datasets where 'sub-shards' would be useful, as my groups are too big to practically fit in a single chunk. I also have coordinates for which it would be nice to quickly check which chunks contain the values I'm trying to look up, as well as to exploit fst's random access feature to read out only the sections of interest based on their indices.

Originally posted by @RichardJActon in https://github.com/xiaodaigh/disk.frame/issues/102#issuecomment-567026839
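
To illustrate the "quick check of which chunks the values are in" idea, here is a minimal sketch of a per-chunk Bloom filter index. It is written in Python purely for illustration. It is not disk.frame's API and not necessarily how #245 implements it. The names `BloomFilter`, `build_chunk_index`, and `candidate_chunks`, and the parameters `n_bits` and `n_hashes`, are all hypothetical. Each chunk is stood in for by a plain list of values from the would-be index column.

```python
import hashlib


class BloomFilter:
    """Hypothetical fixed-size Bloom filter: no false negatives, some false positives."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, value):
        # Derive n_hashes bit positions by salting one hash function with a seed.
        data = repr(value).encode("utf-8")
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                data, digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


def build_chunk_index(chunks, **kwargs):
    """Build one Bloom filter per chunk (assumed layout: one filter per chunk file)."""
    index = []
    for chunk in chunks:
        bf = BloomFilter(**kwargs)
        for value in chunk:
            bf.add(value)
        index.append(bf)
    return index


def candidate_chunks(index, value):
    """Return the positions of chunks that may contain `value`."""
    return [i for i, bf in enumerate(index) if bf.might_contain(value)]


# Toy data: the value 3 appears in chunks 0 and 2.
chunks = [[1, 2, 3], [100, 200, 300], [3, 400, 500]]
index = build_chunk_index(chunks)
hits = candidate_chunks(index, 3)
```

Only the chunks listed in `hits` would need to be read from disk. Because a Bloom filter can return false positives, each candidate chunk still has to be filtered for the actual value after loading, but a chunk the filter rules out can be skipped safely.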

xiaodaigh commented 4 years ago

Implemented a Bloom filter in #245.