Is there any indexing possible

Courvoisier13 commented 4 years ago

Hi,

I looked and I couldn't find if there was an issue about this before. But is there a possibility or a plan to create indexes to fst files to make searching and filtering faster. It would make the fst::fst function so much more powerful and would avoid having to create folders and splitting the files to make access to a subset faster. There are issues talking about filtering, selecting and speed but nothing on indexing. I think that would be a game changer.

What an incredibly useful package. Please keep it up.

MarcusKlik commented 4 years ago

Hi @Courvoisier13, yes, the option to add indices to a fst file would be great!

To the user, these indices could work like data.table's secondary indices and can be stored in two or more extra columns for each index (the keys and corresponding row numbers).

Because the tabular data is on-disk rather than in-memory, the design should be a bit different from the in-memory data.table I think (just thinking out loud here :-))

A fst file can be read-only. If the user still wants to compute indices, these must be stored in a separate (fst) file, in some user accessible location. If the user does have write access, it might still be better to store the index separately because in that case the index can be removed more easily (without making a full copy of the original file).
For optimal performance, data in the fst file must be accessed 'as sequentially as possible'. Therefore, we need an extra sorting step after the index is read (see below).

So, when the user does:

data.frame(X = 1:10, Y = sample(LETTERS, 10)) %>%
  write_fst("some_data.fst")

# write an index to file 'some_data.ifst1'
ft <- fst("some_data.fst") %>%
  arrange(X)

# subset using index
ft %>%
  filter(X > 3 & X < 7)

an extra file can be created (_somedata.ifst1) were the index is stored (2 columns in this case). Subsequent indices would create additional files. When a subset is requested from the fst file, a binary search is performed on the index file to get a vector of row indexes corresponding to the required subset.

But after this step, the result must be sorted to ensure that these rows can be read sequentially. So instead of using a row vector, we need a reverse-row vector, mapping the original rows to the target rows. The sorting must be done in-parrallel to the reading for large subsets.

The good thing is that the original fst file will be preserved and can be copied with or without the index files. And, as you say, subsetting like this will be very fast for small subsets!

Additionally, if we have a good mechanism for indexing, grouping could be a logical next step. Then, grouping operations could be performed without actually reordering data in the fst file, which would also be a very powerful feature I think.

thanks for your feature request!

Courvoisier13 commented 4 years ago

Thanks @MarcusKlik for the detailed analysis. The way I understand it is some sort of index "light" to allow for easy filtering on those indices, and next step would be grouping. Looking forward to your updates :)

jpcirrus commented 3 years ago

First, many thanks for fst. Now an essential part of my R workflow.

Would this mean that data.table secondary indices would be preserved between fst writes and reads, as with the current data.table key?

liorso commented 2 years ago

Hi @MarcusKlik, I would like to help with this feature, I think it will be very helpful. I think the first step should be supporting (by fstcore) retrieving rows by a vector of rows instead of a range of rows. Would like to hear if you already thought about it and any other advice.

fstpackage / fst

Is there any indexing possible #239