Make full selections of fsttable return identical tables

fstpackage / fsttable

An interface to fast on-disk data tables stored with the fst format

GNU Affero General Public License v3.0

27 stars 4 forks source link

Make full selections of fsttable return identical tables #34

Closed martinblostein closed 6 years ago

martinblostein commented 6 years ago

This addresses #33, increases efficiency, and keeps a NULL slice map for fsttables that map to entire tables on disk.

MarcusKlik commented 6 years ago

Hi @martinblostein, thanks for the merge request, nice work!

For performance reasons, I think it would be better to add a few C methods to the package. For example, the checks now done with methods all (here) and identical / seq (here) are relatively slow and take memory copies. If we want to keep the memory requirements low (and the speed high) we can do those checks in C without any overhead.

But as we said before, functionality first, performance tuning later ! :-)

martinblostein commented 6 years ago

For example, the checks now done with methods all (here) and identical / seq (here) are relatively slow and take memory copies.

How have you determined this? I can't find any indication in the source or using tracemem that all or identical copy their data.

MarcusKlik commented 6 years ago

Hi @martinblostein, on first sight, the test seems memory efficient, for example:

# clear memory
rm(list = ls())
gc()

# allocate 2 GB
Sys.sleep(5)
pryr::mem_change(i <- 1:5e8)
#> 2 GB

# test equivalence
Sys.sleep(5)
pryr::mem_change(identical(i, seq(1, length(i))))
#> 800 B

The test doesn't seem to require any extra memory. But when you look at a capture of the actual OS memory used:

The little bump on the top is the identical/seq test. The bump reflects the temporary vector generated by seq. When i just fits into memory, things get worse:

To generate the temp vector, the OS has to swap memory using an on-disk page file. At that point, also the CPU has to work much harder:

Apparently the CPU has to work hard to compress memory or to write/compress the page file. To avoid the temp vector generation, we could create a small C method that performs the test without the need of an intermediate...

MarcusKlik commented 6 years ago

Thanks for updating your pull request, it's merged now!

martinblostein commented 6 years ago

Well yes, generating a new vector requires new memory. But and all and identical themselves don't. So the check here doesn't require anything memory-wise. For the other check, I see your point, We could write a new C function or use something like x[1] == 1 && all(diff(x) == 1L). But I haven't profiled this ;) !

edit: Ah, that doesn't help, of course. Time to stop trying to be clever with R, haha.

MarcusKlik commented 6 years ago

Ha @martinblostein, yes, R sure likes his copies :-)

Thanks for the pull request, I submitted a new issue for the C method!