Closed martinblostein closed 6 years ago
Hi @martinblostein, thanks for the merge request, nice work!
For performance reasons, I think it would be better to add a few C methods to the package. For example, the checks now done with all (here) and identical / seq (here) are relatively slow and make copies in memory. If we want to keep the memory requirements low (and the speed high), we can do those checks in C without any overhead.
But as we said before, functionality first, performance tuning later ! :-)
For example, the checks now done with methods all (here) and identical / seq (here) are relatively slow and take memory copies.
How have you determined this? I can't find any indication in the source or using tracemem that all or identical copy their data.
Hi @martinblostein, at first sight, the test seems memory efficient, for example:
# clear memory
rm(list = ls())
gc()
# allocate 2 GB
Sys.sleep(5)
pryr::mem_change(i <- 1:5e8)
#> 2 GB
# test equivalence
Sys.sleep(5)
pryr::mem_change(identical(i, seq(1, length(i))))
#> 800 B
The test doesn't seem to require any extra memory. But when you look at a capture of the actual OS memory used:
The little bump on the top is the identical / seq test. The bump reflects the temporary vector generated by seq. When i just fits into memory, things get worse:
To generate the temp vector, the OS has to swap memory using an on-disk page file. At that point, the CPU also has to work much harder:
Apparently the CPU has to work hard to compress memory or to write/compress the page file. To avoid the temp vector generation, we could create a small C method that performs the test without the need of an intermediate...
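A minimal sketch of what such a check could look like, assuming the index is a plain integer vector. The name is_full_sequence is hypothetical and not taken from the package. It only illustrates the idea of comparing each element against its expected position in a single pass, so that no seq-style temporary vector is needed:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of the package): returns true when
 * x[0..n-1] holds exactly 1, 2, ..., n. The array is scanned in place
 * and the loop stops at the first mismatch, so no temporary vector
 * is allocated. */
bool is_full_sequence(const int *x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Compare in 64-bit arithmetic so i + 1 cannot overflow int. */
        if ((long long)x[i] != (long long)i + 1) {
            return false;
        }
    }
    return true;
}
```

Note that this sketch treats an empty vector as a match, which is a design choice and differs from comparing against seq(1, 0) in R. Calling it from R would still need a wrapper (for example via .Call), which is left out here.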
Thanks for updating your pull request, it's merged now!
Well yes, generating a new vector requires new memory. But all and identical themselves don't. So the check here doesn't require anything memory-wise. For the other check, I see your point. We could write a new C function or use something like x[1] == 1 && all(diff(x) == 1L). But I haven't profiled this ;) !
edit: Ah, that doesn't help, of course. Time to stop trying to be clever with R, haha.
Ha @martinblostein, yes, R sure likes his copies :-)
Thanks for the pull request, I submitted a new issue for the C method!
This addresses #33, increases efficiency, and keeps a NULL slice map for fsttables that map to entire tables on disk.