CodeLionX / actordb

Actor Database System Framework using Akka
MIT License
0 stars 2 forks source link

Memory Experiments v2 #113

Closed CodeLionX closed 6 years ago

CodeLionX commented 6 years ago

Proposed Changes

Use this PR to discuss results and optimizations

Related

CodeLionX commented 6 years ago

Theoretical Analysis of Memory Overhead of RowRelation

Blog post talking about memory consumption of different scala/java collections by Haoyi (you should really know him when using scala :wink: ): http://www.lihaoyi.com/post/BenchmarkingScalaCollections.html He writes:

Small Vectors have 3-5x as much memory overhead as small Arrays: a 1-element vector takes a 1/5th of a kilobyte of memory!

For size 16 the overhead shrinks to ~30%, and for size 1,048,576 the overhead is down to about 5%. While the internal machinery means small Vectors use memory wastefully, for larger Vectors the overhead compared to a (boxed) Array [scala.immutable.Array[java.lang.Object]] is negligible.

  • Annotations by me in []

Table: Memory use of Arrays shows the difference of array sizes when using various data types.

The memory usage of the various unboxed Arrays of primitives is an order of magnitude less than that of the boxed Array[Object]. This makes sense, as a Array[Object] needs to store a 4-byte pointer to each entry, which itself is an object with a 16-byte object header [...].

  1. we use a lot of small Vectors inside of Relation, because we have Vector[Vector[Object]] (n x m, n: #rows in relation, m: #cols in relation, usually we have n Vector[Object] with m < 16 :arrow_right: ~4n x overhead).
  2. we use Vector[Any] == Vector[java.lang.Object], which takes a 4 Byte pointer for each entry regardless of the real data type

:arrow_right: Vector[Vector[Any]] is in terms of memory a bad container :arrow_right: Vector[Any] with size nm and custom indexing should be much much better (only reasonable slight overhead to Array[Object]), but needs a lot of thought for inserts and updates :arrow_right: Array[Array[Any]] could reduce memory overhead by a factor of 3-5 :arrow_right: column-wise storage should be much better as type info could be used to store data in Array of Array[<primitive type>]s with m x n

CodeLionX commented 6 years ago

Practical Analysis of Memory Overhead of RowRelation

Using medium dataset generated with default values of generateData.sh-script. Every size is reported in Bytes.

Datatype Instances Instance Size Total Size Retained Size
StringHolder instance String 1 24 24 15 407 388
Whole Heap Dump String 71 250 20 783 833 20 783 833
RowRelation instances Seq[Seq[Any]] 4 808 40 192 320 118 380 768
Whole Heap Dump Seq[Seq[Any]] 1 837 776 124 267 137 124 267 137
RowRelation instances Vector[Vector[Any]] 4 808 40 192 320 114 578 447
Whole Heap Dump Vector[Vector[Any]] 1 624 204 120 494 473 120 494 473
RowRelation instances Array[Array[Any]] 4 808 40 192 320 42 412 251
Whole Heap Dump Array[Array[Any]] 1 391 541 51 201 296 51 201 296
Dactor instances Dactor with Array[Array[Any]] 2 357 Ø 85 200 276 43 681 538
Whole Heap Dump Dactor with Array[Array[Any]] 1 464 513 52 557 029 52 557 029

:arrow_right: No big difference between Seq and Vector as Vector is the default implementation of Seqs and they are small wrappers around it (see # instances of whole heap dump) :arrow_right: RowRelations with [Vector[Vector[Any]] need nearly 3x more memory than with Array[Array[Any]] :arrow_right: Only 3x more space required for RowRelations with Array datatype than for simple String containing all data :arrow_right: For the whole Heap Dump, we can reduce the heap size by factor 2. :arrow_right: 2 357 Dactors only increase the size for storing the data in memory by roughly 1MB

Test Procedure

srfc commented 6 years ago

Big 👍 on your analysis and workflow description for measuring the retained sizes of only our RowRelation instances.

The improvement of using Array[Array[Any]] is substantial, but, as you've said in the first post, I think using a column-oriented store internally to be able to make use specific types of a column will improve memory usage even more. I think this is out of scope for the (very) immediate future/present, but I think this should be on our roadmap Relation and memory-wise.

Does using arrays instead of vectors remove any functionality that we rely on at the moment? I don't think so, but want to make sure since you rewrote the internal storage of RowRelation.

Thanks for looking into this and for the blog resource, bookmarked it already!

CodeLionX commented 6 years ago

The change from Vector to Array did not change the code inside our Relations significantly as you can see in PR #115 :sunglasses:

Should we create an issue regarding the change to column-oriented storage?

CodeLionX commented 6 years ago

I uploaded the heap dumps to Drive (click to view). They can be viewed with VisualVM:

File > Load > (Select Heap Dump as File type) > Select File > Open

It is very interesting; you can even see the individual overhead of Akka Actors of the different Dactor-types.

CodeLionX commented 6 years ago

merging without review as we already discussed the results and the approach multiple times