Memory Experiments v2 - Githubissues

CodeLionX commented 6 years ago

Proposed Changes

Add Memory Benchmark Code
Add Memory Benchmark Results

Use this PR to discuss results and optimizations

Small Vectors have 3-5x as much memory overhead as small Arrays: a 1-element vector takes a 1/5th of a kilobyte of memory!

For size 16 the overhead shrinks to ~30%, and for size 1,048,576 the overhead is down to about 5%. While the internal machinery means small Vectors use memory wastefully, for larger Vectors the overhead compared to a (boxed) Array [scala.immutable.Array[java.lang.Object]] is negligible.

Annotations by me in []

Table: Memory use of Arrays shows the difference of array sizes when using various data types.

The memory usage of the various unboxed Arrays of primitives is an order of magnitude less than that of the boxed Array[Object]. This makes sense, as a Array[Object] needs to store a 4-byte pointer to each entry, which itself is an object with a 16-byte object header [...].

we use a lot of small Vectors inside of Relation, because we have Vector[Vector[Object]] (n x m, n: #rows in relation, m: #cols in relation, usually we have n Vector[Object] with m < 16 :arrow_right: ~4n x overhead).
we use Vector[Any] == Vector[java.lang.Object], which takes a 4 Byte pointer for each entry regardless of the real data type

:arrow_right: Vector[Vector[Any]] is in terms of memory a bad container :arrow_right: Vector[Any] with size nm and custom indexing should be much much better (only reasonable slight overhead to Array[Object]), but needs a lot of thought for inserts and updates :arrow_right: Array[Array[Any]] could reduce memory overhead by a factor of 3-5 :arrow_right: column-wise storage should be much better as type info could be used to store data in Array of Array[<primitive type>]s with m x n

CodeLionX commented 6 years ago

Practical Analysis of Memory Overhead of `RowRelation`

Using medium dataset generated with default values of generateData.sh-script. Every size is reported in Bytes.

	Datatype	Instances	Instance Size	Total Size	Retained Size
`StringHolder` instance	`String`	1	24	24	15 407 388
Whole Heap Dump	`String`	71 250		20 783 833	20 783 833

`RowRelation` instances	`Seq[Seq[Any]]`	4 808	40	192 320	118 380 768
Whole Heap Dump	`Seq[Seq[Any]]`	1 837 776		124 267 137	124 267 137
`RowRelation` instances	`Vector[Vector[Any]]`	4 808	40	192 320	114 578 447
Whole Heap Dump	`Vector[Vector[Any]]`	1 624 204		120 494 473	120 494 473
`RowRelation` instances	`Array[Array[Any]]`	4 808	40	192 320	42 412 251
Whole Heap Dump	`Array[Array[Any]]`	1 391 541		51 201 296	51 201 296

`Dactor` instances	`Dactor` with `Array[Array[Any]]`	2 357	Ø 85	200 276	43 681 538
Whole Heap Dump	`Dactor` with `Array[Array[Any]]`	1 464 513		52 557 029	52 557 029

:arrow_right: No big difference between Seq and Vector as Vector is the default implementation of Seqs and they are small wrappers around it (see # instances of whole heap dump) :arrow_right: RowRelations with [Vector[Vector[Any]] need nearly 3x more memory than with Array[Array[Any]] :arrow_right: Only 3x more space required for RowRelations with Array datatype than for simple String containing all data :arrow_right: For the whole Heap Dump, we can reduce the heap size by factor 2. :arrow_right: 2 357 Dactors only increase the size for storing the data in memory by roughly 1MB

Test Procedure

use VisualVM to monitor Benchmark execution
after load finished, perform GC
Create Heap Dump
navigate to Classes and select RowRelation with a double-click
let VisualVM calculate the Retained Sizes (this will take awhile relative to the heap size)
the retained size corresponds to the total memory used by all RowRelation instances and their members

srfc commented 6 years ago

Big 👍 on your analysis and workflow description for measuring the retained sizes of only our RowRelation instances.

The improvement of using Array[Array[Any]] is substantial, but, as you've said in the first post, I think using a column-oriented store internally to be able to make use specific types of a column will improve memory usage even more. I think this is out of scope for the (very) immediate future/present, but I think this should be on our roadmap Relation and memory-wise.

Does using arrays instead of vectors remove any functionality that we rely on at the moment? I don't think so, but want to make sure since you rewrote the internal storage of RowRelation.

Thanks for looking into this and for the blog resource, bookmarked it already!

CodeLionX commented 6 years ago

The change from Vector to Array did not change the code inside our Relations significantly as you can see in PR #115 :sunglasses:

Should we create an issue regarding the change to column-oriented storage?

CodeLionX commented 6 years ago

I uploaded the heap dumps to Drive (click to view). They can be viewed with VisualVM:

File > Load > (Select Heap Dump as File type) > Select File > Open

It is very interesting; you can even see the individual overhead of Akka Actors of the different Dactor-types.

CodeLionX commented 6 years ago

merging without review as we already discussed the results and the approach multiple times

CodeLionX / actordb

Memory Experiments v2 #113

Proposed Changes

Use this PR to discuss results and optimizations

Related

Theoretical Analysis of Memory Overhead of `RowRelation`

Practical Analysis of Memory Overhead of `RowRelation`

Test Procedure

CodeLionX / actordb

Memory Experiments v2 #113

Proposed Changes

Use this PR to discuss results and optimizations

Related

Theoretical Analysis of Memory Overhead of RowRelation

Practical Analysis of Memory Overhead of RowRelation

Test Procedure

Theoretical Analysis of Memory Overhead of `RowRelation`

Practical Analysis of Memory Overhead of `RowRelation`