Closed CodeLionX closed 6 years ago
RowRelation
Blog post talking about memory consumption of different scala/java collections by Haoyi (you should really know him when using scala :wink: ): http://www.lihaoyi.com/post/BenchmarkingScalaCollections.html He writes:
Small
Vector
s have 3-5x as much memory overhead as smallArray
s: a 1-element vector takes a 1/5th of a kilobyte of memory!For size 16 the overhead shrinks to ~30%, and for size 1,048,576 the overhead is down to about 5%. While the internal machinery means small
Vector
s use memory wastefully, for largerVector
s the overhead compared to a (boxed)Array
[scala.immutable.Array[java.lang.Object]
] is negligible.
- Annotations by me in
[]
Table: Memory use of Arrays shows the difference of array sizes when using various data types.
The memory usage of the various unboxed Arrays of primitives is an order of magnitude less than that of the boxed Array[Object]. This makes sense, as a Array[Object] needs to store a 4-byte pointer to each entry, which itself is an object with a 16-byte object header [...].
Vector
s inside of Relation
, because we have Vector[Vector[Object]]
(n x m, n: #rows in relation, m: #cols in relation, usually we have n Vector[Object]
with m < 16 :arrow_right: ~4n x overhead).Vector[Any]
== Vector[java.lang.Object]
, which takes a 4 Byte pointer for each entry regardless of the real data type:arrow_right: Vector[Vector[Any]]
is in terms of memory a bad container
:arrow_right: Vector[Any]
with size nm and custom indexing should be much much better (only reasonable slight overhead to Array[Object]
), but needs a lot of thought for inserts and updates
:arrow_right: Array[Array[Any]]
could reduce memory overhead by a factor of 3-5
:arrow_right: column-wise storage should be much better as type info could be used to store data in Array
of Array[<primitive type>]
s with m x n
RowRelation
Using medium dataset generated with default values of generateData.sh
-script. Every size is reported in Bytes.
Datatype | Instances | Instance Size | Total Size | Retained Size | |
---|---|---|---|---|---|
StringHolder instance |
String |
1 | 24 | 24 | 15 407 388 |
Whole Heap Dump | String |
71 250 | 20 783 833 | 20 783 833 | |
RowRelation instances |
Seq[Seq[Any]] |
4 808 | 40 | 192 320 | 118 380 768 |
Whole Heap Dump | Seq[Seq[Any]] |
1 837 776 | 124 267 137 | 124 267 137 | |
RowRelation instances |
Vector[Vector[Any]] |
4 808 | 40 | 192 320 | 114 578 447 |
Whole Heap Dump | Vector[Vector[Any]] |
1 624 204 | 120 494 473 | 120 494 473 | |
RowRelation instances |
Array[Array[Any]] |
4 808 | 40 | 192 320 | 42 412 251 |
Whole Heap Dump | Array[Array[Any]] |
1 391 541 | 51 201 296 | 51 201 296 | |
Dactor instances |
Dactor with Array[Array[Any]] |
2 357 | Ø 85 | 200 276 | 43 681 538 |
Whole Heap Dump | Dactor with Array[Array[Any]] |
1 464 513 | 52 557 029 | 52 557 029 |
:arrow_right: No big difference between Seq
and Vector
as Vector
is the default implementation of Seq
s and they are small wrappers around it (see # instances of whole heap dump)
:arrow_right: RowRelation
s with [Vector[Vector[Any]]
need nearly 3x more memory than with Array[Array[Any]]
:arrow_right: Only 3x more space required for RowRelation
s with Array
datatype than for simple String
containing all data
:arrow_right: For the whole Heap Dump, we can reduce the heap size by factor 2.
:arrow_right: 2 357 Dactor
s only increase the size for storing the data in memory by roughly 1MB
RowRelation
with a double-clickRowRelation
instances and their membersBig 👍 on your analysis and workflow description for measuring the retained sizes of only our RowRelation
instances.
The improvement of using Array[Array[Any]]
is substantial, but, as you've said in the first post, I think using a column-oriented store internally to be able to make use specific types of a column will improve memory usage even more. I think this is out of scope for the (very) immediate future/present, but I think this should be on our roadmap Relation
and memory-wise.
Does using arrays instead of vectors remove any functionality that we rely on at the moment? I don't think so, but want to make sure since you rewrote the internal storage of RowRelation
.
Thanks for looking into this and for the blog resource, bookmarked it already!
The change from Vector
to Array
did not change the code inside our Relation
s significantly as you can see in PR #115 :sunglasses:
Should we create an issue regarding the change to column-oriented storage?
I uploaded the heap dumps to Drive (click to view). They can be viewed with VisualVM:
File > Load > (Select Heap Dump as File type) > Select File > Open
It is very interesting; you can even see the individual overhead of Akka Actors of the different Dactor
-types.
merging without review as we already discussed the results and the approach multiple times
Proposed Changes
Use this PR to discuss results and optimizations
Related