LionWeb-io / lionweb-repository

Reference implementation of LionWeb repository
Apache License 2.0
2 stars 1 forks source link

Consider common values in optimizations #87

Open ftomassetti opened 3 months ago

ftomassetti commented 3 months ago

After loading a big project (2.5M nodes) I looked at the table and I noticed some values being common.

In the reference table the value {} (no values references) is present in 4475306 out of 5207880 records (86%). In the containment table the value {} (no values references) is present in 359166 out of 4514084 records (22%). In the property table the value false is present in 538840 out of 2490897 records (8%).

Complessively, among these 3 tables, 5373312 records out of 12212861 (44%) have the default value. Is there a way to use this to our advantage? Can we avoid adding records when the default value is present? Can we otherwise speed up insertion?

ftomassetti commented 3 months ago

Also, in my cases there are no annotations, so the field annotations is always {}

ftomassetti commented 3 months ago

I also wonder if we should use a database id for nodes, to avoid using the node_id field, which may be a long string (at least in my case. My nodes ID range from 11 to 538 characters, with a mean of 206)

enikao commented 3 months ago

I also wonder if we should use a database id for nodes, to avoid using the node_id field, which may be a long string (at least in my case. My nodes ID range from 11 to 538 characters, with a mean of 206)

I implemented this in some of my C# optimizations: During serialization, I want to detect duplicates. I store SHA1 hashes of the id as a fixed array of 20 bytes. It saves quite some memory (and also speeds up comparison).

Edit: See here: https://github.com/LionWeb-io/lionweb-csharp/blob/niko/serializer-handler/src/LionWeb-CSharp/core/M1/CompressedId.cs

enikao commented 3 months ago

I think there are two ways of optimizing this, both can be applied independently:

a) We can handle default values differently in our logic, so they never reach the DB. b) I could imagine the DB has a way to "compress" column values internally, maybe this needs "column-mode" tables. I don't know any specifics on PQ, just that such features exist in general.