eqlabs / pathfinder

A Starknet full node written in Rust
https://eqlabs.github.io/pathfinder/

Storage: reduce transaction size #1382

Closed Mirko-von-Leipzig closed 4 months ago

Mirko-von-Leipzig commented 1 year ago

Our database size is growing far too large. Our transactions table starknet_transactions consumes a whopping 20GB on mainnet alone.

Transaction data is currently stored as zstd-compressed json. We should be able to do much better using a binary format of some kind, maybe bincode?

As a first step, it would be good to measure the size differences between candidate encodings. One way of achieving this is to read in and convert each transaction + receipt, totalling the encoded byte sizes.
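A rough sketch of such a measurement harness (hypothetical; `compare_encodings` and the choice of bincode as the candidate are illustrative, not a decided design):

```rust
// Hypothetical harness: total the zstd-compressed size of two candidate
// encodings over the same set of values (e.g. transaction + receipt pairs).
fn compare_encodings<T: serde::Serialize>(txs: &[T]) -> anyhow::Result<()> {
    let (mut json_total, mut bincode_total) = (0usize, 0usize);
    for tx in txs {
        // Current scheme: JSON, then zstd (level 0 selects zstd's default).
        json_total += zstd::encode_all(serde_json::to_vec(tx)?.as_slice(), 0)?.len();
        // Candidate scheme: bincode, then zstd.
        bincode_total += zstd::encode_all(bincode::serialize(tx)?.as_slice(), 0)?.len();
    }
    println!("json/zstd total: {json_total} B, bincode/zstd total: {bincode_total} B");
    Ok(())
}
```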

If an encoding is a clear winner, we then need to add a database migration to perform this re-encoding. Another option if the migration is too costly is to add a version column to the table which determines which encoding the data is using. This would allow new data to use the new encoding, and old data just remains in place.
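A minimal sketch of that version-column read path (the version numbers, function name, and column semantics here are hypothetical):

```rust
// Hypothetical dispatch on a per-row encoding version: old rows stay as
// zstd-compressed JSON, new rows use the new binary encoding.
fn decode_row<T: serde::de::DeserializeOwned>(version: u32, compressed: &[u8]) -> anyhow::Result<T> {
    let raw = zstd::decode_all(compressed)?;
    match version {
        0 => Ok(serde_json::from_slice(&raw)?), // legacy encoding
        1 => Ok(bincode::deserialize(&raw)?),   // new encoding
        v => anyhow::bail!("unknown transaction encoding version {v}"),
    }
}
```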

CHr15F0x commented 1 year ago

Could be useful: https://www.lucidchart.com/techblog/2019/12/06/json-compression-alternative-binary-formats-and-compression-methods/

Mirko-von-Leipzig commented 1 year ago

The big thing that the json comparison misses is that we have bigints that we represent as text in json: a 32-byte felt rendered as a quoted "0x..." hex string costs up to 68 bytes versus 32 raw bytes. I'm fairly sure this has horrible costs even though the result is "highly compressed".

pierre-l commented 1 year ago

The bincode and MessagePack implementations are not 100% compatible with serde and failed at deserialization during my tests.

Mirko-von-Leipzig commented 1 year ago

> The bincode and MessagePack implementations are not 100% compatible with serde and failed at deserialization during my tests.

Could you elaborate on what failed? I imagine the transaction types themselves are probably also not great for this.

pierre-l commented 1 year ago

Bincode apparently just doesn't support tagged enums: https://github.com/bincode-org/bincode/issues/548. I assume it's a similar issue with MessagePack. I haven't dug deeper.
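For reference, the kind of serde attribute that trips bincode up looks like this (a toy example, not pathfinder's actual types):

```rust
// Internally tagged enums require a self-describing format: deserialization
// goes through serde's `deserialize_any`, which bincode does not implement.
#[derive(serde::Serialize, serde::Deserialize)]
#[serde(tag = "type")]
enum Transaction {
    Invoke { max_fee: u64 },
    Declare { class_hash: String },
}
```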

One good candidate I've found while staying in the "serde-compatible and schemaless" field is flexbuffers. Here are the results over 1M transaction + receipt pairs:

json/zstd avg size: 1124
flex/zstd avg size: 774

That's a 31% reduction in size.

Mirko-von-Leipzig commented 1 year ago

We don't need to support tagged enums if we don't want to. To explain a bit:

The current transaction enum / structs are based on the json schema we receive from the starknet feeder gateway. We re-use this type in many different places, which is not great, as it effectively couples our codebase to the gateway schema. This is particularly bad for storage, since a schema change in the gateway would cause our existing dbs to misalign.

A better solution is to have a common type which has no serde restrictions at all and is friendlier to use (read: less nested nonsense with tags).

I started something more flattened like this in #1209 (crates/common/src/transaction.rs). Maybe copy those types and see if you can't get those to work instead? The type conversions should also be implemented in that PR somewhere.
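Roughly, a flattened, serde-agnostic shape might look like this (illustrative field names only, not the actual #1209 definitions):

```rust
// Illustrative flattened types: no serde attributes baked in, so the storage
// layer can pick its own encoding (e.g. a plain integer variant index
// instead of a string tag).
type Felt = [u8; 32];

pub enum Transaction {
    Invoke(InvokeTransaction),
    Declare(DeclareTransaction),
}

pub struct InvokeTransaction {
    pub sender_address: Felt,
    pub calldata: Vec<Felt>,
    pub max_fee: Felt,
}

pub struct DeclareTransaction {
    pub sender_address: Felt,
    pub class_hash: Felt,
}
```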

pierre-l commented 1 year ago

The fact that I only noticed the issues upon deserialization worries me: we could insert data into the DB that we later can't deserialize. It doesn't inspire confidence. Flexbuffers' size performance is really close to bincode & MessagePack, and it's also 3 times faster than JSON ser/de. My only concern with flexbuffers is that the crate was last published 2 years ago.

I'll take a look at #1209.

Mirko-von-Leipzig commented 1 year ago

> The fact that I only noticed the issues upon deserialization worries me: we could insert data into the DB that we later can't deserialize. It doesn't inspire confidence.

Yeah that is worrisome.

Something I've also only just realized is that our Felt type defaults to string encoding regardless of what you do here. That encoding is pretty pervasive and difficult to remove from our codebase in one go. So if you go for copying #1209 it would be good to replace Felt with just a [u8; 32] and strip the leading zeros during encoding. This will probably give a massive bump in all metrics.
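For illustration, stripping the leading zeros during encoding could look like this (a sketch only; `RawFelt` is a stand-in name, not an existing type):

```rust
use serde::{Serialize, Serializer};

// Stand-in felt newtype: serialize only the significant big-endian bytes,
// dropping leading zeros, instead of the pervasive string encoding.
struct RawFelt([u8; 32]);

impl Serialize for RawFelt {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let start = self.0.iter().position(|&b| b != 0).unwrap_or(self.0.len());
        serializer.serialize_bytes(&self.0[start..])
    }
}
```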

pierre-l commented 1 year ago

Results with the common types from #1209:

json/zstd avg size: 1124 -> 961
flex/zstd avg size: 774 -> 732

So a good improvement with json, a minor one with flexbuffers.

@Mirko-von-Leipzig Could you show me an example of where Felt is used and could be replaced?

Mirko-von-Leipzig commented 1 year ago

> Results with the common types from #1209:
>
> json/zstd avg size: 1124 -> 961
> flex/zstd avg size: 774 -> 732
>
> So a good improvement with json, a minor one with flexbuffers.
>
> @Mirko-von-Leipzig Could you show me an example of where Felt is used and could be replaced?

It's the base starknet type -- almost every field is a Felt under the hood 😅 To prevent accidentally passing the wrong values around, we make extensive use of newtype wrappers around Felt, generated by a macro or two.

https://github.com/eqlabs/pathfinder/blob/main/crates/common/src/lib.rs#L401-L438

Probably the easiest way to test this initially is to replace all the felt wrappers with just a single felt equivalent like:

```rust
struct Felt2([u8; 32]);
```

This wouldn't be the way to approach this in production, but it should give us numbers at least.

Most felt values represent hashes of some sort. Since these are essentially random, you can expect close to the full 32 bytes to always be used. However, some felt types represent actual values set by users, and these tend to use only a couple of bytes. A further optimisation would be to identify any such felt types and encode only the bytes used. But I think that can probably wait.

Or maybe a better way is to just go straight away with

```rust
struct Felt2(Vec<u8>);
```

for all felts, and then just iterate and skip all leading zero bytes when doing the conversions.
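A sketch of that conversion, using the Felt2(Vec<u8>) above (the `to_be_bytes()` accessor on the existing Felt is assumed; the real method name may differ):

```rust
// Trim on conversion: keep only the significant big-endian bytes, so an
// all-zero felt becomes an empty vec.
impl From<Felt> for Felt2 {
    fn from(felt: Felt) -> Self {
        let bytes = felt.to_be_bytes(); // assumed accessor on the existing Felt
        let start = bytes.iter().position(|&b| b != 0).unwrap_or(bytes.len());
        Felt2(bytes[start..].to_vec())
    }
}
```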

pierre-l commented 1 year ago

I realized yesterday that I had mixed up flexbuffers and MessagePack, making the former look more promising than it really is. So I ran everything again, and added a custom serializer as well as a flexible deserializer accepting both a string and a sequence of bytes.
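Such a flexible deserializer might be a serde visitor accepting either representation (a sketch of the idea, not the exact code used for these measurements; it relies on `deserialize_any`, so it only works with self-describing formats, and it uses the hex crate for the legacy string form):

```rust
use serde::de::{self, Deserializer, Visitor};

// Accept either the legacy string form of a felt or a raw byte sequence.
fn deserialize_felt<'de, D: Deserializer<'de>>(d: D) -> Result<Vec<u8>, D::Error> {
    struct FeltVisitor;

    impl<'de> Visitor<'de> for FeltVisitor {
        type Value = Vec<u8>;

        fn expecting(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
            f.write_str("a hex string or a byte sequence")
        }

        fn visit_str<E: de::Error>(self, v: &str) -> Result<Self::Value, E> {
            // Simplified legacy path: parse "0x..."-style hex (assumes even length).
            hex::decode(v.trim_start_matches("0x")).map_err(E::custom)
        }

        fn visit_bytes<E: de::Error>(self, v: &[u8]) -> Result<Self::Value, E> {
            Ok(v.to_vec())
        }
    }

    d.deserialize_any(FeltVisitor)
}
```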

Average size in bytes per transaction + receipt pair:

| Crate | Compression | Common + trimmed bytes serializer | Common + bytes serializer | Common | Gateway | Gateway + trimmed bytes serializer | Gateway + bytes serializer |
|---|---|---|---|---|---|---|---|
| serde_json | zstd | 1124 | 1163 | 960 | 1123 | 1211 | 1228 |
| serde_json | raw | 2337 | 2983 | 1830 | 2136 | 2399 | 2565 |
| serde_json | gz | 1123 | 1168 | 969 | 1129 | 1209 | 1228 |
| serde_json | lz4 | 1764 | 1798 | 1460 | 1692 | 1878 | 1895 |
| flexbuffers* | zstd | 1091 | 1120 | 1176 | 1347 | 1342 | 1353 |
| flexbuffers* | raw | 1479 | 1812 | 1913 | 2200 | 1993 | 2077 |
| flexbuffers* | gz | 1113 | 1168 | 1197 | 1373 | 1363 | 1375 |
| flexbuffers* | lz4 | 1252 | 1298 | 1618 | 1846 | 1679 | 1696 |
| rmp_serde* | zstd | 688 | 705 | 732 | 774 | 797 | 809 |
| rmp_serde* | raw | 1025 | 1349 | 1418 | 1489 | 1293 | 1376 |
| rmp_serde* | gz | 720 | 772 | 752 | 800 | 831 | 842 |
| rmp_serde* | lz4 | 822 | 856 | 1162 | 1234 | 1078 | 1095 |
| bincode* | zstd | 715 | 713 | 739 | 795 | 825 | 829 |
| bincode* | raw | 1311 | 1635 | 1715 | 1864 | 1666 | 1749 |
| bincode* | gz | 772 | 774 | 745 | 813 | 854 | 861 |
| bincode* | lz4 | 883 | 886 | 1204 | 1300 | 1150 | 1154 |
| bson* | zstd | 1050 | 1043 | 1073 | 1239 | 1266 | 1265 |
| bson* | raw | 1623 | 1947 | 2027 | 2371 | 2172 | 2255 |
| bson* | gz | 1090 | 1090 | 1085 | 1252 | 1288 | 1290 |
| bson* | lz4 | 1256 | 1263 | 1586 | 1814 | 1664 | 1664 |
| ciborium* | zstd | 900 | 928 | 934 | 1093 | 1130 | 1144 |
| ciborium* | raw | 1299 | 1634 | 1703 | 1970 | 1772 | 1858 |
| ciborium* | gz | 930 | 983 | 949 | 1108 | 1153 | 1167 |
| ciborium* | lz4 | 1050 | 1098 | 1400 | 1612 | 1453 | 1474 |

So size-wise, bincode and MessagePack are still the most promising, but there is no silver bullet. Important note: I had deserialization problems with all of these except serde_json. These might be of my own doing, but I find it odd that serde_json is the only one that gave me zero trouble.

Compression-wise, no silver bullet either.

Serializing felts as bytes doesn't help at all with JSON, for fairly obvious reasons. I would have thought compression would erase that difference, but it apparently doesn't.

Trimming the leading zeroes sometimes produces the opposite of the expected effect. Maybe it messes with byte alignment?

This is overall inconclusive. Switching to another format would not only be cumbersome, it would imply risks I'm not comfortable taking. Whatever solution we opt for, a ~45% reduction in size doesn't look appealing enough to me.

The only thing I would like us to go forward with is the new common types as they bring other advantages.

kkovaacs commented 4 months ago

I guess we could close this now that we've implemented bincode and some other optimizations?