eqlabs / pathfinder

A Starknet full node written in Rust
https://eqlabs.github.io/pathfinder/

Storage: reduce transaction size #1382

Closed Mirko-von-Leipzig closed 4 months ago

Mirko-von-Leipzig commented 1 year ago

Our database size is growing far too large. Our transactions table starknet_transactions consumes a whopping 20GB on mainnet alone.

Transaction data is currently stored as zstd-compressed json. We should be able to do much better using a binary format of some kind, maybe bincode?

As a first step, it would be good to measure the size differences between candidate encodings. One way of achieving this is to read in and convert each transaction + receipt, totalling the encoded byte sizes.
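A rough sketch of such a measurement harness (hypothetical; `compare_encodings` and the choice of bincode as the candidate are illustrative, not a decided design):

```rust
// Hypothetical harness: total the zstd-compressed size of two candidate
// encodings over the same set of values (e.g. transaction + receipt pairs).
fn compare_encodings<T: serde::Serialize>(txs: &[T]) -> anyhow::Result<()> {
    let (mut json_total, mut bincode_total) = (0usize, 0usize);
    for tx in txs {
        // Current scheme: JSON, then zstd (level 0 selects zstd's default).
        json_total += zstd::encode_all(serde_json::to_vec(tx)?.as_slice(), 0)?.len();
        // Candidate scheme: bincode, then zstd.
        bincode_total += zstd::encode_all(bincode::serialize(tx)?.as_slice(), 0)?.len();
    }
    println!("json/zstd total: {json_total} B, bincode/zstd total: {bincode_total} B");
    Ok(())
}
```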

If an encoding is a clear winner, we then need to add a database migration to perform this re-encoding. Another option if the migration is too costly is to add a version column to the table which determines which encoding the data is using. This would allow new data to use the new encoding, and old data just remains in place.
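A minimal sketch of that version-column read path (the version numbers, function name, and column semantics here are hypothetical):

```rust
// Hypothetical dispatch on a per-row encoding version: old rows stay as
// zstd-compressed JSON, new rows use the new binary encoding.
fn decode_row<T: serde::de::DeserializeOwned>(version: u32, compressed: &[u8]) -> anyhow::Result<T> {
    let raw = zstd::decode_all(compressed)?;
    match version {
        0 => Ok(serde_json::from_slice(&raw)?), // legacy encoding
        1 => Ok(bincode::deserialize(&raw)?),   // new encoding
        v => anyhow::bail!("unknown transaction encoding version {v}"),
    }
}
```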

CHr15F0x commented 1 year ago

Could be useful: https://www.lucidchart.com/techblog/2019/12/06/json-compression-alternative-binary-formats-and-compression-methods/

Mirko-von-Leipzig commented 1 year ago

The big thing that the json comparison misses is that we have bigints that we represent as text in json: a 32-byte felt rendered as a quoted "0x..." hex string costs up to 68 bytes versus 32 raw bytes. I'm fairly sure this has horrible costs even though the result is "highly compressed".

pierre-l commented 1 year ago

The bincode and MessagePack implementations are not 100% compatible with serde and failed at deserialization during my tests.

Mirko-von-Leipzig commented 1 year ago

> The bincode and MessagePack implementations are not 100% compatible with serde and failed at deserialization during my tests.

Could you elaborate on what failed? I imagine the transaction types themselves are probably also not great for this.

pierre-l commented 1 year ago

Bincode apparently just doesn't support tagged enums: https://github.com/bincode-org/bincode/issues/548. I assume it's a similar issue with MessagePack. I haven't dug deeper.
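For reference, the kind of serde attribute that trips bincode up looks like this (a toy example, not pathfinder's actual types):

```rust
// Internally tagged enums require a self-describing format: deserialization
// goes through serde's `deserialize_any`, which bincode does not implement.
#[derive(serde::Serialize, serde::Deserialize)]
#[serde(tag = "type")]
enum Transaction {
    Invoke { max_fee: u64 },
    Declare { class_hash: String },
}
```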

One good candidate I've found while staying in the "serde-compatible and schemaless" field is flexbuffers. Here are the results over 1M transaction + receipt pairs:

json/zstd avg size: 1124
flex/zstd avg size: 774

That's a 31% reduction in size.

Mirko-von-Leipzig commented 1 year ago

We don't need to support tagged enums if we don't want to. To explain a bit:

The current transaction enum / structs are based on the json schema we receive from the starknet feeder gateway. We re-use this type in many different places, which is not great, as it effectively couples our codebase to the gateway schema. This is particularly bad for storage, since a schema change in the gateway would cause our existing dbs to misalign.

A better solution is to have a common type which has no serde restrictions at all and is friendlier to use (read: less nested nonsense with tags).

I started something more flattened like this in #1209 (crates/common/src/transaction.rs). Maybe copy those types and see if you can't get those to work instead? The type conversions should also be implemented in that PR somewhere.
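Roughly, a flattened, serde-agnostic shape might look like this (illustrative field names only, not the actual #1209 definitions):

```rust
// Illustrative flattened types: no serde attributes baked in, so the storage
// layer can pick its own encoding (e.g. a plain integer variant index
// instead of a string tag).
type Felt = [u8; 32];

pub enum Transaction {
    Invoke(InvokeTransaction),
    Declare(DeclareTransaction),
}

pub struct InvokeTransaction {
    pub sender_address: Felt,
    pub calldata: Vec<Felt>,
    pub max_fee: Felt,
}

pub struct DeclareTransaction {
    pub sender_address: Felt,
    pub class_hash: Felt,
}
```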

pierre-l commented 1 year ago

The fact that I only noticed the issues upon deserialization worries me: we could insert data into the DB that we later can't deserialize. It doesn't inspire confidence. Flexbuffers' size performance is really close to bincode & MessagePack, and it's also 3 times faster than JSON ser/de. My only concern with flexbuffers is that the crate was last published 2 years ago.

I'll take a look at #1209.

Mirko-von-Leipzig commented 1 year ago

> The fact that I only noticed the issues upon deserialization worries me: we could insert data into the DB that we later can't deserialize. It doesn't inspire confidence.

Yeah that is worrisome.

Something I've also only just realized is that our Felt type defaults to string encoding regardless of what you do here. That encoding is pretty pervasive and difficult to remove from our codebase in one go. So if you go for copying #1209 it would be good to replace Felt with just a [u8; 32] and strip the leading zeros during encoding. This will probably give a massive bump in all metrics.
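For illustration, stripping the leading zeros during encoding could look like this (a sketch only; `RawFelt` is a stand-in name, not an existing type):

```rust
use serde::{Serialize, Serializer};

// Stand-in felt newtype: serialize only the significant big-endian bytes,
// dropping leading zeros, instead of the pervasive string encoding.
struct RawFelt([u8; 32]);

impl Serialize for RawFelt {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let start = self.0.iter().position(|&b| b != 0).unwrap_or(self.0.len());
        serializer.serialize_bytes(&self.0[start..])
    }
}
```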

pierre-l commented 1 year ago

Results with the common types from #1209:

json/zstd avg size: 1124 -> 961
flex/zstd avg size: 774 -> 732

So a good improvement with json, a minor one with flexbuffers.

@Mirko-von-Leipzig Could you show me an example of where Felt is used and could be replaced?

Mirko-von-Leipzig commented 1 year ago

> Results with the common types from #1209:
>
> json/zstd avg size: 1124 -> 961
> flex/zstd avg size: 774 -> 732
>
> So a good improvement with json, a minor one with flexbuffers.
>
> @Mirko-von-Leipzig Could you show me an example of where Felt is used and could be replaced?

It's the base starknet type -- almost every field is a Felt under the hood 😅 To prevent accidentally passing the wrong values around, we make extensive use of newtype wrappers around Felt, generated by a macro or two.

https://github.com/eqlabs/pathfinder/blob/main/crates/common/src/lib.rs#L401-L438

Probably the easiest way to test this initially is to replace all the felt wrappers with just a single felt equivalent like:

```rust
struct Felt2([u8; 32]);
```

This wouldn't be the way to approach this in production, but it should give us numbers at least.

Most felt values represent hashes of some sort. Since these are essentially random, you can expect close to the full 32 bytes to always be used. However, some felt types represent actual values set by users, and these tend to use only a couple of bytes. A further optimisation would be to identify any such felt types and encode only the bytes used. But I think that can probably wait.

Or maybe a better way is to just go straight away with

```rust
struct Felt2(Vec<u8>);
```

for all felts, and then just iterate and skip all leading zero bytes when doing the conversions.
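A sketch of that conversion, using the Felt2(Vec<u8>) above (the `to_be_bytes()` accessor on the existing Felt is assumed; the real method name may differ):

```rust
// Trim on conversion: keep only the significant big-endian bytes, so an
// all-zero felt becomes an empty vec.
impl From<Felt> for Felt2 {
    fn from(felt: Felt) -> Self {
        let bytes = felt.to_be_bytes(); // assumed accessor on the existing Felt
        let start = bytes.iter().position(|&b| b != 0).unwrap_or(bytes.len());
        Felt2(bytes[start..].to_vec())
    }
}
```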

pierre-l commented 1 year ago

I realized yesterday that I had mixed up flexbuffers and MessagePack, making the former look more promising than it really is. So I ran everything again, and added a custom serializer as well as a flexible deserializer accepting both a string and a sequence of bytes.
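Such a flexible deserializer might be a serde visitor accepting either representation (a sketch of the idea, not the exact code used for these measurements; it relies on `deserialize_any`, so it only works with self-describing formats, and it uses the hex crate for the legacy string form):

```rust
use serde::de::{self, Deserializer, Visitor};

// Accept either the legacy string form of a felt or a raw byte sequence.
fn deserialize_felt<'de, D: Deserializer<'de>>(d: D) -> Result<Vec<u8>, D::Error> {
    struct FeltVisitor;

    impl<'de> Visitor<'de> for FeltVisitor {
        type Value = Vec<u8>;

        fn expecting(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
            f.write_str("a hex string or a byte sequence")
        }

        fn visit_str<E: de::Error>(self, v: &str) -> Result<Self::Value, E> {
            // Simplified legacy path: parse "0x..."-style hex (assumes even length).
            hex::decode(v.trim_start_matches("0x")).map_err(E::custom)
        }

        fn visit_bytes<E: de::Error>(self, v: &[u8]) -> Result<Self::Value, E> {
            Ok(v.to_vec())
        }
    }

    d.deserialize_any(FeltVisitor)
}
```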

Average size in bytes per transaction + receipt pair:

| Crate | Compression | Common + trimmed bytes serializer | Common + bytes serializer | Common | Gateway | Gateway + trimmed bytes serializer | Gateway + bytes serializer |
|---|---|---|---|---|---|---|---|
| serde_json | zstd | 1124 | 1163 | 960 | 1123 | 1211 | 1228 |
| serde_json | raw | 2337 | 2983 | 1830 | 2136 | 2399 | 2565 |
| serde_json | gz | 1123 | 1168 | 969 | 1129 | 1209 | 1228 |
| serde_json | lz4 | 1764 | 1798 | 1460 | 1692 | 1878 | 1895 |
| flexbuffers* | zstd | 1091 | 1120 | 1176 | 1347 | 1342 | 1353 |
| flexbuffers* | raw | 1479 | 1812 | 1913 | 2200 | 1993 | 2077 |
| flexbuffers* | gz | 1113 | 1168 | 1197 | 1373 | 1363 | 1375 |
| flexbuffers* | lz4 | 1252 | 1298 | 1618 | 1846 | 1679 | 1696 |
| rmp_serde* | zstd | 688 | 705 | 732 | 774 | 797 | 809 |
| rmp_serde* | raw | 1025 | 1349 | 1418 | 1489 | 1293 | 1376 |
| rmp_serde* | gz | 720 | 772 | 752 | 800 | 831 | 842 |
| rmp_serde* | lz4 | 822 | 856 | 1162 | 1234 | 1078 | 1095 |
| bincode* | zstd | 715 | 713 | 739 | 795 | 825 | 829 |
| bincode* | raw | 1311 | 1635 | 1715 | 1864 | 1666 | 1749 |
| bincode* | gz | 772 | 774 | 745 | 813 | 854 | 861 |
| bincode* | lz4 | 883 | 886 | 1204 | 1300 | 1150 | 1154 |
| bson* | zstd | 1050 | 1043 | 1073 | 1239 | 1266 | 1265 |
| bson* | raw | 1623 | 1947 | 2027 | 2371 | 2172 | 2255 |
| bson* | gz | 1090 | 1090 | 1085 | 1252 | 1288 | 1290 |
| bson* | lz4 | 1256 | 1263 | 1586 | 1814 | 1664 | 1664 |
| ciborium* | zstd | 900 | 928 | 934 | 1093 | 1130 | 1144 |
| ciborium* | raw | 1299 | 1634 | 1703 | 1970 | 1772 | 1858 |
| ciborium* | gz | 930 | 983 | 949 | 1108 | 1153 | 1167 |
| ciborium* | lz4 | 1050 | 1098 | 1400 | 1612 | 1453 | 1474 |

So size-wise, bincode and MessagePack are still the most promising, but there is no silver bullet. Important note: I had deserialization problems with all of these except serde_json. These might be of my own doing, but I find it odd that serde_json is the only one that gave me zero trouble.

Compression-wise, no silver bullet either.

Serializing felts as bytes doesn't help at all with JSON, for fairly obvious reasons. I would have thought compression would erase that difference, but it apparently doesn't.

Trimming the leading zeroes sometimes produces the opposite of the expected effect. Maybe it messes with byte alignment?

This is overall inconclusive. Switching to another format would not only be cumbersome, it would imply risks I'm not comfortable taking. Whatever solution we opt for, a ~45% reduction in size doesn't look appealing enough to me.

The only thing I would like us to go forward with is the new common types as they bring other advantages.

kkovaacs commented 4 months ago

I guess we could close this now that we've implemented bincode and some other optimizations?