Serialization layer as serious bottleneck

remmeier commented 5 years ago

We have been investigating the performance of our code node. Among a great many thing we managed to optimized and achieved some "ok-ish" numbers. A closer look now revealed that about 80% of the performance now goes away in the serialization layer. This is kind of unexpected as I would rather have expected the database, hashing and asymetric crypto to be the main bottlenecks. The situation is aggrevated by the fact that every transaction has a transaction id. This in turn is computed as the hash of all its elements (input states, notaries, output states, time windows, etc.), which each triggers a serialization again.

For testing purposes to get a closer view we made use of:

  private static SerializedBytes serialize(Object object) {
        SerializationFactory defaultFactory = SerializationFactory.Companion.getDefaultFactory();
        SerializationContext defaultContext = defaultFactory.getDefaultContext();
        return defaultFactory.serialize(object, defaultContext);
    }

and serialized a single state with about two dozen fields. A resulting byte array was 3869 bytes long. One CPU core managed to serialize 2800 of those objects every second. If we assume that a great many objects are part of a transaction, then the pictures gets clearer why it takes this amount of time.

To give a reference, we serialized the same object with ObjectMapper from Jackson by first constructing a writer for the desired state type and then measured performance serializing that state object. Jackson managed to serialize 99500 objects every second. A factor 40 compared to AMQP. The json length of the result was 1065. I consider JSON rather ineffient but managed to be 75% smaller than AMQP while still being "standalone" not requiring a external model to deserialize. ProtoBuffer and friends would be another order of magnitude, but at the cast of an external model.

When looking at it with a profiler, ones sees:

grafik

There is heavy work needed to serialize a great number of DescribedTypeElement. A closer look at the implemention shows for example:

val data = Data.Factory.create()
data.withDescribed(Envelope.DESCRIPTOR_OBJECT) {
       withList {
           writeObject(obj, this, context)
           val schema = Schema(schemaHistory.toList())
           writeSchema(schema, this)
           writeTransformSchema(TransformsSchema.build(schema, serializerFactory), this)
      }
}
....

see https://github.com/corda/corda/blob/4dd51de5c1d14901ce143502c21b87ac0863543f/serialization/src/main/kotlin/net/corda/serialization/internal/amqp/SerializationOutput.kt

as a first measure might be to cache the serialization of the schema part to directly get the byte array from a given cached schema history. Maybe providing a decent speed-bump. For a database perspective it may also would proof worthwile to seperate the data and the model seperately, avoiding the redudant storage of the model part.

For Corda applications to move towards more high-volume applications, this ticket feels rather important. Alternatively it would kind of also be nice to see plain JSON support (or something similar). There is widespread support across all devices, easy to read/write, standards how to compute a signature and very performant implementations.

rick-r3 commented 5 years ago

Thanks for the detailed analysis Remo.

You are right, we have much efficiency related performance tuning and space optimisation work left to do in the serialization framework and the surrounding areas of Corda.

Our focus to-date, and in the near term, has been on enabling scalability and overall throughput. Be assured that the optimisations you mention and others are on our list to tackle, once priorities and resources allow.

remmeier commented 4 years ago

I took a deeper look at it in order to fix it. We expect a few ten million records, so performance is critical.

There is a "quick" solution to just make things faster by caching the schema. What I managed to do is instead of serializing the schema, I only put a small player holder into the data structure:

SerializationOutput:

  open fun writeSchema(schema: Schema, data: Data, context: SerializationContext, optimize: Boolean) {
       var placeholder = byteArrayOf(0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11, 0x11)
       data.putBinary(placeholder)
       // data.putObject(schema)
    }

And in a second step I can patch the AMQP structure with the real schema. There are four main elements:

envelop
data
schema
transformation (not seed to be used so far)

It is rather straighforward to find the right place and do that replacement. In our first use case the schema makes up 90% of the serialized bytes. This in turn safes about a factor of 10x of serialization work (with some new, simple array manipulations). As a minor catch, one further has to patch the AMQP ListElement which holds the total size of all its data, which changes due to the placeholder. As further minor complication AMQP stores the length of the schema in the serialized output, which in turn has a variable length coding depending on whether it is larger or smaller than 255 bytes.

Compared to Jackson it will still a bit slow should I gain that factor 10x. But it is not so suprising. For example, that encoding of the size by AMQP makes things more complicate. It has to traverse the complete object graph to compute the size of subelements, making that part similarly expensive as the "real" serialization.

If there is an interest in a PR, I could do that, I'm close to finishing it up for our use case. I'm carefully optimistic to achieve about 1000 tps on a older eight core machine (with further other optimizations) rivaling those official 32/64-core numbers.

But general question is where to move in this area. The serialization mechanism is inefficient and taking a huge amount of space in the database, so a split of model and data would be desirable, at least for storage. So maybe the manipulation above could be a starting point for that as well. Or of course the possiblity to support someting else like JSON (maybe even https://www.w3.org/TR/vc-data-model/ to allow interaction with other systems). Since all this impacts both long-term storage and the interaction with non-corda systems/clients, IMHO simplicity could be an important characteristic. Understanding and replicating all the AMQP things is rather challening and support beyond Java very limited. If there is an interest for JSON prototype PR, may would find time as well. I'm not quite sure if this will ever be an option or more a "hell will first freeze over" scenario :-) since there has been quite some investiment in AMQP for serialization. For sure it would have to complement the existing serialization rather than replace it.

remmeier commented 4 years ago

first draft in the commit above, 8x the performance

remmeier commented 4 years ago

I managed to find time to look deeper into the topic. opened the first two PRs if there is an interest. Will then have some more. Have already seen 1000 (bulk-based) tps on a quad core machine with room for improvement. Can share more details & code in the near future.

My current main question would be whether two apply the same optimization as in the commit above to DBTransactionStorage? Meaning storing the schema separely. Would be doable without any deeper Corda change, just an update to the storage layer. And could be made backward compatible by support old/new storage format. It should shrink database storage usage quite drastically. In those 1000 tps, recording transactions to the database is currently one of the main bottlenecks (due to Hibernate and use of Corda serialization and some other already optimized things).

oabasov-lohika-tix commented 4 years ago

Hi @remmeier , faced the same issue, checked your first commit, and failed with errors during the deserialization, could you please explain does it require to adapt DeserializationInput to support your schema placeholder. I will try to check the second commit a bit later. Thanks for help anyway.

Don't know if it's a related topic, but I see that there is another bottleneck, not only in ser\deser process, but smth happens after retrieving the message back on a client via corda future.

2020-01-08 12:12:45.982 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler   : Got message from RPC server RpcReply(id=fc317c4a-3de4-4936-b4c3-768b8b727245, timestamp: 2020-01-08T10:12:44.237Z, entityType: Invocation, result=Success(FlowHandleImpl(id=[16566124-f7d2-41cf-b3a4-f86846073632], returnValue=net.corda.core.internal.concurrent.CordaFutureImpl@58f8aa01)), deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)
2020-01-08 12:12:45.986 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler   : Got message from RPC server Observation(id=b3f0b064-6d82-4900-85e6-e70b7d00926a, timestamp: 2020-01-08T10:11:26.411Z, entityType: Invocation, content=[rx.Notification@b461fac0 OnNext Added(stateMachineInfo=StateMachineInfo([16566124-f7d2-41cf-b3a4-f86846073632], com.tradeix.cordapp.paymentcommitment.workflow.asset.flow.ImportAssetFlow))], deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)
2020-01-08 12:12:45.987 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler   : Got message from RPC server Observation(id=12887a04-f22c-422d-b684-c679f137d66b, timestamp: 2020-01-08T10:12:45.979Z, entityType: Invocation, content=[rx.Notification@4c59250 OnNext Starting], deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)
2020-01-08 12:12:58.603 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler   : Got message from RPC server Observation(id=b83c15ca-9047-4958-a106-65165e5abfbd, timestamp: 2020-01-08T10:12:45.975Z, entityType: Invocation, content=[rx.Notification@e03cfa2d OnNext [B@2dceac3d], deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)
2020-01-08 12:12:58.605 DEBUG [,,,] 78798 --- [global-threads)] n.c.c.r.internal.RPCClientProxyHandler   : Got message from RPC server Observation(id=b83c15ca-9047-4958-a106-65165e5abfbd, timestamp: 2020-01-08T10:12:45.975Z, entityType: Invocation, content=[rx.Notification@15895539 OnCompleted], deduplicationIdentity=e3f6d696-dea4-45b0-95b8-f9c0fe363a9f)

There is a big gap between events 12:12:45.987 OnNext Starting - the start of the flow which consumes 1k objects 12:12:58.603 OnNext [B@2dceac3d] - the actual result of the operation. So it's ~12.5s. According to Jprofiler corda processed flow in ~1.3s and send the result back. Do you have any clue what can consume so much time?

remmeier commented 4 years ago

there is no change necessary on the deserialization logic. Nothing changes on-the-wire.

Will have a further commit ready in about 1h. Just running the tests which take a while. This commit will make sure unit test cover all the changes.

Did you take the commit from my PR? or the old one I mentioned earlier (outdates one)? I noticed that the PR failed as well, while it worked locally for me. Have to look into that. Since my first commit I have various more optimizations, will create PRs for them invidiually for people to decide which one they like or which ones they don't... After that I properly I will update my fork with all the changes and publish it somewhere together with our new bulk flows and a ready-to-use example.

I did not analyze corda-rpc so far. Maybe a run of JProfiler or something helps.

oabasov-lohika-tix commented 4 years ago

Did you take the commit from my PR?

The old one (https://github.com/remmeier/corda/commit/8963d61facec7361e591a3f997df0fcc4516d7c4), tried to apply on community Corda release os 4.4.

remmeier commented 4 years ago

For today I would only look at the PRs. The old commit from a month a ago was more an early prototype, based on Corda 4.1 and had a number of other changes that made the Corda build break. Since then I have further optimizations beyond serialization. The PR should be fine. Just now updated to verify the the optimization in SerializationOutputTests.

I'm proceeding with further PRs. Within the next one, two weeks I further hope to publish our bulk-based flows and an example on Github together with a flavor of Corda having all optimizations applied. That will show how fast we got for cardossier. As usual for optimizations, many pieces have to fall in place for everything to really work out. But the PR above is one of the more substantial ones and should be noticable on its own.

remmeier commented 4 years ago

https://github.com/corda/corda/pull/5841 addresses the redudant computation of the serialized schema. This helps with CPU performance, but does nothing for DB size. But for the DB the issues are very similar. Every record of type DBTransactionStorage.DBTransaction host contain a redudant copy of the schema making up most of the space (raw in older versions, compressed in newer version). This makes the storage of a transaction both expensive from CPU (serialization+ compression) and disk.

Not complete, but people may can decide whether there is an interest in someting like this:

https://github.com/remmeier/corda/blob/feature/CORDA5725-optimize-db-size/node/src/main/kotlin/net/corda/node/services/persistence/DBTransactionStorage.kt

It introduces DBTransactionStorage.DBTransactionSchema with a primary key schemaId. It would allow to share the schema between many transactions. If Corda would like to move to larger data sets with millions or hundereds of millions of records, something like this could proof useful.

An alternative would be to move the schema into the attachments. Maybe cleaner but also a larger changes.

corda / corda

Serialization layer as serious bottleneck #5725