Zaid-Ajaj / Fable.Remoting

Type-safe communication layer (RPC-style) for F# featuring Fable and .NET Apps
https://zaid-ajaj.github.io/Fable.Remoting/
MIT License
272 stars 54 forks source link

Poor binary deserialization performance in web client #283

Closed juselius closed 2 years ago

juselius commented 2 years ago

The following code gives wildly different performance when run in the browser or using the .NET client:

type IData = {
    getData : Guid -> Async<int array * single array>
}

let server =
    Remoting.createApi "https://data.myserver.io"
    |> Remoting.withBinarySerialization
    |> Remoting.buildProxy<IData>

My use case has 950 000 + 330 000 elements, ~60 MB in total. The timings are:

I wouldn't be surprised if the browser was a bit slower than .NET, but not a factor of 30.

kerams commented 2 years ago

The Fable (de)serialization did not undergo any form of optimizations like the .NET bits, so I'm sure there are plenty of low hanging fruits to be reaped there. On the other hand, your use case boils down to creating 2 arrays and writing over a million numbers into them and I don't know if there is anything to speed up there.

https://github.com/Zaid-Ajaj/Fable.Remoting/blob/d52f891d99c3c323463f9b90d0d32b459a32e8c7/Fable.Remoting.MsgPack/Read.fs#L230-L245

You have to remember that JS is interpreted, so it won't be just "a bit slower" even when running nearly identical code. (Now that I think about it, perhaps there is some JIT compilation step in the browser? I don't know.)

Can you post the code you're using to measure the client deserialization time? It would help me in investigating further.

juselius commented 2 years ago

@kerams in the browser I'm just wrapping the API calls in JS.console.time / timeEnd and subtracting the transfer times obtained from the Network Monitor in the browser. On .NET I'm just using System.Diagnostics.StopWatch.

I suspect it's the x.Read call which is causing the slowdown, since it has to do a lot of pattern matching for every element. Perhaps one should "invert" the order for arrays: First look up the type, and the read with a fixed expression. Ugly, but optimizations usually are...

juselius commented 2 years ago

@kerams I'l flip the order of things, and PR if it helps.

kerams commented 2 years ago

First look up the type, and the read with a fixed expression. Ugly, but optimizations usually are...

It's not so simple. MessagePack arrays are not typed, so in theory every element can be of a different type. That of course is not an issue for us, but the individual Read calls are still necessary because int32 can be represented as 1, 3 or 5 bytes each, so you cannot take a shortcut and read the array in 4 byte chunks. It should be possible to use a reduced pattern match for number arrays where we know the element type beforehand, but I'm a bit skeptical as to how much it would help. It's worth trying nonetheless.

Any PR where you show a performance improvement while retaining correctness is hugely appreciated.

juselius commented 2 years ago

@kerams I'll see what I can make of it. Good point about correctness, it's easy to forget when you have been relying on a proper type system for 10+ years...

juselius commented 2 years ago

I have managed to squeeze down the timings from 18900 ms to 4900 ms for raw Int32 and Float32 arrays (so far), while at the same time keeping correctness. The code is not very pretty, but that's the price unfortunately. I will add a few more primitive types and PR.

I think this will do for now, although it should be possible to shave off another 1000 ms or more by bypassing MessagePack altogether. Perhaps I will create a Remoting.PrimitiveBinary package at some point.

juselius commented 2 years ago

Actually the best solution is probably to send everything as bin32, and do the conversion explicitly in the client. That way endianness can also be handled server side, giving additional speed ups.

juselius commented 2 years ago

I have created a minimal PR (#284) which speeds up deserialization by a factor 3-4. The main problem was not pattern matching or function calls, but garbage collection.

It's still possible to gain a factor of 1.5-2, but that requires a lot of manual unrolling and inlining, and makes the code really long, hard to change and ugly. IMHO it's not worth it.

For reference, if you really need maximum performance, convert your data manually to a byte array, and convert back in the client. Using this technique I managed to take the original deserialization down from 17 000 ms to 1600 ms.

kerams commented 2 years ago

Actually the best solution is probably to send everything as bin32, and do the conversion explicitly in the client

That also means integers in your array take 4 bytes, even if you're only concerned about small numbers in your application. In other words, you could be transferring up to 4 times as much data (0-127 are encoded as one byte in MessagePack), though (de)serialization might be faster.

juselius commented 2 years ago

@kerams you are absolutely right. I think the first approach should be using the standard serialization/deserialization, and only if that isn't cutting it, one should go for bin32. In my case I'm transferring geometry coordinates and indices for quite large meshes (> 3M triangles), which means that 99% of the numbers will large than 2^7 and 2^16 anyway.

Zaid-Ajaj commented 2 years ago

@juselius Should be resolved as of latest Fable.Remoting.Client v7.17 can you please take it for a spin? I'll close the issue. Please re-open if the issue persists