d-markey / squadron_builder

Dart code generator for Squadron workers. Implement your worker service and let squadron_builder bridge the gap with Web Workers and Isolates!
https://pub.dev/packages/squadron_builder
MIT License

[Performance] Add option to marshal/unmarshal only for web. #9

Closed. AlexDochioiu closed this issue 11 months ago.

AlexDochioiu commented 11 months ago

I think having an option to skip it for VM is really important for performance. After migrating to squadron (via squadron_builder) I see a huge degradation of performance due to all the marshaling/unmarshaling. My tests went from taking 10s (before squadron) to taking 40s. I manually removed the marshaling/unmarshaling from the generated files and performance went back to 10s.

P.S. I marshal/unmarshal to/from a byte list, so the problem is not due to a lack of optimization. I just need to send a decent amount of data and run a few thousand operations in a short time span.

d-markey commented 11 months ago

Hello @AlexDochioiu,

I'll be looking into this but would like a bit more context about your issue right now. First, do you need to run on both the VM and in the browser? If all you need is the VM, then you probably don't need marshaling and you can drop the SerializeWith annotations so that squadron_builder will not generate the marshaling code you removed manually.

If you need both platforms, I understand you removed the marshaling code and tested on the VM only (if it had worked in browsers after removing the marshaling code, it would mean you don't need marshaling at all!). So I assume the browser version still requires marshaling.

In that case, have you tried providing different marshalers based on the target platform? The idea is to keep your current marshalers for browser platforms and use the IdentityMarshaler for the VM (this special marshaler simply returns the object to be transferred as is -- no serialization and no instantiation). You can then conditionally import the right marshalers depending on the target platform, so that no serialization/deserialization happens on the VM.

Skipping serialization depending on the platform may be feasible but would complicate the generated code with a bunch of ifs that I wanted to avoid. I will look a bit more into the performance aspects on the VM and see if it's possible to improve, but I'm afraid this will take several weeks because December is pretty busy on my side! I'll let you know when I have more info.
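
For readers landing here, a minimal sketch of what the two marshaler flavors could look like (the Person type and class names are hypothetical, and real marshalers would implement the interface referenced by your SerializeWith annotations; in an actual project the two flavors would live in platform-specific files selected through conditional imports, as discussed further down this thread):

class Person {
  const Person(this.name, this.age);
  final String name;
  final int age;
}

// Browser flavor: Web Workers can only exchange structured-clonable data,
// so the object is flattened to a plain Map and rebuilt on the other side.
class WebPersonMarshaler {
  const WebPersonMarshaler();
  Map<String, dynamic> marshal(Person p) => {'name': p.name, 'age': p.age};
  Person unmarshal(dynamic data) =>
      Person(data['name'] as String, data['age'] as int);
}

// VM flavor: isolates can exchange most Dart objects directly, so the
// marshaler just passes the instance through (the same idea as the
// IdentityMarshaler mentioned above).
class VmPersonMarshaler {
  const VmPersonMarshaler();
  Person marshal(Person p) => p;
  Person unmarshal(dynamic data) => data as Person;
}

void main() {
  const alice = Person('Alice', 42);
  // Simulate a round trip through each marshaler.
  const web = WebPersonMarshaler();
  const vm = VmPersonMarshaler();
  print(web.unmarshal(web.marshal(alice)).name); // Alice, via a Map copy
  print(vm.unmarshal(vm.marshal(alice)).name); // Alice, no copy at all
}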

AlexDochioiu commented 11 months ago

Hello @d-markey,

Yes, I need to run on both the VM and the web. I (temporarily) removed the marshaling/unmarshaling manually only to confirm that this is the performance drain on the VM. Without them, it doesn't work on the web.

You can then conditionally import the right marshalers depending on the target platform

Hmm, that sounds like a reasonable solution. I'll do this so that I can still use squadron_builder instead of just plain squadron (current solution).

Skipping serialization depending on the platform may be feasible but would complicate the generated code with a bunch of ifs that I wanted to avoid. I will look a bit more into the performance aspects on the VM and see if it's possible to improve, but I'm afraid this will take several weeks because December is pretty busy on my side! I'll let you know when I have more info.

No pressure here. I opened this GitHub issue just because I think this could be a good improvement to the library. I personally think that sharing code between the VM and the web is the main attraction of squadron, which is why it seems like it would be good to make sure it's optimised for both.

AlexDochioiu commented 11 months ago

Also, thank you a lot for taking the time to get back to me.

AlexDochioiu commented 11 months ago

One more question if you don't mind. Is it possible to configure the web entrypoint? I am not a big fan of the default one

EntryPoint $getWalletTasksActivator() =>
    'lib/workers/wallet_tasks.web.g.dart.js';
d-markey commented 11 months ago

FWIW, there's sample code for conditionally imported marshalers here: https://github.com/d-markey/squadron_sample/blob/main/lib/src/perf/marshallers/marshallers.dart

AlexDochioiu commented 11 months ago

Thanks, I already have it working though. I do the conditional import directly in the source file (the one used for code generation), which avoids extra files:

import 'marshalers.vm.dart' //
    if (dart.library.js) 'marshalers.web.dart'
    if (dart.library.html) 'marshalers.web.dart';

You can probably change this:

import 'stub.dart'
    if (dart.library.io) 'vm.dart'
    if (dart.library.js) 'web.dart'
    if (dart.library.html) 'web.dart';

const personMarshaller = StringPersonMarshaller();
const intListAsStringMarshaller = ListIntAsStringMarshaller();
const intListAsBufferMarshaller = ListIntAsBufferMarshaller();
const bigIntMarshaller = BigIntMarshaller();

to

export 'stub.dart'
    if (dart.library.io) 'vm.dart'
    if (dart.library.js) 'web.dart'
    if (dart.library.html) 'web.dart';

or just (no stub)

export 'vm.dart'
    if (dart.library.js) 'web.dart'
    if (dart.library.html) 'web.dart';

to reduce the boilerplate

AlexDochioiu commented 11 months ago

One question (if you don't mind). Is there any (performance) value in creating more @SquadronService() classes (and generating multiple js files)? Wondering if I should bother separating into independent workers based on use case or just bundle them together and call it a day.

d-markey commented 11 months ago

I suppose "it depends" :-)

Personally, I wouldn't bother unless / until there's evidence that splitting is necessary.

Just remember workers run tasks on their own event loop. If your service methods are all synchronous, the worker will only execute one task at a time and queue the rest of the workload. Worker pools exist for such scenarios, helping distribute the workload across multiple workers. If you don't want to rely on worker pools for some reason, and your use case involves several synchronous tasks running at the same time, then you might have to create several @SquadronService() classes.

Implementing different service classes may help with maintenance and changeability too: each service has a clear responsibility, and you avoid having a big, single service class handling many different use cases. But then again, there are ways to organise your code base and avoid this situation, e.g. implement dedicated classes for each use case and use the service class as a "hub" for routing workloads to the right use case.

One last thing: you might also want to consider the size of the worker code when compiled to JavaScript. I believe the Dart JS runtime is baked into each worker JS file as well as any library you're using. So size may become an issue at some point.
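
As an illustration of the worker pool point above, here is a sketch of how a single service could serve many concurrent tasks through a pool. The class, constructor and method names below follow squadron / squadron_builder conventions as I understand them and are assumptions (WalletTasks is a hypothetical service, doSomeWork a hypothetical @SquadronMethod); check your generated code and squadron version for the exact API:

import 'package:squadron/squadron.dart';

// Generated by squadron_builder for the hypothetical WalletTasks service;
// the generated file name and class names may differ in your setup.
import 'wallet_tasks.worker.g.dart';

Future<void> main() async {
  // One service class, several workers: the pool spreads queued tasks
  // across its workers, so even fully synchronous service methods can
  // run concurrently without splitting the service into several classes.
  final pool = WalletTasksWorkerPool(
    concurrencySettings: ConcurrencySettings(
      minWorkers: 1,
      maxWorkers: 4,
      maxParallel: 1, // one task at a time per worker
    ),
  );
  await pool.start();
  try {
    final results = await Future.wait([
      for (var i = 0; i < 100; i++) pool.doSomeWork(i),
    ]);
    print('completed ${results.length} tasks');
  } finally {
    pool.stop();
  }
}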

AlexDochioiu commented 11 months ago

Thank you for the detailed answer!

Just remember workers run tasks on their own event loop

That's only true when the same instance of a worker is reused, right? If I create two different worker instances (from the same squadron service) then each has its own event loop and they can run in parallel?

service class as a "hub" for routing workloads to the right use case

That's exactly what I am doing, and I wanted to find out if there are any downsides to this approach.

I believe the Dart JS runtime is baked into each worker JS file as well as any library you're using.

This, to me, makes a good case for a small number of squadron services (or a single one) per project or module.

d-markey commented 11 months ago

That's only true when the same instance of a worker is reused, right? If I create two different worker instances (from the same squadron service) then each has its own event loop and they can run in parallel?

Yes, one worker = one dedicated event loop.
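
A tiny, self-contained way to see this with plain dart:isolate (no squadron involved, VM only, assuming Dart 2.19+ for Isolate.run): each isolate gets its own event loop, so two of them can crunch numbers at the same time, just like two worker instances would.

import 'dart:isolate';

int burn(int n) {
  // A purely synchronous, CPU-bound computation.
  var acc = 0;
  for (var i = 0; i < n; i++) {
    acc = (acc + i) & 0x7fffffff;
  }
  return acc;
}

Future<void> main() async {
  const n = 200000000;
  final sw = Stopwatch()..start();

  // Two "workers": each Isolate.run spawns an isolate with its own event
  // loop, so the two computations overlap instead of queuing.
  final results = await Future.wait([
    Isolate.run(() => burn(n)),
    Isolate.run(() => burn(n)),
  ]);

  // Total elapsed time is close to a single computation, not twice it.
  print('results=$results, elapsed=${sw.elapsedMilliseconds}ms');
}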

d-markey commented 11 months ago

BTW does it work better with the conditionally imported marshalers?

I believe that's the only way to go when you need web support. In the JS world, marshalers are a necessary evil, and I believe performance issues are mostly caused by serialization/deserialization. Transferring the byte buffer should be fast, but building it efficiently might be difficult, especially if you cannot determine the right buffer size from the start (to avoid having a list of bytes that needs to grow periodically during the serialization process). Conversely, when the byte buffer is received, deserializing it efficiently can be even more difficult, for instance if your data contains lists with many objects: each object will need to be rehydrated, requiring many instantiations. I don't know the shape/size of your data but I suspect that's the reason you saw such a degradation in performance.
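
To make the buffer-sizing point concrete, a small self-contained sketch (plain Dart, hypothetical data shape): pre-allocating the output buffer avoids the repeated growth and copying that a growable List<int> goes through during serialization.

import 'dart:typed_data';

Uint8List serializeGrowing(List<int> values) {
  final bytes = <int>[]; // growable list: reallocates as it grows
  for (final v in values) {
    bytes.add(v & 0xff);
    bytes.add((v >> 8) & 0xff);
    bytes.add((v >> 16) & 0xff);
    bytes.add((v >> 24) & 0xff);
  }
  return Uint8List.fromList(bytes); // one more full copy at the end
}

Uint8List serializePreallocated(List<int> values) {
  // The final size is known up front: 4 bytes per value.
  final bytes = Uint8List(values.length * 4);
  final view = ByteData.view(bytes.buffer);
  for (var i = 0; i < values.length; i++) {
    view.setUint32(i * 4, values[i], Endian.little);
  }
  return bytes; // single allocation, no intermediate copies
}

void main() {
  final values = List<int>.generate(1000000, (i) => i);
  final sw = Stopwatch()..start();
  serializeGrowing(values);
  print('growing: ${sw.elapsedMilliseconds}ms');
  sw
    ..reset()
    ..start();
  serializePreallocated(values);
  print('preallocated: ${sw.elapsedMilliseconds}ms');
}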

Regarding the stub file, I prefer this approach to make sure that, if the platform is not supported for some reason, the implementations throw. I haven't had time to look at WASM, for instance; maybe it would require another kind of implementation if it's not just a "flavor" of the browser platform.

AlexDochioiu commented 11 months ago

BTW does it work better with the conditionally imported marshalers?

Yes. Before moving to squadron I had a custom implementation for isolate-based workers+pools (no web support). The current performance with conditionally imported marshalers (using worker pool) is basically the same as what I had previously with my own implementation.

Thanks for all the help!

d-markey commented 11 months ago

One more question if you don't mind. Is it possible to configure the web entrypoint? I am not a big fan of the default one

EntryPoint $getWalletTasksActivator() =>
    'lib/workers/wallet_tasks.web.g.dart.js';

Also, what did you not like about this entry point? You can customize the output folder to match your web server deployment (just provide a baseUrl argument to @SquadronService()). Then squadron_builder simply appends the ".js" extension to the generated Dart file name, e.g. "my_service.web.g.dart". You can customize the generated Dart file name in the build.yaml configuration file (see build_runner's build_extensions option).

AlexDochioiu commented 11 months ago

In order to get it working I had to change

EntryPoint $getWalletTasksActivator() =>
    'lib/workers/wallet_tasks.web.g.dart.js';

to

EntryPoint $getWalletTasksActivator() =>
    '/assets/packages/my_package_name/workers/wallet_tasks.web.g.dart.js';

Otherwise I get a 404 Not Found on the js file when I run on the web. Also, I cannot just add a prefix because I need to remove the lib/ prefix from the auto-generated string.

d-markey commented 11 months ago

Then try with @SquadronService(baseUrl: '/assets/packages/my_package_name/workers')

AlexDochioiu commented 11 months ago

Oh yes, this does the trick. I was under the impression that the baseUrl would just add a prefix to whatever was there, as opposed to fully replacing the path prefix of the file.

d-markey commented 11 months ago

Hello again,

So that solves it for cross-platform apps: the way to go is to conditionally provide marshalers, with web marshalers doing the actual work and VM marshalers doing nothing.

For VM only, no marshaling is needed.

When it must run in browsers, because of the structured clone requirement, the only options today are to implement marshalers or to operate on bare Lists and Maps of nums, Strings and bools.

I've played a bit with this and tried this simple test (no marshaling involved):

  @SquadronMethod()
  Future<Map<String, dynamic>> native(Map<String, dynamic> data) async => data;

I've tried with two different data structure implementations:

    const keys = 'abcdefghijklmnopqrstuvwxyz';

    final dataIn = Map.fromIterables(
        List.generate(keys.length, (index) => keys[index]),
        List.generate(keys.length, (index) => List.filled(index * 1000, 0)));

and

    final dataIn2 = Map.fromIterables(
        List.generate(keys.length, (index) => keys[index]),
        List.generate(keys.length, (index) => Uint32List(index * 1000)));

The test makes 100 calls to the native method: for dataIn (using List<int>), it took 10-20 seconds to run. For dataIn2 (using Uint32List), execution time was cut down to ~0.5 second. So it seems the performance of transferring data depends a lot on the shape of the data structure and the underlying types.

AlexDochioiu commented 11 months ago

The test makes 100 calls to the native method: for dataIn (using List<int>), it took 10-20 seconds to run. For dataIn2 (using Uint32List), execution time was cut down to ~0.5 second. So it seems the performance of transferring data depends a lot on the shape of the data structure and the underlying types.

Those are the times when passing data to/from a web worker, right? (not for the VM)

I am marshaling all of my custom data to Uint8List, so I think/hope it should perform as well as or better than Uint32List. I don't see any slow operations on either web or VM at the moment, so I think I'm all good.