dispatchrun / dispatch-py

Python package to develop applications with Dispatch.
https://pypi.org/project/dispatch-py/
Apache License 2.0

Use dill for serialization #121

Closed chriso closed 7 months ago

chriso commented 8 months ago

This PR introduces dill for serialization of coroutine state, replacing pickle from the standard library.

From the dill README:

dill can pickle the following standard types:
- none, type, bool, int, float, complex, bytes, str,
- tuple, list, dict, file, buffer, builtin,
- Python classes, namedtuples, dataclasses, metaclasses,
- instances of classes,
- set, frozenset, array, functions, exceptions

dill can also pickle more 'exotic' standard types:
- functions with yields, nested functions, lambdas,
- cell, method, unboundmethod, module, code, methodwrapper,
- methoddescriptor, getsetdescriptor, memberdescriptor, wrapperdescriptor,
- dictproxy, slice, notimplemented, ellipsis, quit

dill cannot yet pickle these standard types:
- frame, generator, traceback

dill also provides the capability to:
- save and load Python interpreter sessions
- save and extract the source code from functions and classes
- interactively diagnose pickling errors

Dispatch already supports serializing coroutines (including generators) and their frames, so dill's inability to pickle frames and generators is a non-issue.

The fact that dill can serialize cell vars means that this PR fixes https://github.com/stealthrocket/dispatch-py/issues/117.
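As a rough illustration of the cell-var limitation, the sketch below tries to serialize a nested function that closes over a variable. `make_adder` is a hypothetical stand-in, not the code from the linked issue, and the dill branch only runs if dill is installed.

```python
import pickle

try:
    import dill  # third-party; optional for this sketch
except ImportError:
    dill = None


def make_adder(n):
    # `n` is captured in a cell variable of the nested function.
    def add(x):
        return x + n

    return add


add5 = make_adder(5)

try:
    pickle.dumps(add5)
    pickle_ok = True
except (pickle.PicklingError, AttributeError):
    # pickle stores functions by qualified name and cannot look up a local one.
    pickle_ok = False

if dill is not None:
    # dill serializes the function by value, including its closure cells.
    restored = dill.loads(dill.dumps(add5))
    dill_result = restored(1)
else:
    dill_result = None
```

The exception type raised by pickle differs between Python versions, which is why both are caught.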

One thing I like about dill is the built-in tracing. The DISPATCH_TRACE environment variable can be used to enable dill tracing. Below is an example trace when serializing the state of the functions from https://github.com/stealthrocket/dispatch-py/issues/117.

Example trace:

```
┬ T4:
└ # T4 [31 B]
┬ D2:
├┬ D2:
│├┬ T4:
││└ # T4 [16 B]
│├┬ D2:
[DISPATCH] Serializing DurableCoroutine(main..main):
function = main..main (/home/chris/Documents/dispatch-py/fail.py:16)
code hash = sha256:dbaf58eb0631ab76bf33303a556635379de62063f03365323a3a3d7d5d2c1a83
args = ()
kwargs = {}
wrapped coroutine = None
frame state = -1
IP = 44
SP = 4
stack[0] =
stack[1] = NULL
stack[2] =
stack[3] = DurableCoroutineWrapper(Function._call_async)
││├┬ T4:
│││└ # T4 [62 B]
││├┬ D2:
│││├┬ D2:
││││├┬ D2:
│││││└ # D2 [2 B]
││││└ # D2 [197 B]
│││├┬ D2:
││││├┬ Ce1:
│││││├┬ F2:
││││││└ # F2 [30 B]
│││││├┬ T4:
││││││└ # T4 [33 B]
│││││├┬ D2:
││││││├┬ Me1: >
│││││││├┬ T1:
││││││││├┬ F2:
│││││││││└ # F2 [17 B]
││││││││└ # T1 [34 B]
│││││││├┬ T4:
││││││││└ # T4 [22 B]
│││││││├┬ D2:
││││││││├┬ T4:
│││││││││└ # T4 [64 B]
││││││││├┬ D2:
│││││││││└ # D2 [172 B]
││││││││└ # D2 [302 B]
│││││││└ # Me1 [371 B]
││││││├┬ T4:
│││││││└ # T4 [29 B]
││││││├┬ D2:
│││││││└ # D2 [83 B]
││││││├┬ D2:
│││││││├┬ D2:
││││││││└ # D2 [121 B]
│││││││└ # D2 [146 B]
││││││└ # D2 [768 B]
│││││└ # Ce1 [842 B]
[DISPATCH] Serializing DurableCoroutineWrapper(Function._call_async):
function = Function._call_async (/home/chris/Documents/dispatch-py/src/dispatch/function.py:102)
code hash = sha256:16c439fd2da61359756fb7d093d125c96dfba71f90cfdc99e25f806dc4b60d6b
args = (,)
kwargs = {}
wrapped coroutine = DurableCoroutine(Function._call_async)
frame state = -1
IP = 102
SP = 4
stack[0] =
stack[1] = ()
stack[2] = {}
stack[3] =
││││├┬ T4:
│││││└ # T4 [23 B]
││││├┬ D2:
│││││├┬ D2:
││││││├┬ D2:
│││││││└ # D2 [2 B]
││││││└ # D2 [30 B]
[DISPATCH] Serializing DurableCoroutine(Function._call_async):
function = Function._call_async (/home/chris/Documents/dispatch-py/src/dispatch/function.py:102)
code hash = sha256:16c439fd2da61359756fb7d093d125c96dfba71f90cfdc99e25f806dc4b60d6b
args = (,)
kwargs = {}
wrapped coroutine = None
frame state = -1
IP = 102
SP = 4
stack[0] =
stack[1] = ()
stack[2] = {}
stack[3] =
│││││├┬ D2:
││││││├┬ D2:
│││││││├┬ D2:
││││││││└ # D2 [2 B]
│││││││└ # D2 [30 B]
││││││├┬ D2:
│││││││├┬ D2:
││││││││└ # D2 [2 B]
│││││││├┬ T4:
││││││││└ # T4 [30 B]
│││││││├┬ D2:
[DISPATCH] Serializing DurableGenerator(call):
function = call (/home/chris/Documents/dispatch-py/src/dispatch/coroutine.py:9)
code hash = sha256:4a80fb324f937fc1f3d2f33d15cb96ba73a0311cdd8371375856b5bbe256b16f
args = (Call(function='main..sub1', input=Arguments(args=(), kwargs={}), endpoint='http://host.docker.internal:8000/', correlation_id=1),)
kwargs = {}
wrapped coroutine = None
frame state = -1
IP = 8
SP = 1
stack[0] = Call(function='main..sub1', input=Arguments(args=(), kwargs={}), endpoint='http://host.docker.internal:8000/', correlation_id=1)
││││││││├┬ D2:
│││││││││├┬ D2:
││││││││││├┬ T4:
│││││││││││└ # T4 [26 B]
││││││││││├┬ D2:
│││││││││││├┬ T4:
││││││││││││└ # T4 [16 B]
│││││││││││├┬ D2:
││││││││││││├┬ D2:
│││││││││││││└ # D2 [2 B]
││││││││││││└ # D2 [11 B]
│││││││││││└ # D2 [82 B]
││││││││││├┬ D2:
│││││││││││└ # D2 [2 B]
││││││││││└ # D2 [280 B]
│││││││││├┬ D2:
││││││││││└ # D2 [35 B]
│││││││││└ # D2 [326 B]
││││││││└ # D2 [401 B]
│││││││└ # D2 [477 B]
││││││└ # D2 [518 B]
│││││├┬ D2:
││││││└ # D2 [44 B]
│││││└ # D2 [608 B]
││││└ # D2 [1 MiB]
│││└ # D2 [1 MiB]
││├┬ T4:
│││└ # T4 [17 B]
││├┬ D2:
│││└ # D2 [22 B]
││└ # D2 [1 MiB]
│└ # D2 [1 MiB]
└ # D2 [2 MiB]
```

Although the size of the outermost object is reported as `2 MB`, the serialized state in this case is ~2KB, which is only slightly larger than the equivalent state when using `pickle`. It's not a fair comparison though; pickle cannot serialize cell vars, and so I need to move the functions to the top-level in order to compare state.
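For reference, here is a minimal sketch of how an environment variable such as DISPATCH_TRACE could switch on dill's tracing. The helper `trace_enabled` and the set of values treated as "off" are assumptions for illustration, not the parsing Dispatch actually uses; `dill.detect.trace` is dill's own toggle.

```python
import os


def trace_enabled(environ=os.environ):
    # Hypothetical parsing: any value other than empty, "0" or "false" enables tracing.
    value = environ.get("DISPATCH_TRACE", "")
    return value.lower() not in ("", "0", "false")


if trace_enabled():
    try:
        import dill.detect  # third-party; skipped if dill is not installed
    except ImportError:
        pass
    else:
        # Makes dill log each object it visits while pickling.
        dill.detect.trace(True)
```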

The library also provides tooling for inspecting state offline, which may come in handy in future.

chriso commented 8 months ago

One question I would like to validate is whether dill supports custom serialization via `__reduce__`, which would be needed to benefit from changes like https://github.com/encode/httpx/pull/3108
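For context, this is roughly what custom serialization via `__reduce__` looks like with the standard library's pickle. `Client` is a hypothetical class invented for this sketch and is not taken from httpx. Validating the question above would mean running the same round trip through dill.

```python
import pickle
import threading


class Client:
    """Hypothetical client that holds a lock, which pickle cannot serialize."""

    def __init__(self, base_url):
        self.base_url = base_url
        self._lock = threading.Lock()

    def __reduce__(self):
        # Tell the pickler to rebuild the object by calling Client(base_url),
        # so the lock is recreated on load rather than serialized.
        return (Client, (self.base_url,))


original = Client("https://example.com")
restored = pickle.loads(pickle.dumps(original))
```

Without the `__reduce__` method, `pickle.dumps(original)` would fail on the lock attribute.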

Let's hold off on merging this until we have a better idea of the capabilities and trade-offs of dill, or until we have more users running into serialization issues and need a short-term fix. I'd like to explore whether pickle's dispatch tables could be used to solve https://github.com/stealthrocket/dispatch-py/issues/94, or whether dill provides an alternative solution there.
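To make the dispatch-table idea concrete, here is a small sketch using only the standard library. It assumes the underlying problem is a type that pickle rejects by default. `memoryview` is used purely as a stand-in, and `reduce_memoryview` and `dumps_with_table` are hypothetical helpers, not part of Dispatch.

```python
import copyreg
import io
import pickle


def reduce_memoryview(view):
    # Rebuild from a bytes copy; the restored view is read-only.
    return (memoryview, (view.tobytes(),))


def dumps_with_table(obj):
    buffer = io.BytesIO()
    pickler = pickle.Pickler(buffer)
    # A per-pickler table leaves the global copyreg registry untouched.
    pickler.dispatch_table = copyreg.dispatch_table.copy()
    pickler.dispatch_table[memoryview] = reduce_memoryview
    pickler.dump(obj)
    return buffer.getvalue()


state = {"payload": memoryview(b"abc")}

try:
    pickle.dumps(state)
    plain_pickle_ok = True
except TypeError:
    # pickle has no built-in way to serialize a memoryview.
    plain_pickle_ok = False

restored = pickle.loads(dumps_with_table(state))
```

The appeal of this approach is that the reduction function lives with the serializer, so the unpicklable type does not need to be modified.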

Another question that came to mind: should we replace other uses of pickle with dill (for example)?

Let's stick with pickle for now in the dispatch.proto file, since we're only using it to serialize input and output values which are less likely to include the "exotic" objects that dill is able to serialize.

chriso commented 7 months ago

This is out of sync. I'll reboot if/when necessary.