cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

Overhead from serialization contributing to task execution latency #3901

Open BarrySlyDelgado opened 1 month ago

BarrySlyDelgado commented 1 month ago

@gpauloski noted that there is an import overhead when using Python tasks. From the example in #3892 there is an apparent import overhead of about 1 task/sec. This may be an issue in how we serialize functions for distribution, which I may not understand entirely.

Currently, we use cloudpickle to serialize Python functions and arguments. There are some nuances regarding different serialization modules that are worth discussing. If we serialize the function below, deserializing it will cause its imports to run:

import matplotlib.pyplot as plt

def func(x):
    if x == 1:
        return x
    else:
        plt.plot(1,1)  # depends on the module-level import
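
For reference, a minimal sketch of the serialization step (the filename is illustrative; this assumes a file-based handoff like the load.py script below):

import cloudpickle

# cloudpickle records func's dependency on matplotlib.pyplot, so the
# import is replayed when the payload is deserialized.
with open("func.pkl", "wb") as f:
    cloudpickle.dump(func, f)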

We see this if we try to deserialize without the relevant module in our environment:

Traceback (most recent call last):
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/load.py", line 10, in <module>
    x = cloudpickle.load(f)
        ^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/cloudpickle/cloudpickle.py", line 457, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'matplotlib'

If we comment out the relevant import, deserialization is not an issue; the failure only occurs if execution takes the branch that uses the import:

#import matplotlib.pyplot as plt

def func(x):
    if x == 1:
        return x
    else:
        plt.plot(1,1)

x = cloudpickle.load(f)
x(0)

Traceback (most recent call last):
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/load.py", line 11, in <module>
    x(0)
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/test.py", line 15, in func
    plt.plot(1,1)
    ^^^
NameError: name 'plt' is not defined

From my perspective, this is the preferred failure case.

The first example above also increases the latency of deserializing the function:

109.98052644492145 deserializations/s

Versus the second example:

4636.137402126293 deserializations/s
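
A hypothetical sketch of the kind of micro-benchmark that would produce these numbers (the filename and iteration count are illustrative):

import time
import cloudpickle

with open("func.pkl", "rb") as f:
    payload = f.read()

# Deserialize the same payload repeatedly and report throughput.
n = 1000
start = time.perf_counter()
for _ in range(n):
    cloudpickle.loads(payload)
elapsed = time.perf_counter() - start
print(n / elapsed, "deserializations/s")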

With dill, this is not an issue in either case if the function is defined in __main__:

4605.406487074855 deserializations/s

However, dill behaves differently for functions defined outside the __main__ module, similar to cloudpickle. For example, if func is defined outside __main__:

  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/load.py", line 10, in <module>
    x = dill.load(f)
        ^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/dill/_dill.py", line 289, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/dill/_dill.py", line 444, in load
    obj = StockUnpickler.load(self)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/miniconda3/envs/cenv/lib/python3.12/site-packages/dill/_dill.py", line 434, in find_class
    return StockUnpickler.find_class(self, module, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/g/g16/slydelgado1/bslydelg_cctools/cctools/taskvine/src/bindings/serz.py", line 2, in <module>
    import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'

Additionally, switching from cloudpickle to dill does not necessarily improve latency. From the example in #3892:

cloudpickle:

16.100728164861362 tasks/s

dill:

10.413069189953614 tasks/s

benclifford commented 1 month ago

This isn't a solution, but I've spent a lot of time in the last year or so getting a much deeper understanding of what's happening with Python's pickle-style serialization, so here are some notes:

dill has some subtle ways in which it changes what it serializes for various objects, based on an internal heuristic: "do I think you'll be able to import this module remotely?" If yes, it serializes the equivalent of import that.thing as x. If not, it can end up serializing entire module definitions (including, in one degenerate case, large parts of the Python standard library...) in a fairly hardcore attempt to let you recreate the content remotely. In the situations I've encountered, this choice is most often affected by the path of the file in which the object (or related objects) is defined. The difference between __main__ and not-__main__ above is one of the situations where I encounter this quite a lot.

If you're in serverless mode, import statements are going to amortize across all tasks (giving hundreds of tasks/sec with WQ + coprocesses), no matter where the import happens. So in Parsl we haven't been pursuing a direction of "avoid imports"; instead we've been moving in the directions of "keep using the same Python process, so that the import probably happened already" and "we know the import will work, because we require the user to have a compatible environment remotely". (That might not align with the direction of Task Vine, though.)

benclifford commented 1 month ago

(for noticing the "serializing far too much" situation with dill, it's usually pretty apparent from the byte count of the serialized object)
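
A quick sketch of that check (the function here is illustrative):

import dill

def func(x):
    return x + 1

# If dill decides the defining module can't be imported remotely, it may
# serialize module contents by value, and the byte count jumps sharply.
print(len(dill.dumps(func)), "bytes")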

dthain commented 1 month ago

Attempting to separate and clear out various issues...

@BarrySlyDelgado is there still a distinct problem to solve here?

benclifford commented 1 month ago

For the question:

is there still a distinct problem to solve here?

For serverless mode:

If you're forking a fresh process for each task, even after performing a deserialization, there are a few Parsl use patterns that might clash with that (compared to how things work with Parsl's High Throughput Executor, which reuses worker processes). These may or may not be relevant specifically to this issue and/or to Parsl + TaskVine, but here's a quick brain dump.

1) Traditionally we have encouraged people to put import statements inside their task definition, rather than hoping imports come from the surrounding defining code; see the sketch below. (This comes from serialization modes where whichever serializer is in use does not capture those imports.)

I think that is on topic for this issue.
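
A minimal sketch of that pattern (the task function and its dependency are illustrative):

def analyze(data):
    # Importing inside the task body resolves the dependency on the
    # worker at call time, rather than capturing it at serialization time.
    import numpy as np
    return np.mean(data)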

The next two are related behaviours but aren't necessarily on topic for this issue #3901; I'd still like to write them down "somewhere":

2) Some users have a model of amortizing other expensive initializations that aren't imports by having some other "worker local / task global" cache. Possibly @wardlt is involved in this - I know we've talked about it before.

3) proxystore use (@gpauloski), where a proxy is deserialized inside a worker task "on demand", might see some performance degradation there. I'm not entirely clear about how much sharing proxystore does in different places.

WardLT commented 1 month ago

possibly @WardLT is involved in this - I know we've talked about it before

We might have talked about this: https://github.com/ExaWorks/warmable-function-calls

I use caches to keep some objects in memory between calls in a few of our applications. Sometimes that's explicit in the function code (as in the examples above), and other times it happens because I'm passing data by reference using proxystore, and proxystore maintains the cache.
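
A sketch of the explicit-cache variant (load_model is a hypothetical expensive initializer):

# Module-level state survives between calls for as long as the worker
# process is reused.
_CACHE = {}

def run_inference(model_path, x):
    if model_path not in _CACHE:
        _CACHE[model_path] = load_model(model_path)  # hypothetical expensive init
    return _CACHE[model_path](x)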

BarrySlyDelgado commented 1 month ago

@dthain I think the problem, broadly, is performing only the necessary imports.