BarrySlyDelgado opened 1 month ago
This isn't a solution, but I've spent a lot of time over the last year or so getting a much deeper understanding of what happens with Python's pickle-style serialization, so here are some notes:

`dill` has some subtle ways in which it changes what it serializes for various objects, based on an internal heuristic: "do I think you'll be able to import this module remotely?" If yes, it serializes the equivalent of `import that.thing as x`. If not, it can end up serializing entire module definitions (including, in one degenerate case, large parts of the Python standard library...) in a fairly hardcore attempt to let you recreate the content remotely. In the situations I've encountered, this choice is most often affected by the path of the file in which the object (or related objects) is defined. The difference between `__main__` and not-`__main__` described above is one of the situations where I encounter this a lot.
If you're in serverless mode, `import` statements are going to amortize across all tasks (giving hundreds of tasks/sec with WQ + coprocesses), no matter where the import happens. So in Parsl we haven't been pursuing a direction of "avoid imports"; instead we've been moving in the direction of "keep using the same Python process, so that the import has probably already happened" and "we know the import will work, because we require the user to have a compatible environment remotely". (That might not align with the direction of TaskVine, though.)
(For noticing the "serializing far too much" situation with `dill`, it's usually pretty apparent from the byte count of the serialized object.)
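As a quick illustration of why the byte count is telling (using stdlib `pickle` only; `dill`'s by-reference path behaves similarly for importable module functions), a by-reference payload is just a module name plus a qualified name, so an oversized payload stands out. The size threshold below is an arbitrary assumption, not anything from dill:

```python
import pickle
import json

# A module-level function pickles as a reference (module + qualified name),
# so the payload is tiny regardless of how big the json module itself is.
payload = pickle.dumps(json.dumps)
print(len(payload))  # a few dozen bytes

# Hypothetical guard: flag payloads that look like they captured whole
# module contents by value rather than a reference.
SUSPICIOUS_BYTES = 64 * 1024  # arbitrary assumption
if len(payload) > SUSPICIOUS_BYTES:
    print("serializer may be embedding module definitions, not references")
```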
Attempting to separate and clear out various issues...
@BarrySlyDelgado is there still a distinct problem to solve here?
For serverless mode: #3902 does the deserialization (and implied imports) prior to forking, and so later function calls have much lower latency.
For PythonTask mode, there really isn't any avoiding the cost of doing (necessary) imports once. Conceivably we could modify serialization to do less work under certain constraints, but I am loath to depart from standard, simple approaches.
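A minimal stdlib sketch of that fork-after-deserialize idea (this is not the actual #3902 implementation, just the shape of it; POSIX-only because it uses `os.fork`, and `json.dumps` stands in for a submitted task function):

```python
import json
import os
import pickle

# Deserialize the task function (and pay any implied import cost) once,
# in the long-lived parent process...
payload = pickle.dumps(json.dumps)  # stand-in for a submitted task function
fn = pickle.loads(payload)

# ...then fork a cheap child per call; children inherit the warm state,
# so later function calls avoid the import latency.
pid = os.fork()
if pid == 0:  # child: run the task, no re-import needed
    ok = fn({"task": 1}) == '{"task": 1}'
    os._exit(0 if ok else 1)
_, status = os.waitpid(pid, 0)
print("child exit status:", os.WEXITSTATUS(status))
```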
For the question:

> is there still a distinct problem to solve here?
For serverless mode: if you're forking fresh for each task, even after performing deserialization up front, there are a few Parsl usage patterns that might clash with that (compared to how Parsl's High Throughput Executor works, which reuses worker processes). Maybe or maybe not relevant specifically to this issue and/or to Parsl + TaskVine, but here's a quick brain dump.
1) Traditionally we have encouraged people to put `import` statements inside their task definition, rather than hoping imports come from the surrounding defining code. (This comes from serialization modes where whichever serializer is in use does not capture those imports.)
I think that is on topic for this issue.
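A sketch of that imports-inside-the-task-body pattern (the function name and body are illustrative, not from Parsl):

```python
def estimate_pi(n: int) -> float:
    # The import lives inside the task body, so the function is
    # self-contained after deserialization: the remote worker does not
    # rely on the defining module's top-level imports being captured.
    import random

    random.seed(0)  # deterministic for the example
    hits = sum(
        1
        for _ in range(n)
        if random.random() ** 2 + random.random() ** 2 <= 1.0
    )
    return 4.0 * hits / n

print(estimate_pi(10_000))  # an approximation of pi
```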
The next two are related behaviours but aren't necessarily on topic for this issue #3901, but I'd like to write them down "somewhere":
2) Some users have a model of amortizing other expensive initializations that aren't imports by having some other "worker-local / task-global cache" - possibly @WardLT is involved in this - I know we've talked about it before.
3) proxystore use (@gpauloski), where a proxy is deserialized inside a worker task "on demand", might see some performance degradation there - I'm not entirely clear how much sharing proxystore does in different places.
> possibly @WardLT is involved in this - I know we've talked about it before

We might have talked about this: https://github.com/ExaWorks/warmable-function-calls
I use caches to keep some objects in memory between calls in a few of our applications. Sometimes that's explicitly in the function code (as in the above examples), and other times it happens because I'm passing data by reference using proxystore and proxystore maintains the cache.
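A stdlib-only sketch of that worker-local / task-global cache pattern (the names and the fake "model" are illustrative; this is not proxystore's API):

```python
import functools

@functools.lru_cache(maxsize=None)
def get_model(path: str) -> dict:
    # Stand-in for an expensive initialization (loading weights, opening
    # a connection, ...). This runs once per worker process per distinct
    # path; later tasks in the same process reuse the cached object.
    print(f"loading {path}")  # fires only on the first call
    return {"path": path, "weights": [0.0] * 3}

def task(x: float) -> float:
    model = get_model("weights.bin")  # cached after the first task
    return x + sum(model["weights"])

print(task(1.0), task(2.0))  # "loading weights.bin" prints only once
```

This only pays off if the worker process survives between calls, which is exactly the tension with a fork-fresh-per-task model.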
@dthain I think the problem is, broadly, performing only the necessary imports.
@gpauloski noted that there is an import overhead when using Python tasks. From the example in #3892 there is an apparent import overhead of 1 task/sec. This may be an issue in how we serialize functions for distribution, which I may not understand entirely.
Currently, we use `cloudpickle` to serialize Python functions and arguments. There are some nuances regarding different serialization modules that are worth discussing. If we serialize the function below, loading the same function will cause imports to happen when it is deserialized. We see this if we try to deserialize without the relevant module in our environment.
If we comment out the relevant imports, this is not an issue unless you branch onto the path that would use the import.
From my perspective, this is the preferred failure case.
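The import-on-deserialize behavior can be demonstrated with stdlib `pickle` alone (for module-level functions, `cloudpickle` serializes by reference the same way; the `sys.modules` juggling below just simulates a fresh worker process):

```python
import pickle
import sys
import json

# Serialize a function that lives in an importable module: the payload
# holds only a reference ("json", "dumps"), not the function's code.
payload = pickle.dumps(json.dumps)

# Simulate a worker process where json has not been imported yet.
del sys.modules["json"]

# Deserializing triggers the import as a side effect...
fn = pickle.loads(payload)
print("json" in sys.modules)  # True: the module was re-imported

# ...and would raise ModuleNotFoundError if the module were absent
# in the remote environment.
```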
The first example above also causes increased latency when deserializing the function, versus the second example.
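One way to measure that deserialization latency with the stdlib (numbers vary by machine; this mirrors the comparison using plain `pickle` rather than `cloudpickle`):

```python
import json
import pickle
import sys
import timeit

payload = pickle.dumps(json.dumps)

# The first loads in a fresh process pays the import cost of any modules
# the payload references.
del sys.modules["json"]
cold = timeit.timeit(lambda: pickle.loads(payload), number=1)

# Subsequent loads only re-resolve the reference; imports are cached.
warm = timeit.timeit(lambda: pickle.loads(payload), number=1000) / 1000

print(f"cold: {cold * 1e6:.1f} us, warm: {warm * 1e6:.1f} us")
```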
With `dill`, this is not an issue in either case if the function is defined in `__main__`:

However, the example has different behavior for functions defined outside the `__main__` module, similar to that of `cloudpickle`. For example, if `func` is defined outside `__main__`:

Additionally, switching `cloudpickle` to `dill` does not necessarily improve latency. From the example in #3892:

`cloudpickle`:

`dill`: