Increase version flexibility of Dask for initial deployment

mrocklin commented 4 years ago

OK, so Dask itself is now relatively robust to different versions of Python and compression.

However, as @jrbourbeau predicted, cloudpickle is not. This stops users from being able to do things like send along lambdas

future = client.submit(lambda x: x + 1, 10)

This turns out to be somewhat debilitating, For example, our basic example fails because top-level Pandas functions themselves are not reliably pickle-serializable (see https://github.com/pandas-dev/pandas/issues/35611). I suspect that this happens in other cases in our examples as well.

So what to do?

Short term we could duplicate every software environment by Python version. Short term, if we're only supporting coiled/default for quickstart purposes then this probably isn't horrible. We would set the default value for the configuration= keyword to coiled/default or coiled/default-37 depending on the Python version at import time, and then change the quickstart to coiled.Cluster() and use that default.

We could stick with coiled install and be explicit about requiring users to install things. I'm somewhat against this as a startup process. It's unpleasant.

Long term I'd like to see us increase hygiene in Dask and upstream about using pickle-serializable functions. This is a good driver for that. If we were very ambitious we could try to modify cloudpickle to be Python-version agnostic, but putting on my cloudpickle-maintainer hat I'll probably vote against that.

Other thoughts? @jrbourbeau ?

jrbourbeau commented 4 years ago

our basic example fails because top-level Pandas functions themselves are not reliably pickle-serializable

That's unfortunate : /

Short term we could duplicate every software environment by Python version

Agreed that short term this seems okay. Using just coiled.Cluster() actually seems like a slight improvement for the quickstart, but would be brushing too much under the rug for most other examples.

Other thoughts?

Note duplicating coiled/default for each Python version will limit what packages can be installed. What immediately comes to mind is py-xgboost in the defaults channel doesn't have a Python 3.8 build. I'm still okay with this as I'd rather have a streamlined quickstart and then have a separate coiled/xgboost software environment.

We might try updating coiled-examples to run on binder for the short-to-medium term. Users would need to do coiled login at the top of each notebook on binder, which isn't great. But that would remove the need for coiled install coiled/xgboost, etc. and would provide a familiar experience for users. Putting on my user hat, I would expect "Try out our other Coiled example notebooks" to point to binder, or some other running notebook server, not an .ipynb notebook file on GitHub

mrocklin commented 4 years ago

Another option would be to make a version of cloudpickle that is version-flexible and then have coiled depend on that, or at least have our default environments install it by default. The codebase has relatively few Python version-specific bits.

I tried this for about half an hour today and didn't manage to get it work, but it's an option.

@llllllllll do you have any thoughts on how hard it would be to make a cloudpickle fork that was generic/dumb enough to support moving most things between Python versions?

llllllllll commented 4 years ago

I think there are a few challenges with cross version compatible cloudpickle. The first is that cloudpickle sends code objects, including the bytecode itself. The actual format of Python bytecode changes across versions, and instructions are added and removed. This might be solvable with some sort of "upgrade/downgrade" converters that can rewrite bytecode from one version to the next. There is code in codetransformer that handles multiple versions of Python which might serve as a good starting point for these converters. Another problem is that the standard library changes in each minor version. For example, namedtuple in 3.8 no longer uses property objects for the named accessors. Instead, a custom type which is written in C is used. This _tuplegetter object implements the pickle protocol, but in a way that requires that the deserializing side is at least 3.8. Finally, other modules may have conditional logic based on the minor version For example, the following code will likely not work as intended when sent to a previous version:

if sys.minor_version < 8:
    def func(data):
        # use fallback code
else:
    def func(data):
        # use code that depends on some 3.8 specific feature

execute(func, data)

My initial strategy for trying this would be: Just use cloudpickle when the client and server are the same version and advertise that this configuration is best supported. When the versions differ, use a different kind of pickle that only supports a much smaller subset of objects, likely just some builtin types, numpy, pandas, dask, and other core structures. When serializing, don't target speed, target backwards compat. For example, don't send some serialized form of the block manager for a dataframe, send a dict of ndarrays and re-hydrate by calling the dataframe constructor. This will minimize the chance that some implementation detail has changed between versions. Implement the bytecode translator to move just 1 minor version at a time and run multiple upgrades downgrades as needed, but probably try to pair the client and server with something close if possible. With all of these changes, I would now be somewhat worried about the serialization cost, but that could be profiled and possibly mitigated. Any user that is really worried about performance would be encouraged to just match the version exactly.

mrocklin commented 4 years ago

The first is that cloudpickle sends code objects, including the bytecode itself

Hrm, that does sound challenging

When serializing, don't target speed, target backwards compat. For example, don't send some serialized form of the block manager for a dataframe, send a dict of ndarrays and re-hydrate by calling the dataframe constructor. This will minimize the chance that some implementation detail has changed between versions

Agreed. I'm not actually that worried about data movement. It's more often an issue with dynamiclly generated code. From what you say above it sounds like this is probably a no-go anyway.

llllllllll commented 4 years ago

The reason to worry about data when sending code is that you need to capture the closure and the globals, which may include some data structures. I assume it is rare to have a large dataframe or ndarray in the globals or closure, so that is likely not a big deal. Another idea for sending code would be to use something like uncompyle6 (possible license issue) to convert the bytecode to source code and then send that over to be recompiled on the server. This seems to work by decompiling the bytecode instead of reading the source, so it will work in a repl or notebook.

FabioRosado commented 3 years ago

Hello, I'm going through the open issues on this repo and closing some of them. I'm assuming that the conversation moved somewhere else, so I'm closing this issue.

coiled / feedback

Increase version flexibility of Dask for initial deployment #37

So what to do?