Open itamarst opened 3 years ago
Hrm, so `dask-worker` does run the worker in a subprocess by default (we create a Nanny, which runs the Worker in a spawned process). However, this isn't universal (some people run without the Nanny process).
Doing the same fix as we did in dask.multiprocessing would be a good first step here, and would probably solve 90% of the problem.
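For reference, the essence of that fix is setting `PYTHONHASHSEED` in each worker subprocess's environment before launch, since the interpreter only reads it at startup. A minimal standalone sketch (the seed value `"6640"` and the `hash_in_subprocess` helper are just for illustration):

```python
import os
import subprocess
import sys

def hash_in_subprocess(seed):
    # PYTHONHASHSEED is only read at interpreter startup, so the fix is
    # to set it in each worker subprocess's environment before launch.
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "print(hash('lalala'))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# With a fixed seed, every "worker" agrees on the string hash:
print(hash_in_subprocess("6640") == hash_in_subprocess("6640"))  # True
# With randomization enabled, separate processes generally disagree.
```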
Going beyond that though, I'm not sure. We could consider alternative hashing functions. Are there ways to specify the hash seed programmatically somehow?
There's no public API for setting it after startup, which makes sense given what it does. Only thing I can think of is something like
```python
import os
import sys

if os.environ.get("PYTHONHASHSEED") != "6640":
    os.environ["PYTHONHASHSEED"] = "6640"
    # Re-exec the interpreter so the new seed takes effect at startup.
    os.execv(sys.executable, [sys.executable] + sys.argv)
```
in the startup code (which won't work on Windows...).
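To see why the re-exec is needed at all: setting the variable in an already-running process has no effect, because the interpreter fixed the seed at startup. A quick check:

```python
import os

# Hash of a string before changing the env var...
before = hash("lalala")
os.environ["PYTHONHASHSEED"] = "6640"
# ...is identical after: the seed was already baked in at startup.
after = hash("lalala")
print(before == after)  # True
```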
For the Distributed-specific issue, is non-nanny mode actually useful, or could it be dropped?
Stepping back to a more fundamental solution, something like:

```python
with set_hash_seed(6640):
    h = hash(obj)
```
is maybe possible... but there are no public APIs, so you have to be OK with munging private CPython internals. And it's global state, so it would affect hashing in other threads too, which would e.g. break dictionaries there. Might be better to get a PEP through so that it's possible in a future Python.
Another approach is a reimplemented hash algorithm. For example https://deepdiff.readthedocs.io/en/latest/deephash.html. The question here is how good it is at hashing arbitrary objects, would need to do some digging/reading/testing. Basic sense of what it supports: https://github.com/seperman/deepdiff/blob/master/deepdiff/deephash.py#L429
deephash seems promising:
```python
from deepdiff import DeepHash

class CustomHash:
    def __init__(self, x):
        self.x = x

    def __repr__(self):
        return f"CustomHash({self.x})"

objects = [
    125,
    "lalala",
    (12, 17),
    CustomHash(17),
    CustomHash(17),
    CustomHash(25),
]

for o in objects:
    print(repr(o), "has hash", DeepHash(o)[o])
```
Results in:
```
$ python example.py
125 has hash 08823c686054787db6befb006d6846453fd5a3cbdaa8d1c4d2f0052a227ef204
'lalala' has hash 2307d4a08f50b743ec806101fe5fc4664e0b83c6c948647436f932ab38daadd3
(12, 17) has hash c7129efa5c7d19b0efabda722ae4cd8d022540b31e8160f9c181d163ce4f1bf6
CustomHash(17) has hash 477631cfc0315644152933579c23ed0c9f27743d632613e86eb7e271f6fc86e4
CustomHash(17) has hash 477631cfc0315644152933579c23ed0c9f27743d632613e86eb7e271f6fc86e4
CustomHash(25) has hash 8fa8427d493399d6aac3166331d78e9e2a2ba608c2b3c92a131ef0cf8882be14
```
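The reimplemented-hash approach doesn't require deepdiff specifically; the key property is that the digest depends only on the value, never on `PYTHONHASHSEED`. A minimal sketch of that idea for a few basic types (the `stable_hash` helper is hypothetical, for illustration only, and is far less general than DeepHash):

```python
import hashlib

def stable_hash(obj):
    """Deterministic hash for a few basic types, independent of PYTHONHASHSEED."""
    # Tag each value with its type so e.g. 125 and "125" don't collide.
    if isinstance(obj, int):
        data = b"int:" + str(obj).encode()
    elif isinstance(obj, str):
        data = b"str:" + obj.encode()
    elif isinstance(obj, tuple):
        data = b"tuple:" + b",".join(stable_hash(x).encode() for x in obj)
    else:
        raise TypeError(f"unsupported type: {type(obj).__name__}")
    return hashlib.sha256(data).hexdigest()

# Same input always yields the same digest, in any process:
print(stable_hash((12, 17)) == stable_hash((12, 17)))  # True
```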
Moved discussion of Dask-side fix to https://github.com/dask/dask/issues/6723
There really ought to be a warning about this in the documentation. It would be one thing if this raised an exception and just broke your application, but instead it unexpectedly gives completely wrong results.
If it's impossible to come up with a good hash function in two and a half years, maybe the solution is to just give `groupby` a `hash_function` argument. This could default to `hash()` for data where it's safe to use (like integers), and otherwise raise an exception if it's not supplied.
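A sketch of what that API might look like (the `hash_function` parameter and both helper functions are hypothetical, not real Dask API):

```python
def safe_default_hash(x):
    # hash() is stable across processes only for types whose hash does
    # not depend on PYTHONHASHSEED (ints, floats, bools, None).
    if isinstance(x, (int, float, bool)) or x is None:
        return hash(x)
    raise TypeError(
        f"hash() is not stable across processes for {type(x).__name__}; "
        "pass an explicit hash_function"
    )

def groupby_key(value, npartitions, hash_function=safe_default_hash):
    """Pick a partition for a groupby key (illustrative only)."""
    return hash_function(value) % npartitions

print(groupby_key(125, 4))       # ints are safe by default -> 1
# groupby_key("lalala", 4) would raise TypeError unless a
# deterministic hash_function is supplied explicitly.
```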
For those following this, noting that I've just revived the discussion over in https://github.com/dask/dask/issues/6723#issuecomment-1848010871.
> Doing the same fix as we did in dask.multiprocessing would be a good first step here, and would probably solve 90% of the problem.
I'm gradually coming to understand the problem/solution space here, and realized that this basic fix has not yet been implemented. I agree it would solve a large percentage of use cases, so I will work up a PR for others to consider.
As per https://github.com/dask/dask/issues/6640, Dask breaks if Python hashing is inconsistent across workers. This appears to be a bug in the Distributed backend as well:
hashing.py:
When run:
Solving this
The solution for Dask (https://github.com/dask/dask/pull/6660) was to set `PYTHONHASHSEED` for worker processes. You can do a similar thing for Distributed in some cases, e.g. with `Client()`. However, I'm pretty sure the `dask-worker` CLI just runs the worker inline, not in a subprocess, so by the time you're running Python code the hash seed has already been set and you can't change it. It could e.g. set it and then fork()+exec() Python again, I suppose.