Open rabernat opened 6 years ago
Hm, not too much to go on there, except that it's clearly trying to re-authenticate. I wonder if we could be doing a better job of caching the GCSFileSystem instances in a given worker, or if this is just a too-many-concurrent-requests kind of thing. In any case, I would first suggest trying to throttle the number of workers that are writing, to see if that helps.
> I would first suggest trying to throttle the number of workers that are writing, to see if that helps.
Can I accomplish that using a write lock?
Certainly, but then you would lose parallelism. Perhaps a distributed Variable would allow you to limit the number of workers/threads (@mrocklin, suggestions?)
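The throttling idea can be sketched with stdlib threading; the same pattern applies on a cluster via dask.distributed's Semaphore (or a Lock for full serialization, at the cost of all write parallelism). `MAX_CONCURRENT_WRITES` and `throttled_write` are illustrative names, not part of any of these libraries:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: allow at most 2 simultaneous "writes".
MAX_CONCURRENT_WRITES = 2
write_slots = threading.BoundedSemaphore(MAX_CONCURRENT_WRITES)

in_flight = 0
peak = 0
counter_lock = threading.Lock()

def throttled_write(chunk_id: int) -> int:
    """Pretend to upload one chunk while holding a write slot."""
    global in_flight, peak
    with write_slots:  # blocks while MAX_CONCURRENT_WRITES writes are running
        with counter_lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # stand-in for the real upload to GCS
        with counter_lock:
            in_flight -= 1
    return chunk_id

with ThreadPoolExecutor(max_workers=8) as pool:
    results = sorted(pool.map(throttled_write, range(16)))

print(peak <= MAX_CONCURRENT_WRITES)  # True: never more than 2 writers at once
```

This keeps some parallelism (unlike a single write lock) while bounding the number of simultaneous requests hitting GCS.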
I tried this with zarr's thread synchronizer to prevent simultaneous writes to GCS. No luck; same errors. I am still stuck on this issue and unable to move forward.
I am also seeing these errors in my worker logs:
```
distributed.worker - WARNING - Compute Failed
Function: getter
args: (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyOuterIndexedArray(array=_ElementwiseFunctionArray(LazilyOuterIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7f7fba5ddf28>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _apply_mask at 0x7f7fba631488>, dtype=dtype('float32'), decoded_fill_value=nan, encoded_fill_values=[-1e+20]), dtype=dtype('float32')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None), slice(None, None, None)))))), (slice(66, 67, None), slice(43, 44, None), slice(0, 2700, None), slice(0, 3600, None)), True, <SerializableLock: 376ac45c-c1c4-4946-8f5a-3719b667d218>)
kwargs: {}
Exception: OSError('Too many open files',)
```
I guess it could be related.
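Since the traceback ends in `OSError('Too many open files',)`, it may be worth checking the per-process file-descriptor limit before blaming rate limits. A stdlib snippet to inspect it (raising the soft limit toward the hard limit, or `ulimit -n` in the shell, is a common mitigation):

```python
import resource

# "Too many open files" means the process hit its soft RLIMIT_NOFILE.
# Each open socket to GCS counts against this limit, so many concurrent
# connections can exhaust it well before any server-side rate limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limits: soft={soft}, hard={hard}")

# A process may raise its own soft limit up to the hard limit:
try:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
except (ValueError, OSError):
    pass  # some platforms cap the soft limit below the reported hard limit
```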
I wonder, would it be useful to provide an insecure token mode, i.e., one where the actual access token is passed to all the instances, rather than each using a local renewable token, which causes calls to the /token/ endpoint? I call this insecure since the tokens would be passed over open channels, but that is not an issue within the isolated network of a kubernetes cluster.
I think the following should do it: set up a gcsfs instance and perform any operation on it (the first operation will trigger the token refresh), then pass
token=gcs.session.credentials
in the storage parameters (be sure to also give the project explicitly when you do this).
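Put together, the suggestion above might look like the sketch below. `make_shared_token_fs` and its arguments are placeholder names, and `gcs.session.credentials` is a gcsfs internal that may differ between versions:

```python
def make_shared_token_fs(project: str, bucket: str):
    """Authenticate once, then build a GCSFileSystem that re-uses the token.

    Sketch of the workaround described above, not an official gcsfs API;
    it needs a real GCP project and bucket to actually run.
    """
    import gcsfs  # imported lazily so the sketch has no import-time deps

    gcs = gcsfs.GCSFileSystem(project=project)
    gcs.ls(bucket)  # any operation forces the initial token refresh
    # Re-using the resolved credentials lets workers skip the /token/
    # endpoint, with the caveat that the raw token reaches every worker.
    return gcsfs.GCSFileSystem(project=project, token=gcs.session.credentials)
```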
Was this issue ever completely resolved? I've been running into this exact problem when moving very large datasets (~1 TB). Reducing the Dask cluster size seems to help. I am using
token=gcs.session.credentials
as mentioned by @martindurant above.
No, I don't think we have a concrete solution; the problem comes from some sort of rate limit on access to the Google metadata service.
I ran into the same problem (in multi-process CloudFiles). I found a Stack Overflow answer saying it could also be too many open file descriptors (i.e. network connections), but I think you are probably right that it's a Google rate limit.
https://stackoverflow.com/questions/15286288/what-does-this-python-requests-error-mean
I wonder if it would be possible to let these connections share the DNS / auth information.
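One way to share connection state is a single shared session with a bounded, re-used connection pool, rather than a fresh connection per call. A sketch with requests (which gcsfs builds on; the pool sizes here are arbitrary illustrative values):

```python
import requests
from requests.adapters import HTTPAdapter

# A single shared Session re-uses TCP connections (and their DNS lookups)
# across requests instead of opening a new socket per call.
session = requests.Session()
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=10)
session.mount("https://", adapter)

# All callers that go through `session` now draw from the same pool of
# at most 10 connections per host, bounding open file descriptors.
```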
I am trying to push a very large dataset to GCS via the xarray / zarr / gcsfs / dask stack. I have encountered a new error at the gcsfs level.
Here's a summary of what I am doing:
I'm doing this via a distributed client connected to a local multithreaded cluster.
There are almost a million tasks in the graph. It will generally get about 5% of the way in and then hit some sort of intermittent, non-reproducible error.
This is the error I have now.
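For context, the write path described in this issue roughly corresponds to the following sketch (`push_to_gcs`, the file pattern, project, and bucket path are placeholders, not taken from the original report; `get_mapper` may differ between gcsfs versions, and real GCP credentials are needed to run it):

```python
def push_to_gcs(pattern: str, project: str, target: str):
    """Sketch of the xarray -> zarr -> gcsfs -> dask write path."""
    import gcsfs
    import xarray as xr

    ds = xr.open_mfdataset(pattern)     # lazy, dask-backed dataset
    gcs = gcsfs.GCSFileSystem(project=project)
    store = gcs.get_mapper(target)      # zarr store backed by GCS
    ds.to_zarr(store)                   # roughly one task per chunk written
```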