a24lorie opened 11 months ago
I found a thread on StackOverflow about handling the 429 error that could be a potential fix for this:
https://stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python
""" Receiving a status 429 is not an error, it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.
You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP, you should simply respect the server's answer by not sending too many requests.
If everything is set up properly, you will also have received a "Retry-after" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and to sleep your process for that many seconds.
You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3 """
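The advice quoted above can be sketched roughly as follows. This is a minimal, hypothetical illustration (the helper names are made up, not part of fsspec or requests), and it only handles the seconds form of Retry-After, not the HTTP-date form the spec also allows:

```python
import time

import requests


def retry_after_seconds(response, default=1.0):
    """Parse the Retry-After header (seconds form) of a 429 response.

    Hypothetical helper for illustration; falls back to `default`
    when the header is missing or not a plain number.
    """
    try:
        return float(response.headers.get("Retry-After", default))
    except (TypeError, ValueError):
        return default


def get_with_retry_after(url, max_attempts=5):
    """GET a URL, sleeping for the advertised interval on each 429."""
    for _ in range(max_attempts):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Respect the server's requested back-off before trying again.
        time.sleep(retry_after_seconds(response))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```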
fsspec.implementations.DatabricksFileSystem._send_to_api does not currently handle retries. Retries are used in other implementations, with exponential backoff schemes, to cope with this kind of "should work, but not right now" message. If someone were to apply that to this FS, it would be appreciated.
@martindurant Could you point me to an implementation that handles those retries, so I can take a look and see if I'm able to implement a fix?
Here's a complete, complex version that can almost be applied to this case as-is: https://github.com/fsspec/gcsfs/blob/main/gcsfs/retry.py#L118
It would actually be very reasonable to have a retry decorator in this repo, that can be applied to a number of "call this remote thing" methods.
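As a sketch of what such a decorator might look like (the names and defaults here are illustrative assumptions, not the gcsfs implementation or any agreed fsspec API): retry on HTTP 429 only, honour Retry-After when the server sends it, and otherwise back off exponentially:

```python
import functools
import time

import requests


def retry_on_429(max_attempts=5, base_delay=1.0):
    """Hypothetical decorator: retry a remote call when it raises HTTP 429.

    Uses the Retry-After header when present, otherwise exponential
    backoff (base_delay * 2**attempt). Any other HTTPError is re-raised.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except requests.HTTPError as exc:
                    response = exc.response
                    if response is None or response.status_code != 429:
                        raise  # not a rate-limit problem
                    if attempt == max_attempts - 1:
                        raise  # out of retries
                    retry_after = response.headers.get("Retry-After")
                    if retry_after is not None:
                        delay = float(retry_after)
                    else:
                        delay = base_delay * 2 ** attempt
                    time.sleep(delay)
        return wrapper
    return decorator
```

Applied to a method like `_send_to_api`, this would keep the retry policy in one place instead of duplicating it in every "call this remote thing" method.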
@martindurant I have created a decorator to manage the 429 HTTP errors, and have tested for reading and writing parquet files using the pyarrow library. Still, it doesn't work for some unknown reason when reading from a partitioned root directory (using the hive schema). Do you believe it wise to merge the current fix into the repo at least and later on try to find the issue when reading from a partitioned directory structure?
I am happy to look at your fix and see if I can suggest some generalisation of it.
Should I create a pull request, or do you prefer some other method?
PR is best, yes
PR created
I'm using the DatabricksFileSystem implementation ("from fsspec.implementations.dbfs import DatabricksFileSystem") with pyarrow to write a parquet dataset to DBFS, but when using it with pyarrow's write_to_dataset method I'm getting the following error:
429 Client Error: Too Many Requests for url: https://adb-.azuredatabricks.net/api/2.0/dbfs/mkdirs
The code used to write to DBFS is the following:
""" import pyarrow as pa import pyarrow.dataset as ds import pyarrow.parquet as pq from fsspec.implementations.dbfs import DatabricksFileSystem
base_path = "/FileStore/write" test_df = pd.read_csv("../data/diabetes/csv/nopart/diabetes.csv")
filesystem = DatabricksFileSystem( instance=,
token=
)
pq.write_to_dataset(arr_table, filesystem=filesystem, compression='none', existing_data_behavior='error', partition_cols=["Pregnancies"], root_path=f"{base_path}/parquet/part", use_threads= False) """
And the full stack trace is:
""" Traceback (most recent call last): File "/venv/lib/python3.9/site-packages/requests/models.py", line 971, in json
return complexjson.loads(self.text, kwargs)
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 255, in _send_to_api
exception_json = e.response.json()
File "/venv/lib/python3.9/site-packages/requests/models.py", line 975, in json
raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tests/dbfs/test_dbfs_dataset.py", line 35, in setUpClass
pq.write_to_dataset(arr_table, filesystem=cls._filesystem, compression='none',
File "/venv/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 3422, in write_to_dataset
ds.write_dataset(
File "/venv/lib/python3.9/site-packages/pyarrow/dataset.py", line 1018, in write_dataset
_filesystemdataset_write(
File "pyarrow/_dataset.pyx", line 3919, in pyarrow._dataset._filesystemdataset_write
File "pyarrow/types.pxi", line 88, in pyarrow.lib._datatype_to_pep3118
File "pyarrow/_fs.pyx", line 1529, in pyarrow._fs._cb_create_dir
File "/venv/lib/python3.9/site-packages/pyarrow/fs.py", line 374, in create_dir
self.fs.mkdir(path, create_parents=recursive)
File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 138, in mkdir
self.mkdirs(path, kwargs)
File "/venv/lib/python3.9/site-packages/fsspec/spec.py", line 1498, in mkdirs
return self.makedirs(path, exist_ok=exist_ok)
File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 115, in makedirs
self._send_to_api(method="post", endpoint="mkdirs", json={"path": path})
File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 257, in _send_to_api
raise e
File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 250, in _send_to_api
r.raise_for_status()
File "/venv/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://adb-.azuredatabricks.net/api/2.0/dbfs/mkdirs
"""