fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License

429 Client Error: Too Many Requests for url azuredatabricks.net/api/2.0/dbfs/mkdirs #1488

Open a24lorie opened 11 months ago

a24lorie commented 11 months ago

I'm using `fsspec.implementations.dbfs.DatabricksFileSystem` with pyarrow to write a parquet dataset to DBFS, but when calling pyarrow's `write_to_dataset` with the DatabricksFileSystem implementation I'm getting the following error:

429 Client Error: Too Many Requests for url: https://adb-.azuredatabricks.net/api/2.0/dbfs/mkdirs

The code used to write to DBFS is the following:

""" import pyarrow as pa import pyarrow.dataset as ds import pyarrow.parquet as pq from fsspec.implementations.dbfs import DatabricksFileSystem

base_path = "/FileStore/write" test_df = pd.read_csv("../data/diabetes/csv/nopart/diabetes.csv")

filesystem = DatabricksFileSystem( instance=, token= )

pq.write_to_dataset(arr_table, filesystem=filesystem, compression='none', existing_data_behavior='error', partition_cols=["Pregnancies"], root_path=f"{base_path}/parquet/part", use_threads= False) """

And the full stack trace is:

""" Traceback (most recent call last): File "/venv/lib/python3.9/site-packages/requests/models.py", line 971, in json return complexjson.loads(self.text, kwargs) File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/init.py", line 346, in loads return _default_decoder.decode(s) File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 337, in decode obj, end = self.raw_decode(s, idx=_w(s, 0).end()) File "/Applications/Xcode.app/Contents/Developer/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/json/decoder.py", line 355, in raw_decode raise JSONDecodeError("Expecting value", s, err.value) from None json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 255, in _send_to_api exception_json = e.response.json() File "/venv/lib/python3.9/site-packages/requests/models.py", line 975, in json raise RequestsJSONDecodeError(e.msg, e.doc, e.pos) requests.exceptions.JSONDecodeError: Expecting value: line 1 column 1 (char 0) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/tests/dbfs/test_dbfs_dataset.py", line 35, in setUpClass pq.write_to_dataset(arr_table, filesystem=cls._filesystem, compression='none', File "/venv/lib/python3.9/site-packages/pyarrow/parquet/core.py", line 3422, in write_to_dataset ds.write_dataset( File "/venv/lib/python3.9/site-packages/pyarrow/dataset.py", line 1018, in write_dataset _filesystemdataset_write( File "pyarrow/_dataset.pyx", line 3919, in pyarrow._dataset._filesystemdataset_write File "pyarrow/types.pxi", line 88, in pyarrow.lib._datatype_to_pep3118 File "pyarrow/_fs.pyx", line 1529, in pyarrow._fs._cb_create_dir File 
"/venv/lib/python3.9/site-packages/pyarrow/fs.py", line 374, in create_dir self.fs.mkdir(path, create_parents=recursive) File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 138, in mkdir self.mkdirs(path, kwargs) File "/venv/lib/python3.9/site-packages/fsspec/spec.py", line 1498, in mkdirs return self.makedirs(path, exist_ok=exist_ok) File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 115, in makedirs self._send_to_api(method="post", endpoint="mkdirs", json={"path": path}) File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 257, in _send_to_api raise e File "/venv/lib/python3.9/site-packages/fsspec/implementations/dbfs.py", line 250, in _send_to_api r.raise_for_status() File "/venv/lib/python3.9/site-packages/requests/models.py", line 1021, in raise_for_status raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 429 Client Error: Too Many Requests for url: https://adb-.azuredatabricks.net/api/2.0/dbfs/mkdirs """

a24lorie commented 11 months ago

I have found a thread on StackOverflow about handling the 429 error that could be a potential fix for this:

https://stackoverflow.com/questions/22786068/how-to-avoid-http-error-429-too-many-requests-python

""" Receiving a status 429 is not an error, it is the other server "kindly" asking you to please stop spamming requests. Obviously, your rate of requests has been too high and the server is not willing to accept this.

You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP, you should simply respect the server's answer by not sending too many requests.

If everything is set up properly, you will also have received a "Retry-after" header along with the 429 response. This header specifies the number of seconds you should wait before making another call. The proper way to deal with this "problem" is to read this header and to sleep your process for that many seconds.

You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3 """
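The advice above can be sketched as a small helper. This is illustrative only (the function names are invented, not fsspec API); `send` stands in for any callable returning a `requests`-style response with `status_code` and `headers`:

```python
import time


def retry_after_seconds(headers, default=1.0):
    """Read the Retry-After header (seconds form) from a header mapping."""
    try:
        return float(headers.get("Retry-After", default))
    except (TypeError, ValueError):
        # Retry-After may also be an HTTP date; fall back to the default.
        return default


def call_honouring_retry_after(send, max_attempts=5, sleep=time.sleep):
    """Call ``send()`` until it returns something other than HTTP 429,
    sleeping for the server-suggested interval between attempts."""
    for _ in range(max_attempts):
        response = send()
        if response.status_code != 429:
            break
        sleep(retry_after_seconds(response.headers))
    return response
```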

martindurant commented 11 months ago

`fsspec.implementations.dbfs.DatabricksFileSystem._send_to_api` does not currently handle retries. Retries are used in other implementations, with exponential backoff schemes, to cope with this kind of "should work, just not right now" message. If someone were to apply the same approach to this filesystem, it would be appreciated.
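A minimal sketch of such a backoff loop, under the assumption that `send` is any callable returning a response with a `status_code` (real implementations such as gcsfs also retry other transient statuses and cap the total wait):

```python
import random
import time


def send_with_backoff(send, retries=5, base_delay=0.5, sleep=time.sleep):
    """Retry ``send()`` on HTTP 429, doubling the wait each attempt."""
    for attempt in range(retries):
        response = send()
        if response.status_code != 429 or attempt == retries - 1:
            return response
        # 0.5s, 1s, 2s, ... plus a little jitter so clients do not sync up.
        sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
```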

a24lorie commented 11 months ago

@martindurant Could you point me to an implementation that handles those retries? I'd like to take a look and see if I'm able to implement a fix.

martindurant commented 11 months ago

Here's a complete, complex version that can almost be applied to this case as-is: https://github.com/fsspec/gcsfs/blob/main/gcsfs/retry.py#L118

It would actually be very reasonable to have a retry decorator in this repo that could be applied to a number of "call this remote thing" methods.
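One possible shape for such a decorator (the name `retry_on_429` and its signature are hypothetical, not existing fsspec API; it retries any exception carrying a 429 `response.status_code`, which matches how `requests.HTTPError` is structured):

```python
import functools
import time


def _status_code(exc):
    """Best-effort extraction of an HTTP status from an exception."""
    response = getattr(exc, "response", None)
    return getattr(response, "status_code", None)


def retry_on_429(retries=5, base_delay=0.5, sleep=time.sleep):
    """Decorator retrying the wrapped call on HTTP 429 with backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Re-raise anything that is not throttling, and give up
                    # after the last allowed attempt.
                    if _status_code(exc) != 429 or attempt == retries - 1:
                        raise
                    sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
```

Applied to a method like `_send_to_api`, a decorator of this shape would cover every DBFS REST call without touching the individual call sites.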

a24lorie commented 11 months ago

@martindurant I have created a decorator to manage the 429 HTTP errors and have tested it for reading and writing parquet files using the pyarrow library. Still, for some unknown reason it doesn't work when reading from a partitioned root directory (using the hive scheme). Do you think it wise to merge the current fix into the repo at least, and later try to track down the issue with reading from a partitioned directory structure?

martindurant commented 11 months ago

I am happy to look at your fix and see if I can suggest some generalisation of it.

a24lorie commented 11 months ago

Should I create a pull request, or do you prefer some other method?

martindurant commented 11 months ago

PR is best, yes

a24lorie commented 10 months ago

PR created