box / box-python-sdk

Box SDK for Python
http://opensource.box.com/box-python-sdk/
Apache License 2.0
418 stars 215 forks source link

"Connection reset by peer" errors occur after pauses >15min #763

Closed exdevlin closed 1 year ago

exdevlin commented 1 year ago

Observed

  1. Our application initializes a connection with the BoxSDK near the beginning of execution to fetch IDs and locate files.
  2. Our application then does intense processing that may take anywhere from a few minutes to upwards of half an hour, depending on data ingestion sizes.
  3. Our application then attempts to save new artifacts to Box, and immediately raises this Exception: WARNING:boxsdk.network.default_network:[31mRequest "GET https://api.box.com/2.0/folders/140796660496/items" failed with ConnectionError exception: ConnectionError(ProtocolError('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')))
    • This Exception is only observed if the gap in Step 2 is longer than 15 minutes. This behaviour has only been observed since August 12, 2022
    • We are using Python 3.8.5
    • We are using boxsdk==2.12.1 (attempted upgrading to 3.4.0 as well)

Currently we are wrapping the exception with function that re-initializes the boxsdk.Client() object and retries. While this does work, we would like some clarity on why connections are forcibly terminated on Box's side and what the best practice should be here.

arjankowski commented 1 year ago

Hi @exdevlin ,

Thanks for submitting this issue! We will take a look and get back to you ASAP!

@arjankowski

arjankowski commented 1 year ago

Hi @exdevlin

Unfortunately I'm not able to reproduce this error, even after the 30 minutes gap between calls, on my local machine with macOS.

So I have same questions to this issue: 1) On which operation system you run your application (win, linux, macos)? 2) Does this always happen after a gap in step 2 that is longer than 15 minutes? 2) Can you show me the code example that make this issue? 3) You mention that the error occurs when you try to save new artifacts to Box, but in the error there is info saying that it's occurred during fetching data -> GET https://api.box.com/2.0/folders/140796660496/items. Does the errors occurs after first API request called after Step 2?

exdevlin commented 1 year ago

Hello, and those are excellent questions!

  1. We run this application from linux containers in AWS -- the Docker containers are based off the amazonlinux:2 version. I believe it was based on RedHat? https://hub.docker.com/_/amazonlinux
  2. Yes, this always happens after a gap longer than 15 minutes.
  3. Excellent catch -- in my original question, I summarized in step 3 that the application saves artifacts to Box, but that process does involve first fetching some data about the path through the API, and thus throws an Exception immediately. This can best be summarized as: We are seeing the failures immediately upon trying to access the API.
Jeff-Meadows commented 1 year ago

It's difficult to tell which of the 2 things that could be happening, are indeed happening, but it's really down to just 2 things:

  1. Client is dropping the idle connection (and requests/urllib3 isn't noticing)
  2. Server is dropping the idle connection (and the client isn't handling it)

Here are some things that might be interesting to try:

  1. Set keepalive. (see https://github.com/urllib3/urllib3/blob/main/src/urllib3/connection.py#L109). This could help if the problem is on the client side.
  2. Make a dummy request wrapped in try/except after the processing step (or just retry the first request). I am assuming that this will help if it's the server side dropping the connection; requests should create a new connection for subsequent requests.

I realize if one of those fixes the issue, it won't feel very satisfying, since it seems like things were working pretty well until a few weeks ago. But they're generally good things to do for long-running applications. I hope this helps!

exdevlin commented 1 year ago

Thanks for the suggestions @Jeff-Meadows !

We've actually made efforts in trying your two approaches already (though now we are seeing slightly different behaviour), so I will actually address your two points in reverse to make sense for our timeline progression:

  1. We've actually tried a similar approach in order to get past some of the connectivity issues! We've written a decorator around any function that directly calls the boxsdk or any boxsdk objects to catch the original 'Connection reset by peer' Exception and re-initialize the Client object if needed. However, we are now seeing issues afterwards where we ask the File object to update to a different parent Folder, and we see "Invalid id xxxxxxxx. Expected a string". This error is curious to us because everywhere we've been able to, we have passed along string-cast IDs. We then tried your first suggestion to add Keep-alive to see how it worked:
  2. We were not sure how to precisely add Keep-alive to the session but saw it in the default_headers after using this snippet:
    auth = JWTAuth.from_settings_dictionary(self.config_json)
    session = Session(default_headers={'connection': 'keep-alive'})
    authorized_session = AuthorizedSession(auth, **session.get_constructor_kwargs())
    box_client = Client(auth, authorized_session)

Please let us know if we have done this incorrectly, but we have seen no changes to behaviours by adding this keep-alive.

Jeff-Meadows commented 1 year ago

Hi @exdevlin - I'll start with client-side keepalive. Our SDK's default network implementation is based on the requests library, which is in turn built on top of urllib3. We don't expose this functionality directly, because it's more of a low-level networking thing. The idea, though, would be to import HTTPConnection from urllib3 and set the socket option there. Something like

import socket
from urllib3.connection import HTTPConnection

HTTPConnection.default_socket_options += [
    (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),
]

Ensure this code runs once before you make any Box API calls.

NB: I think several years ago, requests shipped with its own version of urllib3. If your installed version of requests is old enough, you may have to import HTTPConnection differently.

Next, I want to dig in on the issues you're seeing with the re-initialized Client after recovering from the "Connection reset by peer" error. It would help to see some code if at all possible, but it would also help to see the stack trace of such an error. I'm particularly interested in which ID is being referred to here - the file you're moving, or the destination folder.

exdevlin commented 1 year ago

Hi @Jeff-Meadows, thanks for the code snippet. I'll try that and get back to you on it. In the meantime, our installed version of requests appears to be 2.28.1 which is their newest release.

UPDATE: We have implemented the keep-alive code that you gave us, and we are still receiving: Exception message: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) errors.

To answer your question about ID failures -- previous runs have shown that the folder_id is what's failed when in that scenario I mention. The same Folder object has been used previously without trouble, but this failure occurs after that 15 minute lapse as well.

Jeff-Meadows commented 1 year ago

Hi @exdevlin , I think we can rule out the scenario where the client is dropping the connection, so we can focus on handling the error and recovering. The good news is that it looks like we're doing that, although then we're seeing an unexpected error afterward.

Again, it'd really help to see a stack trace. I'm assuming the error is coming from Box's API (not the SDK code), since I don't think the SDK throws that error, but please correct me if I'm wrong about that.

There's a bug in the SDK, somewhere in Box's APIs, or in your code, so we need to track it down. It'd really help to know the details of the failing API call so we can figure out if the parent folder ID is invalid or if something else is going on.

exdevlin commented 1 year ago

I can certainly provide a stack trace and a more generalized version of what's happening in our code.

After the application completes heavy processing (15+ minutes later, eg. we recently just processed a 52 minute file), we upload several newly created artifacts to Box via a chunked uploader due to their sizes being over 20MB, as per the documentation.

File "/app/c_d/box/helper.py", line 484, in upload_in_chunks
    chunked_uploader = destination_folder.get_chunked_uploader(source_path)
    File "/app/boxsdk/util/api_call_decorator.py", line 63, in call
    return method(*args, **kwargs)
    File "/app/boxsdk/object/folder.py", line 160, in get_chunked_uploader
    upload_session = self.create_upload_session(total_size, upload_file_name)
    File "/app/boxsdk/util/api_call_decorator.py", line 63, in call
    return method(*args, **kwargs)
    File "/app/boxsdk/object/folder.py", line 135, in create_upload_session
    response = self._session.post(url, data=json.dumps(body_params)).json()
    File "/app/boxsdk/session/session.py", line 98, in post
    return self.request('POST', url, **kwargs)
    File "/app/boxsdk/session/session.py", line 134, in request
    response = self._prepare_and_send_request(method, url, **kwargs)
    File "/app/boxsdk/session/session.py", line 337, in _prepare_and_send_request
    self._raise_on_unsuccessful_request(network_response, request)
    File "/app/boxsdk/session/session.py", line 264, in _raise_on_unsuccessful_request
    raise BoxAPIException(
    boxsdk.exception.BoxAPIException: Message: Invalid folder_id 174957008340. Expected a string.
    Status: 400
    Code: invalid_folder_id
    Request ID: 1488759ea11cb566c1b5fc64c2af8903
    Headers: {'Server': 'nginx', 'Date': 'Wed, 28 Sep 2022 23:31:11 GMT', 'Content-Type': 'application/json; charset=UTF-8', 'Content-Length': '139', 'X-Box-Original-Ingress-ADC-Host': 'prod-c1-b-traffic-manager-k0ww', 'Strict-Transport-Security': 'max-age=31536000', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"'}
    URL: https://upload.box.com/api/2.0/files/upload_sessions
    Method: POST
    Context Info: None

The destination_folder object referenced near the top of that stack trace was created at the beginning of execution before the long processing times. We have found that if we re-initialize that object before this large file transfer post-processing, this error does not happen.

lukaszsocha2 commented 1 year ago

Hi @exdevlin, I was able to reproduce this error, when I initialised folder with id of type int instead of str. Actually this is exactly what exception message says. My code sample, which causes that: client.folder(168518778972).get_chunked_uploader('path/to/file.pdf') but should be: client.folder('168518778972').get_chunked_uploader('path/to/file.pdf') Is it possible that somewhere in your code you use int to get your destination folder?

As a quick solution I would suggest to modify to your code to: destination_folder.get().get_chunked_uploader(source_path). get() method will get all information about this folder including id as str, which should solve the issue. As a long term solution you may try to debug how int id gets there in the first place and try to eliminate it. Please let us know if it helped.

exdevlin commented 1 year ago

As I mentioned in a previous comment: "This error is curious to us because everywhere we've been able to, we have passed along string-cast IDs". Hence the confusion around this error. I also mentioned that this is the same object that was previously initialized and used without issue before the 15 minute gap.

I can try to chain the additional .get() and see if that helps, but given the previous pattern, I suspect the .get() call will fail instead of the .get_chunked_uploader() call, as it has historically failed immediately upon trying to contact the Box API.

Jeff-Meadows commented 1 year ago

I went ahead and found the code in our backend that rejects calls made with a number instead of a string, and it is not used on all of our endpoints (but it is used when creating a chunked upload session). It's definitely a little strange that you're ending up with a folder with an object_id that's a number - I tried creating a folder in a number of different ways and I always ended up with a string ID. I suspect the only way to figure out what's going on there is to periodically (or before/after each call) to log your objects, including the type of its ID, and try to figure out how that's going on.

Perhaps one of our endpoints is erroneously returning a folder ID as a number, or perhaps there's a code path in the SDK that's doing something wrong. I've been unable to find either of those so far, though.

exdevlin commented 1 year ago

Good morning!

We have moved to explicitly storing all string IDs in our code, and it appears to have fixed that specific error. However, it did not help fix the original issue with the 'Connection reset by peer' error. It seems that the only solution going forward is to force a reauth anytime this exception occurs.

Jeff-Meadows commented 1 year ago

Hi @exdevlin - thanks for the update. I'm glad the specific error has been fixed. I'm going to follow up with a few teams to try to figure out if there's something going on with Box's systems to cause the Connection reset by peer error. I think it's good that your code is prepared to handle that error, but it'll be even better if it doesn't have to handle it very often.

To be clear - you've tried simply retrying the request when you get that error, and it doesn't work? You have to re-auth?

exdevlin commented 1 year ago

Hey @Jeff-Meadows , simply retrying the operation was what we first tried several weeks ago, and it would just leave us in an unending loop of retries. We found that recreating the Client object worked.

Jeff-Meadows commented 1 year ago

Thanks for the info @exdevlin - so it just kept getting the "Connection reset by peer" repeatedly until a re-auth? Sorry, I'm not sure I understood that until now.