Backblaze / b2-sdk-python

Python library to access B2 cloud storage.

Broken pipe: unable to send entire request. ServiceError: 503 service_unavailable c001_v0001009_t0016 is too busy #73

Open spinitron opened 4 years ago

spinitron commented 4 years ago

The command run from a cron.hourly script was

b2 sync --noProgress --keepDays 14 /home/data/v2 b2://bhs-backup/ > /dev/null

b2 wrote to stderr

ERROR:b2sdk.bucket:error when uploading, upload_url was https://pod-000-1128-03.backblaze.com/b2api/v2/b2_upload_file/redacted/redacted
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/bucket.py", line 615, in _upload_small_file
    content_type, HEX_DIGITS_AT_END, file_info, hashing_stream
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/raw_api.py", line 533, in upload_file
    return self.b2_http.post_content_return_json(upload_url, headers, data_stream)
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 297, in post_content_return_json
    response = _translate_and_retry(do_post, try_count, post_params)
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 119, in _translate_and_retry
    return _translate_errors(fcn, post_params)
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 83, in _translate_errors
    raise BrokenPipe()
BrokenPipe: Broken pipe: unable to send entire request
ERROR:b2sdk.bucket:error when uploading, upload_url was https://pod-000-1009-00.backblaze.com/b2api/v2/b2_upload_file/redacted/redacted
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/bucket.py", line 615, in _upload_small_file
    content_type, HEX_DIGITS_AT_END, file_info, hashing_stream
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/raw_api.py", line 533, in upload_file
    return self.b2_http.post_content_return_json(upload_url, headers, data_stream)
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 297, in post_content_return_json
    response = _translate_and_retry(do_post, try_count, post_params)
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 127, in _translate_and_retry
    return _translate_errors(fcn, post_params)
  File "/usr/local/lib/python2.7/dist-packages/b2sdk/b2http.py", line 60, in _translate_errors
    int(error['status']), error['code'], error['message'], post_params
ServiceError: 503 service_unavailable c001_v0001009_t0016 is too busy
spinitron commented 4 years ago

This error was logged 4 times between Sep 9 and today.

ppolewicz commented 4 years ago

I'm designating this issue to collect several other related ones, since in my opinion it shows the core of the problem: c001_v0001009_t0016 (whatever it is) was busy, temporarily impaired, restarted, or otherwise degraded over the last couple of weeks (due to drive failures, I would guess). During that time it failed to service requests, and b2cli/b2sdk made several further attempts to recover, but those attempts failed too (#74 shows this prominently as three different error types, all indicating a server-side connection issue).

These kinds of issues are rare for the B2 cloud, but they do sometimes happen, and sometimes the problem is not resolved quickly enough for sync to complete before its retry attempts are exhausted.

A possible workaround I've thought about is a deadline parameter for sync. It would change the retry limiter so that failed operations are retried (with capped exponential back-off) until either the operation completes successfully or the deadline expires. This would address most (if not all) server-related issues reported against sync in the last few years, and it would also provide a new feature: dodging primary storage traffic, so that the server is not stressed by the sync operation during some other type of load - useful for crontab jobs that might otherwise overlap and overload the system.
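Roughly, the retry limiter would behave like this (a minimal sketch of the idea, not the actual b2sdk code; the exception handling and numbers are placeholders):

import random
import time

def retry_until_deadline(operation, deadline_seconds, max_sleep=64.0):
    # Retry a zero-argument callable with capped exponential back-off until
    # it succeeds or the deadline expires, then re-raise the last error.
    deadline = time.monotonic() + deadline_seconds
    sleep = 1.0
    while True:
        try:
            return operation()
        except Exception:  # placeholder for BrokenPipe, ServiceError(503), etc.
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise
            # back off with jitter, but never sleep past the deadline
            time.sleep(min(sleep * random.uniform(0.5, 1.5), remaining))
            sleep = min(sleep * 2, max_sleep)

A sync --deadline 3600 would then amount to running each upload through something like this, with deadline_seconds covering whatever remains of the hour.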

@spinitron would that solution work for you? --deadline $(( $(date +%s) +3600)) or --timeout 3600

spinitron commented 4 years ago

I don't know if that would be a solution. Your argument relies on a model of server resource behavior that I cannot judge.

Assuming that connection failures and server resource unavailability are rare and independent of each other, increasing the number of retries (or the period of time to keep retrying) should reduce the failure rate of the overall b2 sync command.

  1. I don't know if the assumption of independence is valid. I can imagine lots of reasonable causes for connection/server errors to cluster and/or persist.

  2. I don't know about the assumption of rarity. We run one b2 sync command per hour and I get an email every time that command errors. I gave the statistics for each different failure mode in the GitHub issues I submitted yesterday. I sometimes get the error emails more than once a day.

  3. I assume b2 sync currently makes some number of retry attempts over some period of time. Increasing those parameters should lead to a marginal improvement in the overall failure rate. How big that margin is depends on the current parameters and how much they are increased, neither of which I know. And this argument depends on 1. and 2.

Note: The directory I'm syncing has 25 to 30 thousand small plain files distributed in a tree with a few thousand directories. Does the chance of failure of the overall command increase with the number of files/dirs? Again, I don't know.
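A back-of-the-envelope sanity check (illustrative numbers only, and it assumes per-file failures are independent, which per point 1 above may not hold):

# If each upload still fails with independent probability p after its own
# retries, a sync of n files completes cleanly with probability (1 - p) ** n.
p = 1e-5    # assumed residual per-file failure rate
n = 30000   # roughly the size of the tree described above
print(1 - (1 - p) ** n)  # ~0.26, i.e. about one failed sync in four

So, all else being equal, more files does mean more chances for a transient error to surface.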

What to do? I think what you propose might help buy some time. I don't know. But I wouldn't call it a solution. There's only so much that you can accomplish by changing retry parameters in the client. For example, what if the user cannot allow much more time for retries?

I'd think Backblaze would want to find out why this customer routinely gets these errors. Making them less frequent so that I stop reporting them might hide a pathology that needs a different remedy.

ppolewicz commented 4 years ago

I know they have monitoring, and if anything like that happens, they notice. I also know they really care about the quality of the service they provide - if you look through the history of issues, it is clear that server-related problems stop being observed pretty quickly after they start happening. They are fixing and improving things all the time. You might want to contact support to get into a direct conversation - this repo is for the Python SDK (which, as you say, could handle such errors in a slightly better way).

spinitron commented 4 years ago

That's good to hear. But once again, this doesn't sound like a good reason to increase retry times/attempts. On the contrary: it's better if the errors are visible if you're going to fix them.

I thought this (GitHub) was the way to contact support, which is why I posted each different failure mode. I figured it would be useful to developers.

Is this what you suggest? https://help.backblaze.com/hc/en-us/requests/new

ppolewicz commented 4 years ago

yes