Backblaze / B2_Command_Line_Tool

The command-line tool that gives easy access to all of the capabilities of B2 Cloud Storage

TruncatedOutput: only 1495314753 of 2121389892 bytes read #554

Open Desto7 opened 5 years ago

Desto7 commented 5 years ago

I'm able to use "b2 download-file-by-name" to download small files, but when I target a 2.1 GB file, it crashes out randomly midway through. (I have run the command at least 10 times over the course of two days. Each time it crashed after having downloaded in the range of 1.4 - 2.0 GB out of 2.1 GB.) Reading through the issues page, it seemed that "b2 sync" is recommended. The same issue remains though, crashing out at about 1.7 GB. Since no one else appears to have this rather fundamental problem, I suspect it's related to my region/ISP/home network. Still... any help would be appreciated. I have attached a --debugLog and pasted a typical command line response here. Thanks in advance.

b2_cli.log

CMD:

C:\Users\Desto>b2 download-file-by-name PeterBackup1 PB1/PB.000.000.000.004.pb8 "C:\Users\Desto\Desktop\Matlab\Projects 2018\Peter Backup\PB1_BackupCheck\temp\PB.000.000.000.004.pb8"

Output:

-snip- C:\Users\Desto\Desktop\Matlab\Projects 2018\Peter Backup\PB1_BackupCheck\temp\PB.000.000.000.004.pb8: 70%|7| 1.50G/2.12G [05:37<03:15, 3.20MB/s]
ERROR:b2.console_tool:ConsoleTool command error
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\b2\console_tool.py", line 1399, in run_command
    return command.run(args)
  File "c:\python27\lib\site-packages\b2\console_tool.py", line 532, in run
    bucket.download_file_by_name(args.b2FileName, download_dest, progress_listener)
  File "c:\python27\lib\site-packages\logfury\v0_1\trace_call.py", line 84, in wrapper
    return function(*wrapee_args, **wrapee_kwargs)
  File "c:\python27\lib\site-packages\b2\bucket.py", line 168, in download_file_by_name
    url, download_dest, progress_listener, range_
  File "c:\python27\lib\site-packages\logfury\v0_1\trace_call.py", line 84, in wrapper
    return function(*wrapee_args, **wrapee_kwargs)
  File "c:\python27\lib\site-packages\b2\transferer\transferer.py", line 115, in download_file_from_url
    range_, bytes_read, actual_sha1, metadata
  File "c:\python27\lib\site-packages\b2\transferer\transferer.py", line 122, in _validate_download
    raise TruncatedOutput(bytes_read, metadata.content_length)
TruncatedOutput: only 1495314753 of 2121389892 bytes read
ERROR: only 1495314753 of 2121389892 bytes read

Desto7 commented 5 years ago

Anyone any idea?

ppolewicz commented 5 years ago

It looks, as you say, like it's a network problem. The connection between your machine and b2 cloud server has deteriorated to a point of failure and the CLI has reported it (as it should).

A future version of the B2 CLI will do more to automatically recover from this type of issue.

Desto7 commented 5 years ago

Thanks for having a look and confirming.

Currently, this issue is completely preventing me from using the command line tool, since I can't download most of my files. (Uploads work fine, so I'm pretty baffled.)

Is there a workaround to keep the connection alive, or to download parts of a file instead?

ppolewicz commented 5 years ago

I can see from the stack trace that you are using a relatively new version of the CLI, which has the parallel transferer (which I have implemented) enabled by default for large files, and your file is large.

An obvious workaround would be to split the file into several smaller files (using 7zip with very low compression?) and reassemble it upon restore. It's not ideal, but maybe it will work for you?
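
If it helps, here is a minimal plain-Python sketch of that split/reassemble workaround (straight chunking rather than 7zip; the 200 MB part size and the .partNNNN naming are arbitrary illustration choices, not anything the CLI prescribes):

import shutil

CHUNK_SIZE = 200 * 1024 * 1024  # 200 MB parts, small enough to transfer reliably

def split_file(path, chunk_size=CHUNK_SIZE):
    """Write path.part0000, path.part0001, ... and return the part names."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            part_name = "{}.part{:04d}".format(path, index)
            with open(part_name, "wb") as dst:
                dst.write(chunk)
            parts.append(part_name)
            index += 1
    return parts

def join_files(parts, out_path):
    """Reassemble the downloaded parts (sorted by name) into the original file."""
    with open(out_path, "wb") as dst:
        for part_name in sorted(parts):
            with open(part_name, "rb") as src:
                shutil.copyfileobj(src, dst)

Each part can then be uploaded and downloaded individually (b2 upload-file / b2 download-file-by-name), so a failed transfer only costs one part instead of the whole 2.1 GB.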

Actually, I have observed an issue like the one you report here on my workstation during testing of the parallel transferer. In my case it was caused by the VirtualBox "NAT" network driver, which is known to cause massive issues when performance gets reasonably high. If you are using the VirtualBox "NAT" driver, please try switching to "Bridged" - it resolved the problem instantly in my case (and improved performance significantly). Alternatively (since there is no configurability for parallel transferer parameters in the current version), you can try reverting to b2 CLI version 1.3.6, which always used just one thread to download files, regardless of their size. It may be slower, but more reliable in your case.

Desto7 commented 5 years ago

you can try to revert to b2 CLI version 1.3.6

I would love to try this. Unfortunately, I'm not at all familiar with GitHub (sacrilegious, I know), so I'm just looking up what I need to do exactly. I don't expect you to tutor me on how to actually use this site, but if there happens to be a command you could give me that would install CLI 1.3.6 off the bat, I'd love to hear it.

Otherwise, I'll continue my github crash course. Here's where I'm up to: I have downloaded the verified CLI commit of 22 Aug, and I can reach >python, >git, and >java from my command line. Now to compile it... hmm.

ppolewicz commented 5 years ago

@Desto7 pip install b2==1.3.6

and if you'd like to install a version that you have checked out locally, then:

pip install -r requirements.txt
python setup.py install
Desto7 commented 5 years ago

Going to 1.3.6 has fixed my downloading issue. Brilliant!! Thank you so much!

For those curious, 1.3.6 is slower, taking 14 minutes for a 2.1 GB file at a fluctuating bitrate, as opposed to 8 minutes at my max bitrate when using the latest build.

For my purpose, half speed is fine, so thank you again! Let me know if I should close/mark as solved/ or anything.

ppolewicz commented 5 years ago

@Desto7 could you tell me a little bit more about your environment? Is it a vm, an IoT device, on what network it is etc?

Desto7 commented 5 years ago

It is a W10 computer on a home network, nothing too special. Since you helped me out greatly, I thought I'd do a little test to provide some clarity. I installed the CLI on an old W7 machine I had sitting around and connected it to the same network as the W10. And guess what? The latest version of the CLI works fine. So I must conclude the problem is limited to my W10 machine. D'oh!

It has gotten a few network drivers installed over the years (such as remote LAN tools, e.g. EvolveHQ); perhaps one of these is the cause. But it looks like you can rest easy! It was probably an odd mixture of drivers that caused my issue. Though some CLI tools for keeping connections alive, or for continuing failed downloads, would always be handy of course.

jimkutter commented 5 years ago

I'm having the same issue occasionally on large (> 1GB) files.

Also on W10, however I run this stuff through WSL.

Not sure how helpful this will be, but it's one more datapoint.

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/b2/console_tool.py", line 1399, in run_command
    return command.run(args)
  File "/usr/local/lib/python2.7/dist-packages/b2/console_tool.py", line 532, in run
    bucket.download_file_by_name(args.b2FileName, download_dest, progress_listener)
  File "/usr/local/lib/python2.7/dist-packages/logfury/v0_1/trace_call.py", line 84, in wrapper
    return function(*wrapee_args, **wrapee_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/b2/bucket.py", line 168, in download_file_by_name
    url, download_dest, progress_listener, range_
  File "/usr/local/lib/python2.7/dist-packages/logfury/v0_1/trace_call.py", line 84, in wrapper
    return function(*wrapee_args, **wrapee_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/b2/transferer/transferer.py", line 115, in download_file_from_url
    range_, bytes_read, actual_sha1, metadata
  File "/usr/local/lib/python2.7/dist-packages/b2/transferer/transferer.py", line 122, in _validate_download
    raise TruncatedOutput(bytes_read, metadata.content_length)
TruncatedOutput: only 1156913275 of 1285087970 bytes read
ppolewicz commented 5 years ago

I'm looking into this

DerekChia commented 5 years ago

I'm getting this error as well, and it stops consistently at around the 40 GB download mark for me.

ERROR:b2.console_tool:ConsoleTool command error
Traceback (most recent call last):
  File "/opt/anaconda/lib/python3.7/site-packages/b2/console_tool.py", line 1399, in run_command
    return command.run(args)
  File "/opt/anaconda/lib/python3.7/site-packages/b2/console_tool.py", line 507, in run
    self.api.download_file_by_id(args.fileId, download_dest, progress_listener)
  File "/opt/anaconda/lib/python3.7/site-packages/logfury/v0_1/trace_call.py", line 84, in wrapper
    return function(*wrapee_args, **wrapee_kwargs)
  File "/opt/anaconda/lib/python3.7/site-packages/b2/api.py", line 175, in download_file_by_id
    return self.transferer.download_file_from_url(url, download_dest, progress_listener, range_)
  File "/opt/anaconda/lib/python3.7/site-packages/logfury/v0_1/trace_call.py", line 84, in wrapper
    return function(*wrapee_args, **wrapee_kwargs)
  File "/opt/anaconda/lib/python3.7/site-packages/b2/transferer/transferer.py", line 115, in download_file_from_url
    range_, bytes_read, actual_sha1, metadata
  File "/opt/anaconda/lib/python3.7/site-packages/b2/transferer/transferer.py", line 122, in _validate_download
    raise TruncatedOutput(bytes_read, metadata.content_length)
b2.exception.TruncatedOutput: only 40719702112 of 79957501184 bytes read
ERROR: only 40719702112 of 79957501184 bytes read
ppolewicz commented 5 years ago

I fixed it in https://github.com/Backblaze/b2-sdk-python/pull/32

rtrainer commented 4 years ago

@ppolewicz I know this has been closed for over a year, but downloading a 104G file kept failing for me and using 1.3.6 fixed the issue.

ppolewicz commented 4 years ago

@rtrainer can you please try with B2 CLI v2.0? Quite a few things have been rewritten there; it should be correct and faster than 1.3.6.

rtrainer commented 4 years ago

I used the latest release which is v2.0.2.

ppolewicz commented 4 years ago

@rtrainer just to be clear: downloading a 104G file kept failing for you with CLI v2.0.2, then you switched to 1.3.6 and it worked fine?

rtrainer commented 4 years ago

That is correct. During the download there were a couple of timeouts, but it kept going until somewhere between 75 GB and 100 GB, when it would fail. 1.3.6 took longer but worked perfectly with no timeouts. I tried running it on Windows and Ubuntu 18.04.

ppolewicz commented 4 years ago

@rtrainer this did not show up in my tests, so clearly there is a difference in the environment. Could you please say a bit more about your environment, specifically everything you can about your network connection (and any usage of it during the download process), what type of storage device you are writing to (and whether anything else is writing to it), the age of that device and the amount of remaining free space, the filesystem type - anything you can think of will help me narrow down the cause. (I know of one potential cause, but it seems that in your case it may be something different.)

Also, I'd like to ask what behavior you would like to see in a perfect world - should the download process retry for a really long time (say, a day) if that's necessary because of a horrible connection? Currently the number of attempts is limited (to 5 per fragment, I believe, which is subject to change) and we might want to change that. Finding a solution that you'd be happy with would be a nice starting point.

rtrainer commented 4 years ago

My internet connection is 1 Gb FIOS fiber to my router. I run enterprise-grade equipment in my home network. All of my switches are connected with 1 Gb fiber interconnects and I have a 1 Gb connection to my laptop. My laptop has 64 GB RAM, an i7-7700K processor and multiple 1 TB Samsung SSD storage devices. I watched my network after the first couple of failures and saw nothing to make me believe there was a problem with it.

I would like to see options for the number of download threads, the number of retries, and maybe also the timeout value. This would give me some tools to work around the problem.

I am happy to do whatever testing you would like me to do to help you understand what is going on and hopefully solve this.

ppolewicz commented 3 years ago

Rather than giving you the tools to manually configure the program so that it doesn't crash on you, I'd like to come up with something that configures itself automatically (so if running on 8 threads causes problems, the number of threads is decreased until either a single thread remains or the problem disappears).

The program needs to know what your exit criterion is, though (because otherwise we could just set infinite retries and it would eventually complete - but that's infeasible for many use cases). Can it be a timeout for the entire (sync) operation?
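
A rough sketch of that auto-tuning idea (not b2sdk code; download_with_threads stands in for a hypothetical callable that runs the whole download with a given thread count, and the exception is the b2sdk.exception.TruncatedOutput seen in the tracebacks above):

from b2sdk.exception import TruncatedOutput

def download_with_fallback(download_with_threads, max_threads=8):
    """Retry the download with progressively fewer threads until it succeeds
    or we are already down to a single thread."""
    threads = max_threads
    while True:
        try:
            return download_with_threads(threads)
        except TruncatedOutput:
            if threads == 1:
                raise  # even a single-threaded download failed, so give up
            threads = max(1, threads // 2)  # halve concurrency and retry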

ppolewicz commented 2 years ago

Backblaze/b2-sdk-python#32 improved this a little bit, but the fix is not complete - a sync operation can create N*10 threads for downloads, which can cause thread starvation and eventually a timeout. A proper thread pool must be introduced.
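
For reference, the bounded-pool idea is roughly this (a sketch using concurrent.futures, not the actual b2sdk implementation; download_one is a hypothetical per-file download function):

from concurrent.futures import ThreadPoolExecutor, as_completed

def sync_downloads(files, download_one, max_workers=10):
    """Push every download through one shared pool so the total number of
    threads stays bounded instead of growing with the number of files."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(download_one, f): f for f in files}
        for future in as_completed(futures):
            future.result()  # re-raise any download error, e.g. TruncatedOutput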

Lusitaniae commented 2 years ago

Still relevant

I have a bucket full of large files and I'm seeing too many errors

b2 version
b2 command line tool, version 3.2.0
b2sdk.exception.TruncatedOutput: only 1436752953 of 1642003375 bytes read
b2_download(66528199/hourly/snapshot-66882152-ChfbSYmtqJihqnsZ5wnQYwbSbTYmXi1eYQXnUUEweU28.tar.zst, 4_zd17dff9369850c337cdc0e17_f20380946ed246897_d20211118_m110443_c003_v0312010_t0008, /home/user/missing-data/hourly/snapshot-66882152-ChfbSYmtqJihqnsZ5wnQYwbSbTYmXi1eYQXnUUEweU28.tar.zst, 1614472054000): TruncatedOutput() only 1436752953 of 1642003375 bytes read
Exception in thread Thread-694:: 0/81 files   212 / 460 GB   109 MB/s

Average file: 1.6G

Total files:

ls -lha . | wc -l
80

Incomplete files:

ls -lha . | grep sync | wc -l
8
lisa commented 2 years ago

Had this on 3.3.0 as well:

  File "/tmp/b2/lib/python3.8/site-packages/b2sdk/sync/action.py", line 49, in run
    self.do_action(bucket, reporter)
  File "/tmp/b2/lib/python3.8/site-packages/b2sdk/sync/action.py", line 273, in do_action
    downloaded_file.save_to(download_path)
  File "/tmp/b2/lib/python3.8/site-packages/b2sdk/transfer/inbound/downloaded_file.py", line 174, in save_to
    self.save(file, allow_seeking=allow_seeking)
  File "/tmp/b2/lib/python3.8/site-packages/b2sdk/transfer/inbound/downloaded_file.py", line 157, in save
    self._validate_download(bytes_read, actual_sha1)
  File "/tmp/b2/lib/python3.8/site-packages/b2sdk/transfer/inbound/downloaded_file.py", line 113, in _validate_download
    raise TruncatedOutput(bytes_read, self.download_version.content_length)
b2sdk.exception.TruncatedOutput: only 469104401846 of 521137401660 bytes read

On disk the file appears to be its proper size while still retaining the zip.tmp extension, but there is significant missing data, which is to be expected given the inability to read around 52 GB.

ppolewicz commented 2 years ago

@Lusitaniae version 3.3.0 introduced a common thread pool, which should greatly reduce, if not eliminate, the chance of hitting this error due to a client-side issue.

@lisa As for hitting it post-3.3.0, it's hard to say what this could be. Please run it with --debugLogs and let me take a look; I suspect there may be another cause that leads to the same exception. The logs will tell.

lisa commented 2 years ago

@ppolewicz I have a pretty sizeable sync in progress. If time permits I'll attempt again with --debugLogs. I'm hoping the sync will continue to fetch this particular snapshot and make a retry unnecessary.

ppolewicz commented 2 years ago

Sync aborts when a failure is believed to be permanent (when the retries are exhausted). If it's still running, then maybe a download request encountered a failure, sync retried the download and it passed.

lisa commented 2 years ago

@ppolewicz I have attached a redacted debug log, b2_cli-REDACTED.log (redacting bucket ID, file ID and filenames), which shows three backtraces.

There are six snapshot zip files to retrieve, so three backtraces represents a 50% failure rate (so far) and another financial cost to retry.

ppolewicz commented 2 years ago

There are many failures hidden in there, and as a consequence of exhausting the retry limit, individual file downloads fail.

I think what we should focus on here is finding out how to prevent the partial failures from occurring, not retrying more on failures.

@lisa thank you for redacting the private stuff, I appreciate it. In fact I think this should be redacted by default, we even had a ticket for diagnostic data anonymization somewhere.

One of the known reasons behind failed downloads is an output device that is too slow; another is a network that is too slow. Unfortunately there is no automatic diagnostic information collected by the client to determine where the bottleneck is (practically no software reports this kind of information, which doesn't mean we shouldn't, but we haven't built this type of reporting yet).

Before we pump a ton of hours into a new reporting system we may never use again, I'd like to ask:

  1. Where are you writing this? Is it a GCP-provisioned network drive with super limited IOPS maybe, or a NAS with a recovering drive at the time of download?
  2. What are you downloading this over? Hopefully not a satellite receiver in stormy weather :)
lisa commented 2 years ago

One of the known reasons behind failed downloads is an output device that is too slow; another is a network that is too slow. Unfortunately there is no automatic diagnostic information collected by the client to determine where the bottleneck is (practically no software reports this kind of information, which doesn't mean we shouldn't, but we haven't built this type of reporting yet).

Before we pump a ton of hours into a new reporting system we may never use again, I'd like to ask:

  1. Where are you writing this? Is it a GCP-provisioned network drive with super limited IOPS maybe, or a NAS with a recovering drive at the time of download?
  2. What are you downloading this over? Hopefully not a satellite receiver in stormy weather :)

The restore is being done in Southern Ontario (Greater Toronto Area) on a 250/20 Mbit connection, with network priority largely given to the host performing the sync. The target is a NAS appliance with the target volume in a three-member RAID5 configuration. To be clear, the sync is running directly on the NAS over a wired connection, not via a network share. Presently there are no other significant activities taking place on the device besides the sync (the device is not serving any active network shares).

ppolewicz commented 2 years ago

Oh, ok. @lisa please check the health of:

  • the NAS itself (make sure it has free ram, is not swapping, is not overloaded on the CPU side, especially on iowait (top generally shows these)
  • the RAID array. Perhaps a drive is rebuilding - with large HDDs that can take like a week even if there is no load and some devices slow down rebuild for user load
  • network stack (mainly link speed and drop/retransmitted packets - sometimes when a cable is faulty linux decides to stick to 10Mbps until it's told to do otherwise)
  • if that doesn't give any clues, tests performed via hdparm device write speed test and download speed test could eliminate threats of bad drive, bad NIC, bad cable

I strongly suspect a NAS device has limited memory and is swapping, causing overall overload and inability to receive data fast enough, which is causing the server to drop the connection when it realizes no data has been successfully received over the socket in the last 2 minutes (that was the observed (undocumented) behavior of the server ~3 years ago, but I doubt that changed).

I managed to run into a similar situation a while back due to very bad network card driver which got overwhelmed at like 14Mbps. That was a VM though, not a NAS.

lisa commented 2 years ago

Oh, ok. @lisa please check the health of:

  • the NAS itself (make sure it has free ram, is not swapping, is not overloaded on the CPU side, especially on iowait (top generally shows these)
  • the RAID array. Perhaps a drive is rebuilding - with large HDDs that can take like a week even if there is no load and some devices slow down rebuild for user load
  • network stack (mainly link speed and drop/retransmitted packets - sometimes when a cable is faulty linux decides to stick to 10Mbps until it's told to do otherwise)
  • if that doesn't give any clues, tests performed via hdparm device write speed test and download speed test could eliminate threats of bad drive, bad NIC, bad cable

I strongly suspect a NAS device has limited memory and is swapping, causing overall overload and inability to receive data fast enough, which is causing the server to drop the connection when it realizes no data has been successfully received over the socket in the last 2 minutes (that was the observed (undocumented) behavior of the server ~3 years ago, but I doubt that changed).

I managed to run into a similar situation a while back due to very bad network card driver which got overwhelmed at like 14Mbps. That was a VM though, not a NAS.

While this latest sync is in progress I'm not willing to benchmark the device (to leave disk i/o for this sync process), but the RAID volume is not rebuilding. There are no apparent networking issues:

lisa@osmium:/volume1/lifeboat/Restore$ ifconfig eth0
eth0      Link encap:Ethernet  HWaddr XX:XX:XX:10:38:27
          inet addr:aaa.bbb.ccc.ddd  Bcast:aaa.bbb.ccc.255  Mask:255.255.255.0
          inet6 addr: *redacted*
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11015721726 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3631560901 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:16620212467945 (15.1 TiB)  TX bytes:256642476411 (239.0 GiB)
          Interrupt:96 base 0xa000

The NAS has plenty of memory available; 8 GB. CPU is not taxed.

lisa@osmium:/volume1/lifeboat/Restore$ free -m
              total        used        free      shared  buff/cache   available
Mem:           7792         519         133           9        7139        6967
Swap:          6723         317        6406

Yes, there is swap in use, however the rate of swapping is very low (gauged by the built-in metrics graphing utility), with only a few pages at a time, long stretches between each.

I have no reason to doubt the physical components of the network (including the network cable) considering how many packets have transited the link without issue.

ppolewicz commented 2 years ago

I see. Perhaps you could run a very quick hdd test, lasting ~1s?

lisa commented 2 years ago

I ran some tests:

lisa@osmium:/volume1/lifeboat/Restore$ mount | grep volume1
/dev/mapper/cachedev_0 on /volume1 type btrfs (rw,nodev,relatime,ssd,synoacl,space_cache=v2,auto_reclaim_space,metadata_ratio=50,block_group_cache_tree,subvolid=256,subvol=/@syno)

lisa@osmium:/volume1/lifeboat/Restore$ sudo hdparm -tT /dev/mapper/cachedev_0
/dev/mapper/cachedev_0:
 Timing cached reads:   7242 MB in  2.00 seconds = 3622.56 MB/sec
 Timing buffered disk reads: 1388 MB in  3.00 seconds = 461.96 MB/sec

lisa@osmium:/volume1/lifeboat/Restore$ time dd if=/dev/zero of=/volume1/lifeboat/writetest bs=4096 count=2441407
2441407+0 records in
2441407+0 records out
10000003072 bytes (10 GB, 9.3 GiB) copied, 29.8649 s, 335 MB/s

real    0m29.910s
user    0m0.606s
sys 0m12.138s

They didn't even seem to have any impact on the still-running sync which was nice. ;)

When I dedicate my entire downlink to the sync I hit approximately 30MB/s down, which is an order of magnitude less than the read and write speeds for the RAID volume.

ppolewicz commented 2 years ago

Ok, as a sysadmin I would do a few more things:

  1. Check whether the CPU is a bottleneck using top; look if softirq or iowait are high, which would indicate a bottleneck on seeks that hdparm/dd did not detect
  2. Reconfigure the CLI: use b2 sync --syncThreads 1 --downloadThreads 5. Actually, I'd experiment with the download threads value, running it for ~15s with an incrementally higher number starting from 1 to see when the speed stops increasing, then leave it at one more thread than is needed for maximum performance, or reduce it to just below max performance, depending on what's happening in the network
  3. Try --noProgress, it might help in the case of some low-performance terminals
  4. Run ping from a screen in parallel with tcpdump writing its data out into separate files (-w, -C). The thesis is that the network is periodically impaired by flapping BGP somewhere between your network and the B2 network. Probably wise to also run a few "unrelated" pings to cloudflare (1.1.1.1), google (8.8.8.8) etc
  5. Generate download URLs using the CLI, then sequentially download the files using wget. The network is only 30 MB/s; wget should be able to almost saturate it with a single thread even on high latency, so you should be fine with one thread
  6. Try another tool to run the download of individual files, rclone or something

grapevine2383 commented 2 years ago

Trying to download a 700 GB file using b2 3.4.0 download-file-by-id and it keeps failing around 500-700 GB with "ERROR: only x of x bytes read". I've retried multiple times already. Any way to get it to download fully, or at least resume if it fails, so I don't have to retry all the way from the beginning?

ppolewicz commented 2 years ago

@jimmaay the most likely cause is that the number of threads is too high for your environment, please try to run it with --downloadThreads 5

grapevine2383 commented 2 years ago

I used the aria2c multi-threaded downloader set at 12 threads to download over HTTPS without error. This was on a high-performance 24-thread dedicated server without much running on it, as it's part of a migration. downloadThreads defaults to 10, so I'm not sure why it's having issues when a multi-threaded HTTP download using aria2c with more threads doesn't have any.

ppolewicz commented 2 years ago

In order for the B2 CLI to fail, it must have experienced an error (usually a timeout) on one of the file parts 5 times. Perhaps your network was shaky when you tried to download with the CLI, or perhaps this other tool tries more than 5 times? Not sure.

We should probably rethink the retry policy 🤔

grapevine2383 commented 2 years ago

I've also been using wget successfully as well, and it defaults to 20 retries. Also, when I used b2, the download sometimes froze at a certain percentage for a couple of minutes or more, whereas wget and other tools showed constant progress.

ppolewicz commented 2 years ago

I am personally really sensitive about download performance and reliability. Someone calling the b2 CLI "worse than wget" is definitely not what I want. I suspect that it is environmental, but even if it is, we should be detecting the problem and reporting it in a way that is very clear to the user.

It seems that this issue is reproducible in your environment. Would you run some additional diagnostics, potentially with a version of the CLI with extra logging capabilities, to get to the bottom of this issue?

grapevine2383 commented 2 years ago

The server is in production right now so I can't mess around with it, but it was the latest version of AlmaLinux 8 without much extra installed at the time. The server is located in the US. If you could add a debug option in a future version, I'll run it with debug enabled next time.


darkBuddha commented 2 years ago

I am trying to download a big file on a dedicated server; we have redundant fiber. For the fifth time, the b2 CLI application has just randomly stopped. Why is this happening? I have lost around $30 on transfer costs now, for NOTHING. This is not what I expected from Backblaze at all.

totaltentigers commented 2 years ago

I got the same issue today with a very large file that takes a full 24 hours to download at gigabit speeds. 30 minutes before the download was supposed to complete, it errored out. This is the second time I've attempted to download my file, costing me $150 already. The b2 tool seriously needs a way to resume an incomplete download

grapevine2383 commented 2 years ago

For anyone with this issue, I'd recommend using curl, wget, aria2c or any other reliable HTTPS downloader for now. I thought it was an issue with B2's network this whole time, with the freezes, slow speeds and errors, but I have been using wget and aria2c many times since I stopped using this tool and it's been smooth sailing. I haven't had any errors, freezing, or slowness since switching; I've downloaded ~15 TB worth of files and can even pause/resume easily with these tools. It sucks to hear of so much wasted time, money, and bandwidth.

For the developers at Backblaze of this tool, or of the b2 Python SDK it uses, I'd recommend switching the B2 HTTP library to something more reliable like pycurl (libcurl), since I can see it's just using the B2 REST API underneath.

totaltentigers commented 2 years ago

For anyone with this issue, I'd recommend using curl, wget, aria2c or any other reliable HTTPS downloader for now. I thought it was an issue with B2's network this whole time, with the freezes, slow speeds and errors, but I have been using wget and aria2c many times since I stopped using this tool and it's been smooth sailing. I haven't had any errors, freezing, or slowness since switching; I've downloaded ~15 TB worth of files and can even stop/resume easily with these tools.

For the developers of this tool, I'd recommend switching the B2 HTTP library to something more reliable like curl, since I can see it's just utilizing the B2 REST API underneath.

Thanks for the suggestion. Downloading a very large file (~10 TB) has been extremely frustrating with 2 failed attempts already. I've heard good things about aria2 and will give it a shot.

darkBuddha commented 2 years ago

For everybody who also lost time and money due to the buggy official B2 client (WTF), you can get the HTTPS link via

b2 get-download-url-with-auth --duration 604800 b2-snapshots-b12345678901 bzsnapshot_2009-01-01-01-01-01.zip

You can then download it in the background, not attached to a terminal session, with a production-ready well-tested downloader that doesn't have random hiccups, via

nohup wget -q https://abcd.backblazeb2.com/file/b2-snapshots-b12345678901/bzsnapshot_2009-01-01-01-01-01.zip\?Authorization\=XXXXXXXXXXXXXX -O snapshot.zip &

@ppolewicz you should openly communicate that this tool is not production-ready. GitHub issues are not the place for this information. Let's see if Backblaze will refund my charges; I will report back.

Good luck everybody!

totaltentigers commented 2 years ago

Just leaving an update that aria2 in a Docker container was able to successfully download my 10 TB file. It took about 2.5x as long as b2 but I had no issues. I'll also be trying to get a refund for my charges while using b2

ppolewicz commented 2 years ago

It is interesting to see these reports of failed downloads appearing 3 years after the last major change was made to the implementation of the download system. It suggests that something other than the code might have changed here - maybe it's the network conditions around the Backblaze ISP, maybe it's something else. I'd like to find out and, if possible, adjust the code so that users don't hit the condition that causes their transfers to fail.

The B2 CLI has been production-ready for a long time, but like any software, open source or not, it's not free of bugs. Here, at least, if you are hitting a serious bug, you can count on it being addressed - if you help by providing diagnostic information, that is.

Apparently using a single-threaded client such as wget, with its 20 retry attempts, allows you to download a huge file successfully, while using the b2 CLI, with 5 retries per part, fails. The scenario I guess is taking place here is that there is some kind of failure either on the B2 clusters or somewhere on the network. I say "guess" because we don't really know: nobody ran the download with --debugLogs, which would produce a file telling us more about what happened (specifically whether it was a network failure or a server failure, and what kind of problem it was exactly).

In 2007 I used to have a major problem with a WAN system where the network on the source server in Poland was fine and the network on the destination server in France was also fine, but there was an overloaded internet exchange point in Frankfurt, Germany, that was dying from high load at peak time - not even every day. What this caused was heavy packet loss, which made the BGP servers around it reroute the traffic around this IX. Five minutes later the traffic on the IX was much lower and packet loss went back to zero, so the BGP servers rerouted the traffic back to the IX, causing an overload - and this happened in a cycle for an embarrassingly long time, causing 30 s traffic gaps at 5-minute intervals. That may have been 15 years ago, but as far as I know not much has changed on that layer, and when I analyse traffic patterns nowadays I still see the same behavior (especially on DigitalOcean, for some reason).

It may also be that a Backblaze server on one of the clusters that you use is experiencing a failure that takes a few minutes to recover. B2 uses erasure coding, not replicas, to maintain the consistency of the data it stores, and in consequence the storage it offers has very competitive pricing; however, the way the system is operated is somewhat different. I don't know the details of this system, but I worked on another system based on erasure coding, and let me tell you, designing something to work despite hard drives dying left and right is an engineering challenge. That being said, from my experience with B2 storage, it handles things quite well. I do not put 10 TB files on there as @jakemoritz does - in fact I'd love to talk a bit more about those, as I'm pretty sure some nice optimizations can be put in place if we can understand this type of use case better. That's for the future, though; now back to the matter at hand.

wget is patient - it'll retry 20 times before giving up, and if one of those longer (or worse, repetitive) failures is experienced, wget is able to work around it. In this particular scenario the B2 CLI in its current version may not be as patient as wget, especially, let me add, when the failure happens at the very end of the file (please see this comment by @jakemoritz!).

The one thing I can do now is change the internal implementation detail of b2-sdk-python to increase the number of retries per part from 5 to 20. In consequence, a complete failure of the network on a client machine will not cause the CLI to exit as quickly as it did before the change, but on the other hand I hope nobody will call the B2 CLI "worse than wget" ever again.

ppolewicz commented 2 years ago

I had to rewrite the exponential backoff algorithm, or else the 20th retry would wait almost an hour.
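
For illustration only (not the actual b2sdk algorithm): with an uncapped multiplicative backoff the late retries dominate the waiting time - growing by a factor of 1.5 from a 1 s base, for example, puts the 20th sleep alone at roughly 1.5^19 ≈ 2200 s. Capping the interval keeps a 20-retry schedule practical:

import random

def backoff_intervals(retries=20, base=1.0, factor=1.5, cap=60.0):
    """Yield capped, jittered sleep intervals for successive retries."""
    delay = base
    for _ in range(retries):
        yield min(delay, cap) * random.uniform(0.8, 1.2)  # jitter avoids lockstep retries
        delay *= factor

# With the cap, the whole 20-retry schedule sleeps on the order of 12 minutes
# in total, instead of growing without bound.
print(sum(backoff_intervals()))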