Backblaze / B2_Command_Line_Tool

The command-line tool that gives easy access to all of the capabilities of B2 Cloud Storage

TruncatedOutput: only 1495314753 of 2121389892 bytes read #554

Closed Desto7 closed 1 month ago

Desto7 commented 5 years ago

I'm able to use "b2 download-file-by-name" to download small files, but when I target a 2.1 GB file, it crashes out randomly midway through. (I have run the command at least 10 times over the course of two days; each time it crashed after having downloaded somewhere between 1.4 and 2.0 GB of the 2.1 GB.) Reading through the issues page, it seemed that "b2 sync" is the recommended approach, but the same issue remains there too, crashing out at about 1.7 GB. Since no one else appears to have this rather fundamental problem, I suspect it's related to my region/ISP/home network. Still... any help would be appreciated. I have attached a --debugLog and pasted a typical command line response here. Thanks in advance.

b2_cli.log

CMD:

C:\Users\Desto>b2 download-file-by-name PeterBackup1 PB1/PB.000.000.000.004.pb8 "C:\Users\Desto\Desktop\Matlab\Projects 2018\Peter Backup\PB1_BackupCheck\temp\PB.000.000.000.004.pb8"

Output:

-snip- C:\Users\Desto\Desktop\Matlab\Projects 2018\Peter Backup\PB1_BackupCheck\temp\PB.000.000.000.004.pb8: 70%|7| 1.50G/2.12G [05:37<03:15, 3.20MB/s]
ERROR:b2.console_tool:ConsoleTool command error
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\b2\console_tool.py", line 1399, in run_command
    return command.run(args)
  File "c:\python27\lib\site-packages\b2\console_tool.py", line 532, in run
    bucket.download_file_by_name(args.b2FileName, download_dest, progress_listener)
  File "c:\python27\lib\site-packages\logfury\v0_1\trace_call.py", line 84, in wrapper
    return function(*wrapee_args, **wrapee_kwargs)
  File "c:\python27\lib\site-packages\b2\bucket.py", line 168, in download_file_by_name
    url, download_dest, progress_listener, range_
  File "c:\python27\lib\site-packages\logfury\v0_1\trace_call.py", line 84, in wrapper
    return function(*wrapee_args, **wrapee_kwargs)
  File "c:\python27\lib\site-packages\b2\transferer\transferer.py", line 115, in download_file_from_url
    range_, bytes_read, actual_sha1, metadata
  File "c:\python27\lib\site-packages\b2\transferer\transferer.py", line 122, in _validate_download
    raise TruncatedOutput(bytes_read, metadata.content_length)
TruncatedOutput: only 1495314753 of 2121389892 bytes read
ERROR: only 1495314753 of 2121389892 bytes read

darkBuddha commented 2 years ago

@ppolewicz just use http://pycurl.io/. No need to reinvent the wheel. Let TCP handle any external hiccups.

I am quite sure there were none in my case. I assume that the files people move just got bigger over time, so minor design flaws in the CLI app now have more exposure than before, and unit tests probably don't use 10 TB files in real, external data centers. I think it does not matter whether the number of retries is 5 or 20 or 1000, and BGP, failing HDDs etc. play no role here.

I did not know about the debug flag and hence didn't use it, but can give you some facts:

I would be happy to provide any additional required information to help fix the bug. Thanks for looking into it.

ppolewicz commented 2 years ago

B2 CLI and b2-sdk-python use python-requests, the de-facto standard library for HTTP requests in Python. Switching to pycurl might be possible, but currently the B2 CLI runs on just about anything, and if we required pycurl it would only run on systems that have pycurl, which is not ideal. We could use pycurl when libcurl is available and otherwise stay with requests, but then the entire application would follow the same internal protocol for both HTTP library implementations, and at that point I seriously doubt it would fix whatever bug you have encountered.
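
(For illustration only - this is not the actual b2/b2-sdk-python code, and the names are made up - an optional backend selection could look roughly like this:)

    # Illustrative sketch, not the real b2sdk code: prefer pycurl when its
    # libcurl bindings are installed, otherwise fall back to requests.
    try:
        import pycurl  # optional, only present if libcurl bindings are installed
        HTTP_BACKEND = "pycurl"
    except ImportError:
        import requests
        HTTP_BACKEND = "requests"

    print("using HTTP backend:", HTTP_BACKEND)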

The logic that we add on top of the HTTP wrapper is quite significant: we download with multiple threads, so on a fast CPU you can download faster than a single stream Backblaze can output, which is especially important at higher latencies. If a part breaks, we continue the download from the byte on which it was interrupted. On fatal errors (quota exceeded, for example) we exit rapidly, while on retryable errors we retry and comply with the exponential backoff mandated by the B2 integration checklist. We renegotiate tokens if they expire, and so on - and that's just downloads; on upload we do even more cool stuff.
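
(As a rough illustration of the resume-from-last-byte and backoff behavior described above - a minimal sketch with plain requests, not the actual b2sdk implementation; the function name and retry count are made up:)

    import time
    import requests

    def download_range_with_resume(url, out_file, start, end, max_attempts=5):
        # Fetch bytes [start, end] of url into out_file, resuming from the last
        # byte received after each failure, with exponential backoff in between.
        bytes_read = 0
        for attempt in range(max_attempts):
            try:
                headers = {"Range": f"bytes={start + bytes_read}-{end}"}
                with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
                    resp.raise_for_status()
                    for chunk in resp.iter_content(chunk_size=64 * 1024):
                        out_file.write(chunk)
                        bytes_read += len(chunk)
                if bytes_read == end - start + 1:
                    return bytes_read      # the whole range arrived
            except requests.RequestException:
                pass                       # retryable: resume where we stopped
            time.sleep(2 ** attempt)       # exponential backoff between attempts
        raise IOError(f"only {bytes_read} of {end - start + 1} bytes read")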

Whatever caused your issue, switching to pycurl is unlikely to help - in the end, requests and pycurl do basically the same thing at the HTTP level, and in my experience most issues like the one you've encountered are caused by the network or the server.

We can't do much to address the actual problem until we know what it is, and for that we need a failed download operation run with --debugLogs.

taylorwood commented 1 year ago

I've been trying to download a 750GB snapshot from B2 for a week. Here are the things I've tried:

  1. Download snapshot using Safari: download won't even start ❌
  2. Download snapshot using Chrome: every attempt fails after a few hours ❌
  3. Install Backblaze Downloader for Mac: doesn't download snapshots, and my file selection is too big for the Restore feature ❌
  4. Install B2 CLI with Homebrew, create App Key, authorize, use b2 sync to sync the snapshot bucket to a local drive; failed with TruncatedOutput error at varying times over ~5 attempts ❌
  5. Install aria2c with Homebrew, download authorized URL ✅
    1. Use b2 get-download-url-with-auth to get HTTPS URL with authorization for the snapshot file:
      b2 get-download-url-with-auth --duration 604800 b2-snapshots-1234 files_2022-11-15-12-26-04.zip
    2. Download the resulting URL using aria2c with all default settings; works on the first try!

After this experience I can safely say: if you have a large file/snapshot to download from Backblaze/B2, you're better off ordering a physical drive from them (then returning it for a refund), or ignoring all their documentation and using a third-party downloader like wget, aria2c, etc.

It's not my issue to reopen but it doesn't seem resolved to me.

ppolewicz commented 1 year ago

@taylorwood - it seems that your network connectivity to Backblaze is not very stable. The last patch I did for this massively increased the number of retry attempts to 20; only after all of those are exhausted is the TruncatedOutput error returned. Now that I think about it, we might improve communication with the user by wording the error slightly differently. It won't change much though: your network (or storage device, or CPU starvation) will still have issues.

Since we are talking here and I've been planning to address this problem for quite a while now, let me ask you a question.

What should the B2 command line tool do when the connection breaks repeatedly? In the current design, after it breaks 20 times we give up. It seems that's not what you want. Would you like a mode with an unlimited number of retries, where the process instead stops after some other condition is hit, such as time without receiving any data, or total elapsed time, or something like that?
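
(A hypothetical sketch of what such a mode could look like - none of this exists in the CLI today:)

    import time

    def retry_until_stalled(attempt_once, stall_timeout=300, total_timeout=None):
        # attempt_once() returns (bytes_read_this_try, finished). Keep retrying
        # as long as data keeps arriving; give up only when nothing has been
        # received for stall_timeout seconds, or total_timeout has elapsed.
        start = last_progress = time.monotonic()
        while True:
            bytes_read, finished = attempt_once()
            if finished:
                return
            now = time.monotonic()
            if bytes_read > 0:
                last_progress = now
            if now - last_progress > stall_timeout:
                raise TimeoutError(f"no data received for {stall_timeout} seconds")
            if total_timeout is not None and now - start > total_timeout:
                raise TimeoutError(f"did not finish within {total_timeout} seconds")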

taylorwood commented 1 year ago

@ppolewicz networks are unreliable, but I don't think mine is any less reliable than the average US fiber subscriber.

Backblaze specializes in backing up huge amounts of data which may include very large files. I'd suggest at least offering a more robust retry/resume option for networks you deem unreliable. The responses so far mostly shift responsibility onto the end users, who are then seeking out and successfully using other tools.

darkBuddha commented 1 year ago

For my case, I am sure that there were no network issues. Why aren't such errors handled transparently by TCP anyway? To me, the entire threading thing looks unnecessary - why introduce the extra complexity? A single connection will also fill the pipe.

ppolewicz commented 1 year ago

A single connection will NOT fill the pipe, even when everything is going according to plan. I've run the tests and seen it with my own eyes. You need to think about the case where the TCP bandwidth-delay product exceeds the size of the server's send buffer. I'd argue that in this case TCP is making things slower - TCP is for streams, but nowadays everybody uses it for retrieving files, where additional optimizations could be added (specifically, not keeping the entire send buffer in RAM when the data can be re-read from a drive in case it gets lost in flight). It does tolerate quite a bit, as @darkBuddha observed, but even it has a limit. When TCP gives up, we try the entire connection again, starting from the place where the stream ended - so before declaring a failure, it's TCP trying to fix things and failing, 20 times over.
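
(Illustrative numbers for the bandwidth-delay product argument - assumed values, not measurements of any particular link:)

    # If the bandwidth-delay product exceeds the server's send buffer, a single
    # TCP stream cannot keep the pipe full and parallel streams start to help.
    bandwidth_bps = 1_000_000_000          # assume a 1 Gbps link
    rtt_seconds = 0.080                    # assume an 80 ms round trip
    bdp_bytes = bandwidth_bps / 8 * rtt_seconds
    print(f"bandwidth-delay product: {bdp_bytes / 1e6:.0f} MB")   # 10 MB
    # With e.g. a 4 MB send buffer, one stream tops out around
    # 4 MB / 0.080 s = 50 MB/s on this link, well short of 125 MB/s.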

Sometimes storage servers (all of them) need to rebuild data from a broken drive, and that slows things down. If the file is large enough, it's hard to imagine every single server serving it being 100% in perfect health all the time. This is why parallelization improves things. RAID-1 based services like AWS S3 are more impacted by this, while B2 handles this case exceptionally well due to erasure coding... But yes, drives at B2 fail too - there's even a lot of documentation on the Backblaze blog about it.

The current implementation of the CLI allows for more retries than wget (specifically, the same as wget for small files, and the same as wget per thread for large ones). Maybe some other tool allows for more, or maybe some of the tools you've been running this with were run at a time when the network was more stable. We can only guess, because nobody wants to share the logs of a failing download with me. As seen in other threads, I'm willing to go all the way to debug stuff - I'll help analyze a tcpdump to see what happened, I'm adding more logs and fixing everything I can. This is because I'm a bit obsessed with download reliability, having worked at a company that dealt with backup restores a lot.

In recent months I've even added extra statistics for the download parts, where --debugLogs saves very useful information about download performance, showing clearly whether it's the network or the storage device (or both) that the program has to wait for. During testing I downloaded a few terabytes of data and didn't have any reliability problems, not even once - but the network conditions were perfect, as I'd picked an AWS datacenter sitting right next to Backblaze... which again proves my point about network conditions. If the network is good, all works well. If it's not, TCP retries a few times, and then we multiply that by 20 per thread. It's a lot of retries.

What I think we could improve is communication: using Python 3 exception chaining we could gather information about the failures, and maybe some extra data about network/device performance (when a certain flag is provided). Then if something didn't work, we'd have more information about it from the program output even if logs are not enabled. I'll look into it.
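
(A minimal sketch of the exception chaining idea - the names only mirror the traceback above, this is not the b2sdk code:)

    class TruncatedOutput(Exception):
        pass

    def validate_download(bytes_read, content_length, last_network_error=None):
        # Chain the final error to the underlying network failure, so the
        # traceback shows *why* the download came up short, not just that it did.
        if bytes_read != content_length:
            raise TruncatedOutput(
                f"only {bytes_read} of {content_length} bytes read"
            ) from last_network_error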

darkBuddha commented 1 year ago

Why not just continue the download, maybe with a warning? Why is it written in Python and not Go?

ppolewicz commented 1 year ago

In many cases having a process hang forever is not desired. Would you like there to be an option that keeps trying until the user stops it with a ctrl+c or SIGTERM or something like that?

The CLI was written in Python in 2015. I think the original author was not proficient in golang at the time, which is why he used Python, and since then there has been no good reason to rewrite it in golang.

darkBuddha commented 1 year ago

Whoever is downloading important data will most likely see that it doesn't show progress and start looking into it. There's no reason to throw away the successfully downloaded chunks.

seliquity commented 1 year ago

Paweł, surely the solution is to make a functioning resume system? If you're trying to restore a 100 GB file, that's going to take long enough that people may have network issues or indeed just need to restart their computer.

Regarding nobody sending you debug logs: I sent my debug log over to support back in September (I don't wish to share it publicly), but I've also attached an extract here which shows the actual error I've seen repeatedly. I have a symmetric gigabit fibre line, so I don't think my connection is unstable, and judging by this debug log the problem seems more likely to be with the connection pooling system?

debug_log.txt

darkBuddha commented 1 year ago

The traffic is paid for, so automatic deletion of successfully downloaded chunks should never be possible. How is this not a major requirement?

From what I read here, most people who try to download huge backups have access to gigabit fiber anyway. So the pipe can't be filled anyway, even if you use 200 threads. The focus should be on reliability and cost reduction.

Imagine it the other way around: what if the BB uploader deleted all progress when some network issue or bug occurred after 2-3 TB of successful upload? That data should be kept.

Lusitaniae commented 1 year ago

Basically I have a bucket full of "snapshots"

Each snapshot is about 1-2 TB of a compressed archive, plus 2 text files with some metadata

I'm trying to download a batch of snapshots for local processing using b2 sync with regex, abbreviated command below:

B2_ACCOUNT_INFO=/etc/b2/b2_account b2 sync --debugLogs --includeRegex "(47951888/archive*|47951888/metadata1*|47951888/metadata2*|47951888/archive*|47951888/metadata1*|47951888/metadata2*)" --excludeRegex ".*" b2://bucket-name/ /tmp/workdir

Basically every time I try to run it, it fails with TruncatedOutput.

I could send you the full log privately, but here are some excerpts:

created part to download: PartToDownload(Range(0, 12198148352), Range(0, 12198148352))
created part to download: PartToDownload(Range(12198148353, 24396296705), Range(12198148353, 24396296705))
created part to download: PartToDownload(Range(24396296706, 36594445058), Range(24396296706, 36594445058))
created part to download: PartToDownload(Range(36594445059, 48792593411), Range(36594445059, 48792593411))
created part to download: PartToDownload(Range(48792593412, 60990741764), Range(48792593412, 60990741764))
created part to download: PartToDownload(Range(60990741765, 73188890117), Range(60990741765, 73188890117))
created part to download: PartToDownload(Range(73188890118, 85387038470), Range(73188890118, 85387038470))
created part to download: PartToDownload(Range(85387038471, 97585186823), Range(85387038471, 97585186823))
created part to download: PartToDownload(Range(97585186824, 109783335176), Range(97585186824, 109783335176))
created part to download: PartToDownload(Range(109783335177, 121981483529), Range(109783335177, 121981483529))
calling ThreadPoolExecutor(max_workers=10, thread_name_prefix='', initializer=None, initargs=())
calling ThreadPoolExecutor(max_workers=10, thread_name_prefix='', initializer=None, initargs=())
download attempts remaining: 5, bytes read already: 0. Getting range Range(12198148353, 24396296705) now.
download attempts remaining: 5, bytes read already: 0. Getting range Range(24396296706, 36594445058) now.
Starting new HTTPS connection (4): f003.backblazeb2.com:443
download attempts remaining: 5, bytes read already: 0. Getting range Range(36594445059, 48792593411) now.
download attempts remaining: 5, bytes read already: 0. Getting range Range(48792593412, 60990741764) now.
Starting new HTTPS connection (5): f003.backblazeb2.com:443
download attempts remaining: 5, bytes read already: 0. Getting range Range(60990741765, 73188890117) now.
download attempts remaining: 5, bytes read already: 0. Getting range Range(73188890118, 85387038470) now.
download attempts remaining: 5, bytes read already: 0. Getting range Range(85387038471, 97585186823) now.
download attempts remaining: 5, bytes read already: 0. Getting range Range(97585186824, 109783335176) now.
Starting new HTTPS connection (6): f003.backblazeb2.com:443
Starting new HTTPS connection (7): f003.backblazeb2.com:443
download attempts remaining: 5, bytes read already: 0. Getting range Range(109783335177, 121981483529) now.
Starting new HTTPS connection (8): f003.backblazeb2.com:443
Starting new HTTPS connection (9): f003.backblazeb2.com:443
Starting new HTTPS connection (10): f003.backblazeb2.com:443
Starting new HTTPS connection (11): f003.backblazeb2.com:443
Starting new HTTPS connection (12): f003.backblazeb2.com:443

download stats | /storage/workdir/48815904/rocksdb.tar.bz2.b2.sync.tmp[writer] | TTFB: 0.003 ms
download stats | /storage/workdir/48815904/rocksdb.tar.bz2.b2.sync.tmp[writer] | read() without TTFB: 408371.554 ms
download stats | /storage/workdir/48815904/rocksdb.tar.bz2.b2.sync.tmp[writer] | seek total: 501.646 ms
download stats | /storage/workdir/48815904/rocksdb.tar.bz2.b2.sync.tmp[writer] | write() total: 40836.703 ms
download stats | /storage/workdir/48815904/rocksdb.tar.bz2.b2.sync.tmp[writer] | overhead: 425.336 ms
download attempts remaining: 5, bytes read already: 0. Getting range Range(12155451033, 24310902065) now.
Resetting dropped connection: f003.backblazeb2.com

download stats | /storage/workdir/50544015/rocksdb.tar.bz2.b2.sync.tmp[writer] | TTFB: 0.004 ms
download stats | /storage/workdir/50544015/rocksdb.tar.bz2.b2.sync.tmp[writer] | read() without TTFB: 934607.596 ms
download stats | /storage/workdir/50544015/rocksdb.tar.bz2.b2.sync.tmp[writer] | seek total: 481.209 ms
download stats | /storage/workdir/50544015/rocksdb.tar.bz2.b2.sync.tmp[writer] | write() total: 37122.464 ms
download stats | /storage/workdir/50544015/rocksdb.tar.bz2.b2.sync.tmp[writer] | overhead: 398.623 ms

download stats | /storage/workdir/50544015/rocksdb.tar.bz2.b2.sync.tmp | finish_hash total: 49619.662 ms
2023-07-06 08:13:09,559 704505  140397540484864 b2sdk.sync.action   ERROR   an exception occurred in a sync action
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/b2sdk/sync/action.py", line 53, in run
    self.do_action(bucket, reporter)
  File "/usr/local/lib/python3.8/dist-packages/b2sdk/sync/action.py", line 325, in do_action
    downloaded_file.save_to(download_path)
  File "/usr/local/lib/python3.8/dist-packages/b2sdk/transfer/inbound/downloaded_file.py", line 204, in save_to
    self.save(file, allow_seeking=allow_seeking)
  File "/usr/local/lib/python3.8/dist-packages/b2sdk/transfer/inbound/downloaded_file.py", line 187, in save
    self._validate_download(bytes_read, actual_sha1)
  File "/usr/local/lib/python3.8/dist-packages/b2sdk/transfer/inbound/downloaded_file.py", line 143, in _validate_download
    raise TruncatedOutput(bytes_read, self.download_version.content_length)
b2sdk.exception.TruncatedOutput: only 109592269439 of 121769188265 bytes read

download stats | /storage/workdir/47951888/rocksdb.tar.bz2.b2.sync.tmp[writer] | TTFB: 0.757 ms
download stats | /storage/workdir/47951888/rocksdb.tar.bz2.b2.sync.tmp[writer] | read() without TTFB: 1458665.578 ms
download stats | /storage/workdir/47951888/rocksdb.tar.bz2.b2.sync.tmp[writer] | seek total: 523.423 ms
download stats | /storage/workdir/47951888/rocksdb.tar.bz2.b2.sync.tmp[writer] | write() total: 38689.805 ms
download stats | /storage/workdir/47951888/rocksdb.tar.bz2.b2.sync.tmp[writer] | overhead: 423.280 ms
download stats | /storage/workdir/47951888/rocksdb.tar.bz2.b2.sync.tmp | finish_hash total: 47839.309 ms
tteggelit commented 1 year ago

I just uploaded about 250 GB of data (about 5.5k files, ranging from tens of kB to 1+ GB) using the sync command and the upload had no problems. I wanted to verify the data was successfully transferred, so I then downloaded it again - using the same device I uploaded with, just to a different location. The vast majority of files synced down successfully the first time, but about 150 were reliably failing on every single attempt. I attempted it 5 times (getting charged for about 450 GB of data downloads). The files that failed ranged from tens of MB to 1+ GB. At one point I even deleted all the failing files from B2 and re-uploaded them (using sync) and they still continued to fail to download. This is all on a reliable 1 Gb network connection. Reviewing the comments here, I decided to try --threads 1 with the download and it worked flawlessly the first time for all files.

ppolewicz commented 1 year ago

@tteggelit this might have been caused by network conditions, where something that supervises your connection (anti-botnet, anti-DDoS or something like that) might really not like you loading the network with a lot of connections. In your case, however, you only have a 1 GbE connection, so assuming no overhead it should be capable of transferring about 125 MB/s. The B2 CLI can do 610 MB/s (on Python 3.11 on Linux), so you will easily saturate the network with just a single thread (at low latency - at higher latency you can use more). Running more threads than your network can handle can lead to broken transfers.
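
(The arithmetic behind that, using the figures quoted above:)

    link_bps = 1_000_000_000        # 1 GbE
    link_MBps = link_bps / 8 / 1e6  # = 125 MB/s before protocol overhead
    cli_MBps = 610                  # quoted single-process B2 CLI throughput
    print(link_MBps < cli_MBps)     # True: one thread can already saturate the link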

The documentation of sync says:

Users with low-performance networks may benefit from reducing the number of threads. Using just one thread will minimize the impact on other users of the network.

ppolewicz commented 1 year ago

@Lusitaniae TruncatedOutput is usually a consequence of an irreparable error that happened before it; it just indicates that the download has failed, the retries have also failed, and we ended up with a file that is not fully downloaded, so we stop the sync operation completely. I think we are going to change this behavior in B2 CLI 4.0 to allow users to keep recovery attempts going for a set period of time rather than 5 attempts per fragment, but if a fragment fails to download 5 times in a row, that's a problem and we should find out what it is. Are there other exceptions in the logs that you haven't shown?

The performance counters show that it's mainly the network you are waiting for. This is expected behavior - it just means that your storage device is not too slow (23min waiting for network, 38s waiting for the drive).

tteggelit commented 1 year ago

@ppolewicz Blaming the network doesn't quite explain why the upload - which I assume uses the same default thread concurrency - worked fine, and why the vast majority of the downloaded files (like 97%) worked the first time, but 5 subsequent attempts at these specific 150 files kept failing. I have Speedtest results being gathered from my network, and while I can certainly see an impact on the bandwidth available to Speedtest during these transfers (you can see the initial upload impact in green and then the subsequent download impacts in yellow), it's by no means saturating the line or causing starvation.

[Screenshot 2023-07-11 at 7 43 40 PM]

In fact, the single-thread sync wasn't any more or less impactful than the other syncs using higher thread concurrency. And you can see I'm regularly hitting almost symmetrical upload and download speeds with my 1 Gb connection. I routinely have other network operations (many Python based, as b2 is) that come closer to saturating my network than these did, and they complete successfully and without retries.

ppolewicz commented 1 year ago

I don't have any access to the server infrastructure, but as far as I understand, the storage unit on which a group of your files is kept might have been undergoing critical fragment reconstruction, during which the performance of that storage unit would be degraded. This is not a problem specific to Backblaze B2 but to any storage based on erasure coding - B2 uses 17+3, so if a couple of drives in the same "stripe" die, it becomes pretty important to recover the missing fragments, and maybe (this is pure speculation, as I've never seen the server code) the restore operation is prioritized over user traffic, which in your case might have shown up as broken downloads.
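
(Quick arithmetic on what 17+3 erasure coding implies - illustrative only, not a description of Backblaze internals:)

    data_shards, parity_shards = 17, 3
    total_shards = data_shards + parity_shards      # 20 fragments per stripe
    overhead = total_shards / data_shards           # ~1.18x raw storage used
    print(f"{overhead:.2f}x storage overhead, survives {parity_shards} lost fragments")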

These things come to an end though - the data is reconstructed and the cluster performance recovers.

In order to confirm my suspicion you can try to download it now and see if it works. My guess is that when you run it today, it'll work fine.

Lusitaniae commented 1 year ago

Given the volume of my downloads, each sync run takes about an hour.

I made multiple attempts per day across different days, and the result was that the b2 CLI was "reliably failing" to complete a full sync.

I've moved the script to use s5cmd and it has been working so far.

Also, from discussing with the Backblaze team, it looks like my account is assigned to the smallest cluster they have in the EU region (and I'm the heaviest user there), and previously they asked me to slow down traffic as it was causing too much load on their systems.

tteggelit commented 1 year ago

@ppolewicz That wouldn't explain why, when I deleted the offending files and re-uploaded them (successfully, on the first try) with b2 sync, they continued to fail on an immediate subsequent b2 sync download with the default thread concurrency - but then immediately succeeded on my very next attempt using --threads 1. If the offending files were on devices that were rebuilding, it's highly unlikely the re-uploaded files would be placed on rebuilding devices, and that those devices just happened to finish rebuilding between my failed attempt and the immediately successful next one. Again, I don't know the storage device layout or how many storage devices are available for my use, but I'm highly skeptical it's just one. In any case, I seem to have a workaround in --threads 1. I would suggest making --threads 1 the default, erring on the conservative side, because failed downloads cost money. A user can then increase that value for more performance, at the risk of failed downloads (and added cost), if they so choose.

ppolewicz commented 1 year ago

@Lusitaniae a small cluster has a good chance of being impacted by rebuild (which takes some time). It might be that it finished rebuilding just as you were switching from cli to s5cmd and that is why it works now.

@tteggelit a default of 10 might not be the best setting; it's just something that was decided years ago. We'll be releasing CLI v4 soon, which will allow us to change the defaults, and this might be one of the changes we need to make (we'll also change the retry policy to be even more bulletproof). Thank you for your suggestion.

yeonsy commented 6 months ago

I had the same issue as others noted: sync downloaded the vast majority of files but kept hitting incomplete reads for a couple of larger files. I tried --downloadThreads 1 twice and that did NOT work; --threads 1, however, worked. I wasn't monitoring the network, so it's possible I hit momentary network instability, but I figured I'd add some more data to the thread. B2 CLI v3.15 on Ubuntu 22.04 LTS, gigabit fiber connection.

It's not clear why single threading seems to work, but perhaps the CLI should retry failed parallel downloads in single-threaded mode?
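
(A hedged sketch of that kind of fallback as a wrapper around the CLI - the b2 sync command and --threads flag are real, the wrapper itself is hypothetical:)

    import subprocess

    def sync_with_single_thread_fallback(src, dst):
        # Try a parallel sync first; if it fails, retry once with a single
        # thread, which several reports in this thread say succeeds.
        for threads in (10, 1):
            result = subprocess.run(["b2", "sync", "--threads", str(threads), src, dst])
            if result.returncode == 0:
                return
        raise RuntimeError("b2 sync failed even with --threads 1")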

ppolewicz commented 6 months ago

@yeonsy can you please try to reproduce it again with logs enabled?

yeonsy commented 6 months ago

@ppolewicz I have a log output from one of the failed runs but I don't have the time to properly redact it at the moment. If you email me at [email removed] I can mail you the log.

ppolewicz commented 6 months ago

Ok, sent you an email

fluffypony commented 3 months ago

Just wanted to add a data point - I'm moving a Windows Server box from a Hetzner box in Estonia to a new box at a different DC in Prague, and trying to transfer 45TB between the two over SMB is unusable. I've specifically created a new Backblaze account using the EU region for latency and speed, and both boxes are like 1 or 2 hops away from the Backblaze DC.

b2 sync on the Hetzner box is flying, uploading at about 2 Gbps sustained. As folders have finished, I've kicked off a sync on the new box with 40 threads, which saturates the line at 5 Gbps as it downloads. Whilst the initial download gets about 70%-80% of the files, it seems to fail on a ton of large (4 GB+) files. It's impossible to predict and SUPER inconsistent - a 5.73 TB folder got to 4.43 TB on the first pass, and then up to 4.65 TB on a subsequent pass.

Both of these machines are high performance servers, in reliable data centers, on 10Gbps unmetered ports. They are directly connected to the Internet with live ipv4 addresses. There is no NAT or firewall or anything interfering here - and the box in Prague is a brand new installation, so there's zero chance something could be awry there. I'm running the latest version of b2 as installed by pip (4.0.1).

I then kicked off a new pass with --threads 1, which is slow-ish - downloading at about 450 Mbps - and the few files it downloaded before I ctrl+c'd it seemed to be fine. I'm not mad if it takes ~5 hours to do this final pass for the last 1 TB in this folder, but what happens when I do the last folder (34.3 TB)? I don't really want this to take 3 weeks, when it seems like it should be fixable by running many threads but no longer chunking the downloads.

As an extra experiment, I installed 1.3.6 in a Python venv and ran it with 40 threads. Using a version this old is honestly a little nuts, but it works at roughly the same speed as (even a little slower than) the --threads 1 run.

Finally, I ran the current version with 40 threads, but I included --max-download-streams-per-file 1, and I'm back at 5Gbps with NO TruncatedOutput errors.

There is DEFINITELY something wrong with the multiple streams per file downloading, unfortunately. I'm not sure if it's just on large files or what the deal is, but given that 1.3.6 works perfectly, and restricting download streams to 1 works perfectly, it's safe to say that this functionality isn't working as expected.

Don't worry @ppolewicz - as you pointed out, even stable open-source software isn't inherently bug free. I'm confident you'll figure this out, and I for one appreciate all the effort you put into this piece of software!

ppolewicz commented 3 months ago

@fluffypony this can, unfortunately, be caused by a number of reasons and I cannot address it without figuring out what the reason is. Your story gives me important insight, thank you very, very much for taking the time to write it down.

I think the code of ParallelDownloader is correct. At one time I was able to pull ~40 Gbps using a single instance of the b2 CLI running under a special interpreter, but that was on a low-latency network. Ideally the code would figure out what the situation is and automatically adjust the thread count and the streams per file; ideally we'd also be able to use a (non-existent as of yet) resource on the server to continue a broken download.

Now, what happened in your case, is hard to say - I would really love to take a look at the logs of the failing operation to see why it stopped and what the problem was. Is it possible for you to re-trigger the error with logs enabled?

fluffypony commented 3 months ago

Now, what happened in your case, is hard to say - I would really love to take a look at the logs of the failing operation to see why it stopped and what the problem was. Is it possible for you to re-trigger the error with logs enabled?

Sure thing - I have re-run it with --debug-logs, how can I get this to you privately? Should I use the gmail address attached to your git commits?

mjurbanski-reef commented 3 months ago

@fluffypony you can send it to me; and yes, email from the commit author field will be fine

fluffypony commented 3 months ago

Sent! I used 100 threads to try to induce it into failing as much as possible 😆

mjurbanski-reef commented 3 months ago

Received 38MB of logs. Thank you. I will review them tomorrow.

yury-tokpanov commented 1 month ago

I am using the latest b2 client, and I consistently get problems downloading large files (100GB+) using b2 sync on multiple systems located in different regions.

I tried using --threads 1, but it runs really slowly at 20 MB/s - imagine trying to download a 3 TB file at that rate. And even with a single thread I occasionally get errors.

Any recommendations for alternatives? I'm currently exploring rclone sync, or simply mounting with s3fs and then rsyncing from there.

mjurbanski-reef commented 1 month ago

Please share the logs so we can confirm whether it is the same bug. Hopefully an extra data point will make it easier to debug.

yury-tokpanov commented 1 month ago

I removed locations from the logs.

This happened when trying to sync a folder containing two files, 970 GB and 3.4 GB (using the default 10 threads). Even the smaller file failed.

As I said, I see this all the time across many clusters. b2 is the only service I consistently encounter issues with when downloading files. There are no issues with uploads.

Setting threads to 1 works most of the time, but it is super slow.

ERROR:b2sdk._internal.sync.action:an exception occurred in a sync action
Traceback (most recent call last):
  File "<...>/b2sdk/_internal/sync/action.py", line 55, in run
    self.do_action(bucket, reporter)
  File "<...>/b2sdk/_internal/sync/action.py", line 329, in do_action
    downloaded_file.save_to(download_path)
  File "<...>/b2sdk/_internal/transfer/inbound/downloaded_file.py", line 282, in save_to
    return self.save(file, allow_seeking=allow_seeking)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<...>/b2sdk/_internal/transfer/inbound/downloaded_file.py", line 236, in save
    self._validate_download(bytes_read, actual_sha1)
  File "<...>/b2sdk/_internal/transfer/inbound/downloaded_file.py", line 177, in _validate_download
    raise TruncatedOutput(bytes_read, self.download_version.content_length)
b2sdk._internal.exception.TruncatedOutput: only 3255188397 of 3616794702 bytes read
b2_download(<...>, 4_z5c2d97bf607f9a4e88c40e17_f225ab09c30dd54d6_d20240711_m205528_c004_v0402008_t0019_u01720731328109, <...>, 1720587476958): TruncatedOutput() only 3255188397 of 3616794702 bytes read  
WARNING:urllib3.connectionpool:Connection pool is full, discarding connection: f004.backblazeb2.com. Connection pool size: 10
dnload <...>
ERROR: Incomplete sync: sync is incomplete
ppolewicz commented 1 month ago

This only happens if an individual part is retried 20 times without success. What you are seeing at the end is a sync file failure, which just states that the file couldn't be fully downloaded, but there are at least 20 exceptions prior to that (probably many more) stating the real failure reason.

yury-tokpanov commented 1 month ago

That is the only thing I see in the console. Where can I see the full log?

Also, is it possible to manually control the number of retries, backoff time, etc?

Also, there are no issues with uploading large files. Why would downloading them be so different?

Our clusters are located in different places across North America, and we usually don't have any issues downloading from GCS or AWS S3. But the failure rate downloading with b2 sync is close to 100%, unless we use --threads 1, which then becomes very slow at 20 MB/s and can still fail (though rarely).

Do you have a recommendation on alternative ways we can download from b2 buckets? I've just started exploring rclone or mounting a bucket using s3fs.

ppolewicz commented 1 month ago

@yury-tokpanov see https://github.com/Backblaze/B2_Command_Line_Tool?tab=readme-ov-file#detailed-logs

B2 buckets can have S3 interface compatibility, so you can try that. Let us know if that works, though at this point I'm pretty sure it's a storage device or a network issue, because you are failing with 1 thread and there are 20 hardcoded retries per file. Or maybe it's a bug - if you can find out from the logs what is causing it, we'll fix it; it's just not possible to fix without being able to reproduce it or see the logs. I'm sure you understand.
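
(If you try the S3-compatible route, a hedged sketch with boto3 - the key pair, bucket, object key and endpoint below are placeholders; the real S3 endpoint for a bucket is shown in the B2 web console:)

    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="https://s3.us-west-004.backblazeb2.com",  # placeholder region
        aws_access_key_id="<application_key_id>",
        aws_secret_access_key="<application_key>",
    )
    # Plain single-stream GET of one object to a local file.
    s3.download_file("bucket-name", "path/to/large-file.tar.bz2", "/tmp/large-file.tar.bz2")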

As far as I know, the fastest s3 downloader out there is s5cmd.

mjurbanski-reef commented 1 month ago

@fluffypony @yury-tokpanov I'm happy to say, the new B2 CLI 4.1.0 release fixes the reported problem by properly retrying in case of middle-of-the-data-stream errors; hence it will work correctly even when the network is congested due to multiple concurrent connections.

fluffypony commented 1 month ago

@fluffypony @yury-tokpanov I'm happy to say, the new B2 CLI 4.1.0 release fixes the reported problem by properly retrying in case of middle-of-the-data-stream errors; hence it will work correctly even when the network is congested due to multiple concurrent connections.

Amazing - thank you so much for all your effort with this!

yury-tokpanov commented 1 month ago

@fluffypony @yury-tokpanov I'm happy to say, the new B2 CLI 4.1.0 release fixes the reported problem by properly retrying in case of middle-of-the-data-stream errors; hence it will work correctly even when the network is congested due to multiple concurrent connections.

Thank you! We are going to test it!