Backblaze / B2_Command_Line_Tool

The command-line tool that gives easy access to all of the capabilities of B2 Cloud Storage

`b2 upload-file` does not calculate SHA1 automatically for large files if it's not been provided by `--sha1`. #539

Open eonil opened 5 years ago

eonil commented 5 years ago

I know this issue has been reported and declined multiple times for various reasons, but I would like to give more reasons why I need this: automatic calculation of SHA1 by default.


I think one important thing is missing in older threads.

We upload files in order to use them later. An upload that finishes without error is only half the job; it does not help me establish integrity once I have downloaded the file again. The integrity scenario should also cover the file after it has been downloaded.

After downloading, the only way to verify the integrity of the downloaded file is its SHA1 hash. A missing SHA1 means there is no way to verify whether the downloaded file is intact or damaged. (Please correct me if I'm wrong!) Therefore, I think Backblaze should require a SHA1 for every uploaded file, and a missing SHA1 should be treated as an incomplete upload. (Now I have started to worry about how the Backblaze personal/business backup products deal with the integrity of restored backups.)
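A minimal sketch of the client-side check described above, assuming the expected SHA1 is available as a 40-character hex string (for example, the `contentSha1` value reported by `b2 get-file-info`). The function names `sha1_of_file` and `verify_download` are illustrative and not part of the b2 tool.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA1 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha1):
    """Return True only if a usable checksum exists and matches the file."""
    if not expected_sha1 or expected_sha1 == "none":
        # No checksum was stored, so integrity cannot be confirmed.
        return False
    return sha1_of_file(path) == expected_sha1.lower()
```

With a stored value of `"none"`, the check can only report failure, which is the gap this issue is about.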

IMO, with that in mind, the --sha1 option should become an override rather than an optional attachment. Users who care about "full integrity" want the b2 command to calculate the SHA1 automatically whenever the --sha1 override has not been provided.

If you do not provide a SHA1 (whether verified by the server or supplied by you), your upload is incomplete.


If the b2 command line cannot accept a behavioral change, I think you could provide an extra switch like --autocalc-sha1.


Or Backblaze could provide access to the SHA1 hashes of the originally uploaded segments (if they exist...). I don't know how files are stored on the server side, but if Backblaze keeps the SHA1 hash and byte range of each original segment, verification on the client side is easy. If this is possible, everything can be done simply and beautifully, and nothing more is really required.

ppolewicz commented 5 years ago

sha1 is calculated by b2 CLI by default. You may bypass this behavior and provide the checksum by yourself, but it is not possible to not have the checksum at all.

Checksums of fragments and checksum of the whole file are sort of equivalent, you can use either one to ensure data integrity.

eonil commented 5 years ago

It doesn't seem to calculate the SHA1 for large files. After I uploaded a large file without the --sha1 parameter, I see "contentSha1": "none" in the output of b2 get-file-info. If it were calculated "by default", it shouldn't be none, should it? So the b2 command doesn't seem to calculate or attach a SHA1 for large files, whatever the reason.

bwbeach commented 5 years ago

The Backblaze APIs require that the metadata for a large file be set when you call b2_start_large_file, which happens before uploading any of the parts. Calculating the SHA1 for the file would require reading the entire file before starting the upload. So I think it could annoy people to do it by default.

Also, the Backblaze APIs hide the part boundaries once the file has been uploaded. (The S3 APIs do the same thing.) This allows the system to restructure the file on the back end as needed. It would make sense, though, to extend B2 to store the uploaded part sizes and checksums for integrity checking on download.

Adding an option to this command-line tool to always compute the SHA1 for the entire file, and set large_file_sha1 in the file info, would make sense.
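A rough sketch of what such an option might do, assuming the whole-file SHA1 is computed in one pass before the upload starts and then stored under the `large_file_sha1` key of the file info. The helper name `build_large_file_info` is hypothetical, and the request to `b2_start_large_file` itself is not shown.

```python
import hashlib


def build_large_file_info(path, extra_info=None, chunk_size=8 * 1024 * 1024):
    """Read the file once and return file info containing large_file_sha1."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    # Copy any caller-supplied entries, then add the whole-file checksum.
    file_info = dict(extra_info or {})
    file_info["large_file_sha1"] = digest.hexdigest()
    return file_info
```

The cost is one extra full read of the file before the first part is sent, which is the trade-off described above.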

@eonil - Would you be interested in making that change?

eonil commented 5 years ago

I am a user, and I am more annoyed that your tool does not try to provide full-cycle integrity (local -> remote -> another local) "by default". I don't understand why you think people would be annoyed by a "slow and safe" default when they can override it to "fast but unsafe" by providing --sha1 none. Are there people who would rather lose data for the sake of transmission speed? What's the point of the speed?

IMO, the "default" should reflect the company's philosophy, and the options should serve users' demands. The current default of the b2 command is very confusing.

Today I tried another upload with two 1GB files, and I discovered that b2 does not attach the SHA1 even when the --sha1 xxxxxxx... switch is explicitly provided. The upload finished without error and showed me a file ID. I queried b2 get-file-info with the file ID, and it returned "contentSha1": "none" for both files. It seems there are more issues.

I'm not interested in patching this codebase or in Python. I am going to write my own uploader. Thank you for the suggestion. I hope the issues are at the tool level rather than the API level.


By the way, I really hope Backblaze will provide segment-based checksums. If you decouple the segment sizes used for hashing and for uploading, hash computation won't be duplicated. And since the server verifies the checksum of each segment, this eliminates the possibility of the server holding a wrong checksum.

Dropbox's "checksum of checksums" method could also be considered, though I am not sure how safe it is.
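For illustration, a "checksum of checksums" can be sketched like this: hash each fixed-size block, then hash the concatenation of the raw block digests. The block size and the use of SHA1 are assumptions chosen to match this thread; Dropbox's documented content hash has its own block size and hash function, and nothing here reflects how B2 stores data. The stream is assumed to be a regular file or `io.BytesIO`, where `read(n)` returns a full block except at the end.

```python
import hashlib


def checksum_of_checksums(stream, block_size=4 * 1024 * 1024):
    """Hash each block, then hash the concatenated raw block digests."""
    outer = hashlib.sha1()
    while True:
        block = stream.read(block_size)
        if not block:
            break
        # Feed the 20-byte binary digest of this block into the outer hash.
        outer.update(hashlib.sha1(block).digest())
    return outer.hexdigest()
```

The result depends on the block size, so both sides have to agree on it before the values can be compared.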

ppolewicz commented 5 years ago

It's a fair point. For small files Local -> Remote -> Local is fully checked, but for large files only Local -> Remote is checked. That is indeed the default behavior, and it is also the only behavior currently supported by the CLI. We still have the TCP checksum, and because HTTPS is used an error would probably trigger a fault in decryption, but if we assume that works, sha1 checksums would not be needed at any point.

There are a couple of problems with using a sha1 checksum for download integrity verification:

However, the B2 backend stores the part checksums (I hope!). If those were exposed, we could successfully verify the integrity of the file upon download. Moreover, in such a scenario the download process could be (somewhat) optimized to parallelize hashing and downloading, avoiding an additional read. A fully optimized transferer implementation, which sets the download chunk size to (a 1/N fraction of) the server-side chunk size, would also be possible and settable as an option; such a strategy improves performance but potentially consumes slightly more transaction tokens than the non-optimized behavior.

B2 should retain the ability to restructure the file internally: the chunk size / amount and the respective checksums could change one day, but that would not really impact the checksum verification as long as the client receives a consistent snapshot of the checksums at all times (even during the restructuring process).

eonil commented 5 years ago

Yup. And the downloader can also verify data integrity incrementally. This matters for resource-constrained platforms like mobile apps, because hashing multiple GB takes a long time and is more likely to be interrupted. AFAIK, it's not easy to serialize hasher state, so I'd like to hash segments smaller than 64MB. If you care about mobile devices, this is an important factor.

Also, Backblaze doesn't have to recalculate checksums as long as it keeps the original segment checksums, even if the segments are restructured. I hope they keep the hash segment sizes smaller than 64MB even if they restructure the segments because, as I said above, dealing with "big" stuff on mobile is really painful. The bigger it is, the more pain.
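A minimal sketch of incremental verification, assuming the client has a fixed segment size and a list of per-segment SHA1 hex digests. B2 does not expose these, as discussed above, so both are hypothetical inputs, and the names `verify_segments` and `SegmentMismatch` are made up for the example. Each segment is checked as soon as it is read, so corruption is detected early and no hasher state has to survive across segments.

```python
import hashlib


class SegmentMismatch(Exception):
    """Raised when a segment is missing, unexpected, or corrupt."""


def verify_segments(stream, segment_size, expected_sha1s):
    """Yield (index, bytes) for each verified segment of the stream."""
    index = 0
    while True:
        segment = stream.read(segment_size)
        if not segment:
            break
        if index >= len(expected_sha1s):
            raise SegmentMismatch(f"unexpected extra segment {index}")
        if hashlib.sha1(segment).hexdigest() != expected_sha1s[index]:
            raise SegmentMismatch(f"segment {index} is corrupt")
        yield index, segment
        index += 1
    if index != len(expected_sha1s):
        raise SegmentMismatch(
            f"expected {len(expected_sha1s)} segments, got {index}"
        )
```

A caller would write each yielded segment to disk and, after an interruption, resume from the last verified index instead of re-hashing the whole file.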

With the current big file behavior, the best way to write stable and reliable mobile apps is to ignore B2's big file system and handle segmentation completely on the client side.

kent2cky commented 3 years ago

I have spent an awful lot of time trying to upload files to b2 using client-side ajax requests (vue-dropzone.js), and even though I supplied the file's valid sha1 checksum, the b2 server still responds with "checksum did not match data received" with status code 400. I've checked and rechecked the checksums with all the tools I have and I'm still not able to trace the source of the error. It's as if something happens to the file while it's in transit.

I've uploaded the same files using the command line tool and it works fine but when I upload via ajax using the exact same sha1 checksum it doesn't work.

My questions are:

  1. Does b2 even allow file uploads through ajax?
  2. If it does allow uploads via ajax, then what am I doing wrong?
  3. Do the files remain valid when uploaded using "X-Bz-Content-Sha1": "do_not_verify"? I've tried that, only to get invalid files when I downloaded them back.
  4. Are there other things I need to know about uploading files to b2 using ajax requests?

Please inspect my ajax code and see if I got anything wrong:

```js
sending(file, xhr, formData) {
    // This function runs for each file right before it is sent by dropzone.
    // This is a good opportunity to insert file-specific values,
    // in this case the file's upload url, name and auth token.
    let fileName = '';
    console.log('this is file type', file.type);
    if (file.type.includes('image')) {
        fileName = `images/${uuid.v1()}.png`;
    } else if (file.type.includes('video')) {
        fileName = `videos/${uuid.v1()}.${file.type.split('/')[1]}`;
    }

    const url = appConfig.serverAddress + '/catalog/submitFiles';
    console.log('this is sha1_hash', this.uploadInfo.sha1_hash);
    // open the xhr request and insert the file's upload url here
    xhr.open('Post', this.uploadInfo.url, true);

    // set b2's mandatory request headers
    // xhr.setRequestHeader(
    //  'Authorization',
    //  'Bearer ' + store.getters.getUserIdToken,
    // );
    xhr.setRequestHeader('Authorization', this.uploadInfo.authorizationToken);
    xhr.setRequestHeader('X-Bz-Content-Sha1', this.uploadInfo.sha1_hash);
    xhr.setRequestHeader('X-Bz-File-Name', fileName);
    xhr.setRequestHeader('Content-Type', 'b2/x-auto');

    formData = new FormData();
    formData.append('files', file);

    // the rest will be handled by dropzone's upload pipeline
}
```

ppolewicz commented 3 years ago

Hey, you've posted a comment on a B2 CLI issue; the CLI is written in Python, but you posted JavaScript code, so I can't really help you much. What I would suggest is to try uploading an empty file and inspecting the communication between the server and the browser using the browser's F12 network tab. If you make your request identical to what b2cli does, it is guaranteed to work (the server only knows what you tell it, so it doesn't know what it is talking to).
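One difference worth checking when comparing the two requests, offered as an assumption rather than a confirmed diagnosis: a body sent as `multipart/form-data` (which is what a `FormData` object produces) wraps the file bytes in boundary lines and part headers, so the SHA1 of the request body no longer equals the SHA1 of the file. The Python sketch below builds a simplified multipart body by hand to show the mismatch; the boundary string, field name, and file name are made up for the example.

```python
import hashlib


def multipart_body(field, filename, payload, boundary="example-boundary"):
    """Build a simplified single-part multipart/form-data body."""
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + payload + tail


file_bytes = b"example file contents"
raw_sha1 = hashlib.sha1(file_bytes).hexdigest()
wrapped_sha1 = hashlib.sha1(
    multipart_body("files", "a.png", file_bytes)
).hexdigest()

# The wrapped body contains extra bytes, so the two digests differ.
print(raw_sha1 == wrapped_sha1)  # False
```

If the server hashes the bytes it receives as the request body, a checksum computed over the file alone would not match in that case, which may be worth ruling out in the network tab.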