TTLabs / EvaporateJS

Javascript library for browser to S3 multipart resumable uploads

Can cryptoMd5Method be called more than once for a given chunk? #162

Closed: smedstadc closed this issue 8 years ago

smedstadc commented 8 years ago

Can cryptoMd5Method be called more than once for a given chunk if we're on a sad path?

I was thinking of piggybacking on it to build a checksum for an entire file incrementally.

Edit: I should also ask whether I can assume that chunks will be hashed with the method sequentially.

bikeath1337 commented 8 years ago

@smedstadc, the purpose of the method is to do the checksum work for each chunk. That feature was added in v1.0.0. Calculating the checksum is a CPU-intensive task, so we only do it once.

Why do you want to redo this work? The S3 multipart upload feature already does this, calculating the checksum part by part (cooperatively), which is one of the most important, and rare, features of EvaporateJS.

smedstadc commented 8 years ago

I want to store an MD5 digest of the entire file's contents for deduplication. The ETag that AWS returns for multipart uploads is derived from the concatenated part MD5s, which makes it dependent on chunk size and requires you to use their algorithm to reproduce it.
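To illustrate why that ETag depends on chunk size, here is a minimal Node.js sketch of the commonly observed derivation: take the MD5 of the concatenated binary part digests and append the part count. The function name `multipartETag` is hypothetical, and the format is an assumption based on observed behavior for uploads without KMS encryption, not a documented AWS contract.

```javascript
const crypto = require('crypto');

// Hypothetical helper: reproduces the commonly observed multipart ETag format.
// Assumes `data` is a non-empty Buffer and `partSize` is the chunk size in bytes.
function multipartETag(data, partSize) {
  const partDigests = [];
  for (let offset = 0; offset < data.length; offset += partSize) {
    const part = data.subarray(offset, offset + partSize);
    // Keep each part's MD5 as raw bytes, not hex.
    partDigests.push(crypto.createHash('md5').update(part).digest());
  }
  // Hash the concatenated raw digests, then append "-<number of parts>".
  const combined = crypto
    .createHash('md5')
    .update(Buffer.concat(partDigests))
    .digest('hex');
  return `${combined}-${partDigests.length}`;
}

// Same bytes, different part sizes: the two values differ.
const sample = Buffer.alloc(10, 1);
console.log(multipartETag(sample, 4));
console.log(multipartETag(sample, 5));
```

Because the part boundaries feed into the result, the same file uploaded with two different part sizes produces two different values, which is why this is a poor deduplication key.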

If I can be sure that evaporate.js hashes each chunk in sequence, once and only once, I want to try creating a hasher object and updating it with the data that gets passed into cryptoMd5Method before returning the chunk digest. This would essentially require each chunk to be hashed twice, but I'm betting the user will spend more time waiting for the network than on chunk hashing.

I don't think that I can achieve this goal without duplicating work or relying on an undocumented contract, so I may settle on a different method in the end.
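For reference, the piggybacking idea described above could be sketched roughly as follows. This is a sketch under assumptions, not EvaporateJS's API: `createTrackingMd5Method` is a hypothetical helper, the method is assumed to receive the part's bytes and return a base64 MD5 digest, and the whole-file digest is only correct if every chunk is passed in exactly once and in file order, which is the undocumented contract in question. Node's `crypto` module stands in for whatever incremental MD5 implementation would be used in the browser.

```javascript
const crypto = require('crypto');

// Hypothetical helper, not part of EvaporateJS.
function createTrackingMd5Method() {
  // Running hash over the whole file. Only valid if chunks arrive
  // sequentially and each chunk is hashed exactly once (an assumption).
  const fileHash = crypto.createHash('md5');

  return {
    // Assumed signature: takes the part's bytes (ArrayBuffer, Uint8Array
    // or Buffer) and returns the base64 MD5 digest of that part.
    cryptoMd5Method(data) {
      const bytes = Buffer.from(data);
      fileHash.update(bytes); // second pass over the same bytes
      return crypto.createHash('md5').update(bytes).digest('base64');
    },
    // Call once, after the last chunk has been hashed.
    fileDigest() {
      return fileHash.digest('hex');
    },
  };
}

const tracker = createTrackingMd5Method();
tracker.cryptoMd5Method(Buffer.from('hello '));
tracker.cryptoMd5Method(Buffer.from('world'));
console.log(tracker.fileDigest()); // whole-file MD5 of "hello world"
```

If a chunk were ever hashed twice or out of order, the running digest would be silently wrong, so a variant that does not depend on the library's call pattern (hashing the file's slices independently) would be the safer design.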

bikeath1337 commented 8 years ago

I'm still not sure whether what you want to do is something EvaporateJS already supports or whether you are trying to do something else.

Prior to our support for MD5 checksums, we relied on the ETag; however, since Evaporate now supports checksums on the uploaded parts, there is no need to compare ETags, as AWS will inform us if the checksums do not match.

To achieve that, we calculate the checksum once and store it internally so that it never has to be recalculated.
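The compute-once behavior described above amounts to caching the digest on the part. The following is only an illustrative sketch of that pattern: `getPartMd5`, `md5Digest`, and `hashCount` are hypothetical names, and EvaporateJS's internal structures may look quite different.

```javascript
const crypto = require('crypto');

// Hypothetical part record and helper; not EvaporateJS's actual internals.
function getPartMd5(part, data) {
  if (part.md5Digest === undefined) {
    // First request for this part: do the CPU-intensive work and cache it.
    part.md5Digest = crypto
      .createHash('md5')
      .update(Buffer.from(data))
      .digest('base64');
    part.hashCount = (part.hashCount || 0) + 1;
  }
  // Later requests reuse the stored digest.
  return part.md5Digest;
}

const part = { partNumber: 1 };
const bytes = Buffer.from('chunk contents');
getPartMd5(part, bytes);
getPartMd5(part, bytes);
console.log(part.hashCount); // the hash was computed only once
```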

All that being said, EvaporateJS should just work as documented: reliable multipart uploads with complete source-to-S3 copy fidelity.