calvinmetcalf / cabs

Content Addressable Blob Store

switch to sha1 #4

Closed max-mapper closed 10 years ago

max-mapper commented 10 years ago

to further copy git (why not) this switches to 40 character hex sha1s for folders + files

also this switches the event-stream dep with stream-combiner (does the same thing, is less grab-baggy than event-stream)

we should probably make the getHash and getPath functions configurable at some point
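A minimal sketch of what a configurable hash function could look like, using Node's built-in `crypto` module. The `makeGetHash` factory is a hypothetical name for illustration, not the existing cabs API; only the `getHash` name comes from the comment above.

```javascript
const crypto = require('crypto');

// Hypothetical factory (not part of cabs): returns a getHash(data) function
// bound to the chosen algorithm, so callers can swap sha1 for something else.
function makeGetHash(algorithm = 'sha1') {
  return function getHash(data) {
    // digest('hex') yields two hex characters per byte of digest.
    return crypto.createHash(algorithm).update(data).digest('hex');
  };
}

const getHash = makeGetHash('sha1');
console.log(getHash('hello').length); // 40 hex characters for sha1
```

Passing a different algorithm name, e.g. `makeGetHash('sha256')`, would give 64-character keys instead.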

calvinmetcalf commented 10 years ago

i did half of this already so I'm going to need to manually bring your changes in. one thing to think about is that git isn't all that good with large files, so we may want to switch from a single folder level to a sensible number of folder levels (i.e. fewer than 35 like we had before, but more than one)
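To make the folder-level idea concrete, here is a rough sketch of mapping a hex hash to a nested path with a configurable number of levels. The `getPath` signature, the `depth` and `width` parameters, and the choice to use the remaining characters as the file name (as git does with its single level) are all assumptions for illustration, not the current cabs behaviour.

```javascript
const path = require('path');

// Hypothetical helper: split the first `depth * width` hex characters of the
// hash into `depth` directory names, and use the rest as the file name.
function getPath(hash, depth = 1, width = 2) {
  if (depth * width >= hash.length) {
    throw new RangeError('depth * width must be smaller than the hash length');
  }
  const parts = [];
  for (let i = 0; i < depth; i++) {
    parts.push(hash.slice(i * width, (i + 1) * width));
  }
  parts.push(hash.slice(depth * width));
  return path.join(...parts);
}

// With depth 1 this mirrors git's layout: 'aa/f4c61d...'.
console.log(getPath('aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d', 2));
```

With two hex characters per level, each directory holds at most 256 subdirectories, so every extra level divides the number of files per leaf directory by 256.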

calvinmetcalf commented 10 years ago

manually merged

max-mapper commented 10 years ago

agreed, I think the depth and hash functions should be pluggable

also @dominictarr revealed that he wrote this recently https://github.com/dominictarr/content-addressable-store and pushed it to git yesterday. cabs is pure streams whereas content-addressable-store isn't quite as streamy, but they are similar.

one thing cabs should steal though is https://github.com/dominictarr/content-addressable-store/blob/master/index.js#L77 and then https://github.com/dominictarr/content-addressable-store/blob/master/index.js#L89, which will make sure that corrupted files never get written to the blob folder
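One common way to get that guarantee is to write to a temporary file while hashing, and only move the file into the blob folder once the whole input has been consumed. The sketch below illustrates that general pattern; it is not a copy of the linked lines, and `putBlobSync`, the `tmp`/`blobs` folder names and the iterable-of-Buffers input are all assumptions made for the example.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Hypothetical helper, not the cabs or content-addressable-store API.
// `chunks` is any iterable of Buffers; `root` is the store directory.
function putBlobSync(root, chunks, algorithm = 'sha256') {
  const tmpDir = path.join(root, 'tmp');
  const blobDir = path.join(root, 'blobs');
  fs.mkdirSync(tmpDir, { recursive: true });
  fs.mkdirSync(blobDir, { recursive: true });

  // Write under a random temporary name first, hashing as we go.
  const tmpFile = path.join(tmpDir, crypto.randomBytes(8).toString('hex'));
  const hash = crypto.createHash(algorithm);
  const fd = fs.openSync(tmpFile, 'w');
  try {
    for (const chunk of chunks) {
      hash.update(chunk);
      fs.writeSync(fd, chunk);
    }
  } catch (err) {
    // The input failed part-way: discard the partial file and rethrow.
    fs.closeSync(fd);
    fs.unlinkSync(tmpFile);
    throw err;
  }
  fs.closeSync(fd);

  // Only a complete file is ever moved into the blob folder.
  const key = hash.digest('hex');
  fs.renameSync(tmpFile, path.join(blobDir, key));
  return key;
}
```

Keeping `tmp` and `blobs` under the same root means the rename usually stays on one filesystem, where it is typically atomic on POSIX systems. A streaming version would follow the same shape, with the rename happening in the stream's finish handler.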

calvinmetcalf commented 10 years ago

I'll check that out later. Also, we probably want the stream methods to be part of the class, with the current read and write methods being shortcuts.

max-mapper commented 10 years ago

also I ran some basic benchmarks:

for a ~700mb AVI:

- calculating SHA-1 in node takes 2950ms
- copying the file in node takes 1704ms
- but cabs'ing the file takes 30368ms (with current defaults)

the problem is that if I bump up the limit to, say, 1gb, then byte-stream will buffer 1gb which is baaad. maybe we need to rethink our approach here. I'm willing to bet that if we just got rid of the file limit then it would be faster + simplify things a lot more (e.g. store entire blobs as single files)

of course this means that for super huge files we might run into limits, but i'm not as concerned with that as I am the number of files-per-directory. i'd rather have cabs be fast
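For anyone wanting to reproduce this kind of comparison, a rough timing harness might look like the following. It hashes a file in fixed-size chunks so memory use stays constant regardless of file size. `hashFileSync` and `timeMs` are hypothetical helper names, the 64 KiB chunk size is an arbitrary choice, and the numbers it produces will depend on the machine and disk, so they are not expected to match the figures above.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hash a file by reading it through one reusable buffer, so only
// `chunkSize` bytes are held in memory at a time.
function hashFileSync(file, algorithm = 'sha1', chunkSize = 64 * 1024) {
  const hash = crypto.createHash(algorithm);
  const buf = Buffer.alloc(chunkSize);
  const fd = fs.openSync(file, 'r');
  try {
    let n;
    while ((n = fs.readSync(fd, buf, 0, chunkSize, null)) > 0) {
      hash.update(buf.subarray(0, n));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest('hex');
}

// Run a synchronous function and report how long it took in milliseconds.
function timeMs(fn) {
  const start = process.hrtime.bigint();
  const result = fn();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  return { ms, result };
}

// Example usage (file names are placeholders):
// console.log(timeMs(() => hashFileSync('movie.avi')).ms);
// console.log(timeMs(() => fs.copyFileSync('movie.avi', 'copy.avi')).ms);
```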

calvinmetcalf commented 10 years ago

My view is that there are 3 layers: the basic layer for dealing with blocks, the middle layer for streaming, and the high-level layer for dealing with editing and deleting, especially if blocks are shared (not done). The overhead of chunking makes sense if pieces will be edited, but not if they are going to be static; for large static blobs, the upper level could skip the middle level.

dominictarr commented 10 years ago

You should definitely have a pluggable hash function. sha1 should not be used in new systems - weaknesses have been found that mean you can generate collisions in 2^52 evaluations (avg). This is infeasible currently, but in a few years it won't be.

estimated time to generate sha1 collision: https://www.schneier.com/blog/archives/2012/10/when_will_we_se.html

weakened to 2^52 https://www.schneier.com/blog/archives/2009/06/ever_better_cry.html

If sha1 hadn't been weakened then it would be 2^80 evaluations - to put this in perspective, 2^80 / 2^52 = 2^28, so the weakened attack is about 268 million times easier. So if it cost 50k to generate 2^52 hashes, then 2^80 hashes would cost 50k * 268m = about 13 trillion, which is roughly 1/6 of the total world gdp.
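The ratio is easy to check exactly with BigInt arithmetic. The 50k cost figure is the hypothetical one from the comment above, not a measured price.

```javascript
// Work factors for a sha1 collision: weakened attack vs. the ideal birthday bound.
const weakened = 2n ** 52n;
const ideal = 2n ** 80n;

// How many times more work the ideal bound requires: 2^80 / 2^52 = 2^28.
const factor = ideal / weakened;

// Assumed cost (in dollars) of 2^52 evaluations, taken from the comment above.
const costWeakened = 50000n;
const costIdeal = costWeakened * factor;

console.log(factor.toString());    // 268435456
console.log(costIdeal.toString()); // 13421772800000
```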

Using sha1 is acceptable if you need to be backwards compatible with other systems currently in use - but if you are building something that you hope may be in use (or future revisions of it may be in use) in 20 years then you should not use sha1.

sha256 is okay, although it's possible to do a length extension attack, if you know sha256(X) you can calculate sha256(X + foo) even if you don't know what X is. This can be avoided if you use double sha256: sha256(sha256(X))
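A small sketch of double sha256 with Node's `crypto` module. The helper names are made up for the example. Note that the second pass hashes the raw 32-byte digest of the first pass, not its hex string.

```javascript
const crypto = require('crypto');

// Single sha256, returning the raw digest as a Buffer.
function sha256(data) {
  return crypto.createHash('sha256').update(data).digest();
}

// sha256(sha256(X)): the outer hash only ever sees a fixed-length 32-byte
// input, so knowing the result does not let an attacker extend X.
function doubleSha256Hex(data) {
  return sha256(sha256(data)).toString('hex');
}

console.log(doubleSha256Hex('hello').length); // 64 hex characters
```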

dominictarr commented 10 years ago

Also, if you want to make a performant blob store, this has some very promising ideas: http://www.youtube.com/watch?v=T4DgxvS9Xho

dominictarr commented 10 years ago

@calvinmetcalf you mention "editing" - what do you mean here? I'm confused, because you can't edit in a content addressable store - because changing the file means that the hash is now different.

calvinmetcalf commented 10 years ago

Right, editing would mean replacing. We originally had sha3xx (blanking on the number this second); sounds like we might want to go back to that as a default but have it be choosable.

dominictarr commented 10 years ago

aha so you mean removing something and adding a new thing? sha256 is good, and the key isn't too long, and there are reasonable implementations in pure js in case you want to run in the browser (if that is a design goal), although blake2s is better for that.

calvinmetcalf commented 10 years ago

ok guys I made it default to sha256, but since @maxogden wants to focus on performance for his application, I made everything configurable, including the folder depth (default 3).
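As an illustration of what "everything configurable" with those defaults might look like, here is a hedged sketch of option resolution. The option names `hashAlgorithm` and `depth` and the `resolveOptions` helper are guesses for the example; the actual cabs option names may be spelled differently.

```javascript
const crypto = require('crypto');

// Hypothetical option names; only the default values (sha256, depth 3)
// come from the comment above.
const DEFAULTS = Object.freeze({ hashAlgorithm: 'sha256', depth: 3 });

// Merge caller options over the defaults and reject obviously bad values.
function resolveOptions(options = {}) {
  const resolved = { ...DEFAULTS, ...options };
  if (!Number.isInteger(resolved.depth) || resolved.depth < 0) {
    throw new RangeError('depth must be a non-negative integer');
  }
  if (!crypto.getHashes().includes(resolved.hashAlgorithm)) {
    throw new Error('unsupported hash algorithm: ' + resolved.hashAlgorithm);
  }
  return resolved;
}

console.log(resolveOptions());                           // defaults
console.log(resolveOptions({ hashAlgorithm: 'sha1', depth: 1 })); // git-like
```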