max-mapper closed this pull request 10 years ago
I did half of this already, so I'm going to need to manually bring your changes in. One thing to think about is that git isn't all that good with large files, so we may want to switch from a single folder level to a sensible number of folder levels (i.e. fewer than 35 like we had before, but more than one).
manually merged
agreed, I think the depth and hash functions should be pluggable
also @dominictarr revealed that he wrote this recently https://github.com/dominictarr/content-addressable-store and pushed it to git yesterday. cabs is pure streams whereas content-addressable-store isn't quite as streamy, but they are similar.
one thing cabs should steal though is https://github.com/dominictarr/content-addressable-store/blob/master/index.js#L77 and then https://github.com/dominictarr/content-addressable-store/blob/master/index.js#L89, which will make sure that corrupted files never get written to the blob folder
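For reference, the trick behind those two lines is to stream into a temp file and only rename it into the blob folder once the write finished cleanly. A minimal sketch of that pattern, assuming a flat blob directory and sha256 keys (the names here are illustrative, not cabs' or content-addressable-store's actual internals):

```js
// Sketch only: write to a temp path, hash as we go, then rename into place.
// An interrupted or failed write leaves a stray temp file behind instead of
// a corrupted entry in the blob folder.
var fs = require('fs');
var path = require('path');
var crypto = require('crypto');

function writeBlob(dir, readable, cb) {
  var tmp = path.join(dir, '.tmp-' + Date.now() + '-' + Math.floor(Math.random() * 1e9));
  var hash = crypto.createHash('sha256');
  var out = fs.createWriteStream(tmp);
  var failed = false;

  readable.on('data', function (chunk) { hash.update(chunk); });
  readable.on('error', fail);
  out.on('error', fail);

  out.on('finish', function () {
    if (failed) return;
    var key = hash.digest('hex');
    // only now does the blob become visible under its content hash
    fs.rename(tmp, path.join(dir, key), function (err) { cb(err, key); });
  });

  readable.pipe(out);

  function fail(err) {
    if (failed) return;
    failed = true;
    fs.unlink(tmp, function () { cb(err); });
  }
}
```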
I'll check that out later. Also, we probably want the stream methods to be part of the class, with the current read and write methods being shortcuts.
also I ran some basic benchmarks:
for a ~700mb AVI: calculating SHA-1 in node takes 2950ms, copying the file in node takes 1704ms, but cabs'ing the file takes 30368ms (with current defaults)
the problem is that if I bump up the limit to, say, 1gb, then byte-stream will buffer 1gb which is baaad. maybe we need to rethink our approach here. I'm willing to bet that if we just got rid of the file limit then it would be faster + simplify things a lot more (e.g. store entire blobs as single files)
of course this means that for super huge files we might run into limits, but I'm not as concerned with that as I am with the number of files per directory. I'd rather have cabs be fast
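If anyone wants to reproduce the two baseline numbers (hash time vs plain copy time), something along these lines should do it; the file paths are obviously placeholders:

```js
// rough timing sketch: stream the file through a sha1 hash, then plain-copy it
var fs = require('fs');
var crypto = require('crypto');

var file = './big.avi'; // placeholder for the ~700mb test file

var t0 = Date.now();
fs.createReadStream(file)
  .pipe(crypto.createHash('sha1'))
  .on('finish', function () {
    console.log('sha1:', Date.now() - t0, 'ms');

    var t1 = Date.now();
    fs.createReadStream(file)
      .pipe(fs.createWriteStream('./copy.avi'))
      .on('finish', function () {
        console.log('copy:', Date.now() - t1, 'ms');
      });
  });
```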
My view is that there are 3 layers: the basic one for dealing with blocks, the middle one for streaming, and the high one for dealing with editing and deleting, especially if blocks are shared (not done). The overhead of chunking makes sense if pieces will be edited, but not if they are going to be static; for large static blobs the upper level could skip the middle level.
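To make that layering concrete, here's a toy in-memory sketch; the names and shapes are assumptions for illustration, not what cabs does today:

```js
// toy version of the three layers: block -> chunked stream -> blob
var crypto = require('crypto');

var blocks = {}; // stand-in for the blob folder

// layer 1 (basic): one block in, one content hash out
function writeBlock(buf) {
  var key = crypto.createHash('sha256').update(buf).digest('hex');
  blocks[key] = buf;
  return key;
}

// layer 2 (middle): split input into fixed-size blocks, return the key list
function writeChunked(buf, chunkSize) {
  var keys = [];
  for (var i = 0; i < buf.length; i += chunkSize) {
    keys.push(writeBlock(buf.slice(i, i + chunkSize)));
  }
  return keys;
}

// layer 3 (high): a large static blob can skip the chunking layer entirely
function writeBlob(buf, opts) {
  return opts.static ? [writeBlock(buf)] : writeChunked(buf, opts.chunkSize || 65536);
}
```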
You should definitely have a pluggable hash function. sha1 should not be used in new systems - weaknesses have been found that mean you can generate collisions in 2^52 evaluations (avg). This is infeasible currently, but in a few years it won't be.
estimated time to generate sha1 collision: https://www.schneier.com/blog/archives/2012/10/when_will_we_se.html
weakened to 2^52 https://www.schneier.com/blog/archives/2009/06/ever_better_cry.html
If sha1 hadn't been weakened then it would be 2^80 evaluations - to put this in perspective, 2^80 / 2^52 = 2^28, which makes the attack about 268 million times easier. So if it costs $50k to generate 2^52 hashes, the full 2^80 would cost 50k * 268m, which is about $13 trillion - a substantial fraction of total world GDP.
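Spelling out that arithmetic (the $50k-per-2^52 cost is the assumption above):

```js
console.log(Math.pow(2, 80) / Math.pow(2, 52)); // 2^28 = 268435456, ~268 million
console.log(50e3 * Math.pow(2, 28));            // ~1.34e13, i.e. about $13 trillion
```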
Using sha1 is acceptable if you need to be backwards compatible with other systems currently in use - but if you are building something that you hope may be in use (or future revisions of it may be in use) in 20 years then you should not use sha1.
sha256 is okay, although it's possible to do a length extension attack: if you know `sha256(X)` you can calculate `sha256(X + foo)` even if you don't know what `X` is. This can be avoided if you use double sha256: `sha256(sha256(X))`
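In node that construction is just a nested hash; a minimal sketch:

```js
// double sha256 as described above: sha256(sha256(X))
var crypto = require('crypto');

function sha256(buf) {
  return crypto.createHash('sha256').update(buf).digest();
}

function doubleSha256(buf) {
  return sha256(sha256(buf));
}

console.log(doubleSha256('hello').toString('hex'));
```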
Also, if you want to make a performant blob store, this has some very promising ideas: http://www.youtube.com/watch?v=T4DgxvS9Xho
@calvinmetcalf you mention "editing" - what do you mean here? I'm confused, because you can't edit in a content addressable store - because changing the file means that the hash is now different.
Right, editing would mean replacing. We originally had sha3xx (blanking on the number this second); sounds like we might want to go back to that as a default but have it be choosable.
aha, so you mean removing something and adding a new thing? sha256 is good, the key isn't too long, and there are reasonable implementations in pure js in case you want to run in the browser (if that is a design goal), although blake2s is better for that.
ok guys, I made it default to sha256, but since @maxogden wants to focus on performance for his application, I made everything configurable, including the folder depth (default 3).
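Purely as illustration of what "everything configurable" could look like from the outside; the option names here are hypothetical, not the actual cabs API:

```js
// hypothetical options, names invented for illustration only
var cabs = require('cabs');

var store = cabs('./blobs', {
  hash: 'sha256', // hash algorithm, sha256 by default per this PR
  depth: 3        // number of nested folder levels carved out of the hash
});
```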
to further copy git (why not), this switches to 40-character hex sha1s for folders + files
also this replaces the event-stream dep with stream-combiner (it does the same thing, but is less grab-baggy than event-stream)
we should probably make the `getHash` and `getPath` functions configurable at some point
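Roughly what those two could look like; the names and signatures are assumptions for illustration, not the current internals:

```js
// sketch of pluggable getHash/getPath; getPath does git-style sharding,
// peeling one folder level off the hash per pair of hex characters
var crypto = require('crypto');
var path = require('path');

function getHash(buf) {
  return crypto.createHash('sha256').update(buf).digest('hex');
}

function getPath(hash, depth) {
  var parts = [];
  for (var i = 0; i < depth; i++) {
    parts.push(hash.slice(i * 2, (i + 1) * 2));
  }
  parts.push(hash.slice(depth * 2));
  return path.join.apply(path, parts);
}

// getPath(getHash('hello'), 3) -> '2c/f2/4d/ba5fb0a30e26e8...'
```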