calvinmetcalf / cabs

Content Addressable Blob Store

streaming hash support #8

Closed max-mapper closed 10 years ago

max-mapper commented 10 years ago

I think we're gonna need to rethink the hashing implementation. What I wanna do is stream a file in with limit: Infinity, i.e. only create 1 file in the blob store regardless of how big the incoming file is.

The way it works right now is that every chunk gets stored as a separate file. To change this we would need to rework the relationship between WriteCabs and Cabs.prototype.write so that they can stream entire files, and update the hash in a streaming way (e.g. https://github.com/dominictarr/content-addressable-store/blob/master/index.js#L7-L16)
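For reference, the streaming-hash pattern in that link boils down to something like this in node (sha256 and the hashStream name here are just illustrative, not cabs's actual API):

```js
var crypto = require('crypto')

// update the hash chunk-by-chunk as data flows through, so nothing
// has to be buffered just to compute the digest
function hashStream (stream, cb) {
  var hash = crypto.createHash('sha256')
  stream.on('data', function (chunk) {
    hash.update(chunk)
  })
  stream.on('error', cb)
  stream.on('end', function () {
    cb(null, hash.digest('hex'))
  })
}
```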

calvinmetcalf commented 10 years ago

So like a single write stream for writing, and we could also have read do multiple read streams.


max-mapper commented 10 years ago

oh yea good point.

so if limit is Infinity:

  • write entire input stream into one file, bypassing byte-stream etc
  • by default it should write to a temporary file, both for consistency and also because we don't know the hash ahead of time
  • for reading it is simpler because you just have to read one file

if limit is not Infinity:

  • I think instead of buffering using byte-stream here we should use fs.createWriteStream and stop when we hit the limit, then open another write stream until the input stream empties
  • maybe we need an 'approximate' setting that will write the entire last chunk to the current fs.createWriteStream stream. e.g. if the limit is 5MB and a 500KB chunk comes in when we've written 4.9MB to the current file already: if 'approximate' is true we should just write the whole 500KB chunk, otherwise if false we should slice the 500KB down to 100KB so that the file is exactly 5MB. approximate mode cuts down on unnecessary buffer slices; the default should probably be false (see the sketch after this list)
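A rough sketch of that limit-based write path, including the proposed 'approximate' behaviour. nextChunkPath() is a hypothetical helper standing in for however cabs names chunk files, and the corner case of one chunk spanning more than two files is ignored:

```js
var fs = require('fs')

// rotate to a fresh chunk file whenever the current one hits the limit
function writeChunked (input, limit, approximate, nextChunkPath) {
  var written = 0
  var out = fs.createWriteStream(nextChunkPath())

  function rotate () {
    out.end()
    out = fs.createWriteStream(nextChunkPath())
    written = 0
  }

  input.on('data', function (chunk) {
    if (written + chunk.length <= limit) {
      out.write(chunk)
      written += chunk.length
      if (written === limit) rotate()
    } else if (approximate) {
      // overshoot the limit by writing the whole chunk, saving a slice
      out.write(chunk)
      rotate()
    } else {
      // slice so the current file lands exactly on the limit
      var room = limit - written
      out.write(chunk.slice(0, room))
      rotate()
      out.write(chunk.slice(room))
      written = chunk.length - room
    }
  })
  input.on('end', function () {
    out.end()
  })
}
```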

calvinmetcalf commented 10 years ago

So for reading we can just always open a read stream; with a stream of chunks like now we can just have concatenated streams.
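One way to do that concatenation is to pipe each chunk file, in order, into a single PassThrough; a minimal sketch, where `paths` (a hypothetical argument) is the ordered list of chunk files for the blob:

```js
var fs = require('fs')
var stream = require('stream')

// pipe each chunk file in sequence into one PassThrough so the
// caller sees a single continuous stream
function readConcatenated (paths) {
  var out = new stream.PassThrough()
  function next (i) {
    if (i === paths.length) return out.end()
    var file = fs.createReadStream(paths[i])
    file.on('end', function () { next(i + 1) })
    file.pipe(out, { end: false })
  }
  next(0)
  return out
}
```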

For writing we can have chunked mode and file mode: chunked mode is what we have now (we can improve it later), and file mode buffers to a temp file and incrementally calculates the hash.

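A minimal sketch of that file mode: spool the input to a temp file while updating the hash, then rename to the digest once the stream ends. The writeFileMode name, the Date.now() temp naming, and sha256 are placeholders, and error handling is omitted:

```js
var fs = require('fs')
var path = require('path')
var crypto = require('crypto')

// hash incrementally while spooling to a temp file, then rename the
// temp file to its content hash once the digest is known
function writeFileMode (input, dir, cb) {
  var hash = crypto.createHash('sha256')
  var tmp = path.join(dir, 'tmp-' + Date.now())

  input.on('data', function (chunk) {
    hash.update(chunk)
  })
  input.pipe(fs.createWriteStream(tmp)).on('finish', function () {
    var digest = hash.digest('hex')
    fs.rename(tmp, path.join(dir, digest), function (err) {
      cb(err, digest)
    })
  })
}
```

The rename step matters because the file's final name is its hash, which can't be known until the whole stream has been consumed.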

calvinmetcalf commented 10 years ago

ok so read now streams each of the files that it reads, and I added a writeFile method which does the whole buffer-to-disk thing. I can probably rewrite writeStream in terms of that instead of write.

calvinmetcalf commented 10 years ago

writeStream now uses writeFile internally, so the whole chunk is buffered in a file rather than in memory, meaning a chunk size in the gigabytes should be (theoretically) possible.

calvinmetcalf commented 10 years ago

closing this as we've implemented all of it except for approximate support