aheckmann / gridfs-stream

Easily stream files to and from MongoDB

Let native stream implementation decide if stream can be paused #47

Closed - ceari closed this 10 years ago

ceari commented 10 years ago

Bypassing the native stream's pause behaviour can cause it to emit an 'end' event before the last data chunk is emitted. Depending on the stream consumer, this leads to incomplete files being read from GridFS. For example, when the stream is piped to an HTTP response, the response is terminated before the last chunk is sent.

The problem is described in more detail in issue #46 which also contains a link to a script reproducing the problem. The problematic behaviour occurs randomly, probably depending on the order the chunks are read (or partially read) from MongoDB.
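
For illustration, here is a minimal sketch of the delegation this PR proposes. The names and wiring are illustrative, not gridfs-stream's actual internals; it only assumes the 1.x driver's gridStore.stream() returning a pausable native stream:

    // Illustrative wrapper around the native GridStore stream (sketch only).
    var Stream = require('stream').Stream;
    var util = require('util');

    function GridReadStream(gridStore) {
      Stream.call(this);
      this.readable = true;
      this._store = gridStore.stream(true);          // native GridStore stream
      this._store.on('data', this.emit.bind(this, 'data'));
      this._store.on('end', this.emit.bind(this, 'end'));
    }
    util.inherits(GridReadStream, Stream);

    // Let the native implementation decide whether it can pause between
    // chunks, instead of short-circuiting pause() in the wrapper and risking
    // an early 'end' before the last chunk is emitted.
    GridReadStream.prototype.pause = function () {
      this._store.pause();
    };

    GridReadStream.prototype.resume = function () {
      this._store.resume();
    };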

yocontra commented 10 years ago

+1 - this module can't be piped to http responses due to this

Reggino commented 10 years ago

Sorry for the late response.

I've adjusted and added your sample test to the unit tests, but the test doesn't fail in my case. Could you add a test (or adjust the existing one) to confirm the issue is fixed?

ceari commented 10 years ago

Thank you for looking into this! I adjusted the test case to fail in 580f717a2f01fdcf93aa5734b3bd15405fceb3f1. The issue probably only occurs when there are multiple chunks to read.

Reggino commented 10 years ago

Indeed, the issue seems related to larger files with multiple chunks. I've isolated the unit test a bit further around this, merged your code and created a new release.

Thanks!

@aheckmann could you bump the version in npm?

riaan53 commented 9 years ago

Hi @ceari @Reggino @aheckmann ,

First of all, thank you for this awesome package!!

Sorry to comment on a closed PR but it looked like the right place for the question.

I'm trying to implement HTTP byte-serving capabilities using gridfs-stream for HTML5 video streaming (https://github.com/vsivsi/meteor-file-collection/pull/27). I'm having an issue with the pause/resume functionality: I'm unable to control the flow of the stream because it won't pause after a chunk. It looks like I can only pause before the stream begins or after it is done.

Piping to an HTTP response works perfectly, as this PR intended, but it pipes all the chunks immediately without looking at the flow/capacity of the write stream. For example, when a movie is paused the write stream does not want more data and will emit 'drain' once it needs more. I think this can also matter on a not-so-good internet connection.

This is what I'm trying to do, and more or less what pipe in node >= v0.10 also handles:

    var fs = require('fs');

    // `res` below is the http.ServerResponse for the current request
    var readStream = fs.createReadStream(file_path);
    readStream.on('data', function(data) {
        var flushed = res.write(data);
        // Pause the read stream when the write stream gets saturated
        console.log( 'streaming data', data.length );
        if(!flushed){
            readStream.pause();
        }
    });

    res.on('drain', function() {
        // Resume the read stream when the write stream gets hungry 
        console.log('Stream hungry');
        readStream.resume();
    });

    readStream.on('end', function() {
        res.end();
    });

That pause never stops the stream when reading from gridfs-stream, but the example above works perfectly when reading the stream from fs.createReadStream. I also tried wrapping the stream with http://nodejs.org/api/stream.html#stream_readable_wrap_stream, and while it looked promising, it did not work either.
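
(For reference, the wrap() attempt looks roughly like this; `gfs` is assumed to be a gridfs-stream Grid instance and `res` an HTTP response:)

    var Readable = require('stream').Readable;

    // wrap() adapts an old-style 'data'-push stream into a pull-based
    // streams2 Readable, which should restore pause/backpressure semantics.
    var wrapped = new Readable();
    wrapped.wrap(gfs.createReadStream({ filename: 'movie.mp4' }));
    wrapped.pipe(res);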

Any ideas on what I can try next? :)

Thank you.

Regards, Riaan

Reggino commented 9 years ago

Hi Riaan,

Could you add a test for this exact behavior? That would make debugging things a lot easier!

Thanks.

ceari commented 9 years ago

Hi Riaan,

Since my change to the pause function lets the GridStore stream implementation decide whether to pause or not, it might be worth looking into what that implementation actually does when pause() is called. In version 1.4.3 of the mongodb node driver (declared as a dependency in package.json), the implementation is this: https://github.com/mongodb/node-mongodb-native/blob/V1.4.3/lib/mongodb/gridfs/readstream.js#L168. I suspect the "executing" flag in this function is never actually set to false once piping has started, so the stream never reacts to pause() calls.

I would try using more recent mongodb node driver versions with gridfs-stream to check whether this is a bug that has already been fixed. It seems that in version 2.0 of the driver they even changed the stream implementation entirely.

riaan53 commented 9 years ago

Hi @Reggino @ceari ,

I have added a simple failing test case to illustrate the start of what I'm trying to achieve. See: https://github.com/riaan53/gridfs-stream/commit/411290aee3cc44937125d8ccac3ff98caa8cbecc

Thank you for the help, Daniel - I will take a more detailed look at the stream implementation in mongo and the changes in the new driver.

Will let you know what I discover.

Regards, Riaan

vsivsi commented 9 years ago

Please correct me if I'm wrong, but gridfs-stream uses whatever mongodb driver your application passes to it, right? This is relevant here because meteor-file-collection is currently published using npm mongodb v1.4.7, so that should be the version actually being used by @riaan53 in his testing within meteor. I plan to bump it to v1.4.28 with the next release of meteor-file-collection, so I need to keep track of which mongodb driver versions implement which features.

As an aside, are there any plans to upgrade gridfs-stream to implement actual node.js v0.10-style streams? I imagine that may require changes in the mongodb gridfs code as well... It would be a big plus, since they do a much better job of handling flow control when implemented end-to-end for all sources, sinks and transforms, and they are (almost entirely) backwards compatible with old-style streams: http://nodejs.org/api/stream.html#stream_compatibility_with_older_node_versions

riaan53 commented 9 years ago

Hi All,

I have tested with the mongodb driver that gridfs-stream currently uses, and I'm getting the exact same behaviour as with the streams in gridfs-stream.

Then I started to read up on the new v2+ mongo driver. It clearly states that they have implemented node.js v0.10-style streams with the pipe functionality that controls the flow of the stream, as @vsivsi mentioned above.

See the pipe method: http://mongodb.github.io/node-mongodb-native/2.0/api/GridStoreStream.html
New mongo driver v2+ streams: http://mongodb.github.io/node-mongodb-native/2.0/tutorials/streams/

So what this means is that I will have to use npm mongodb v2.0.14 with gridfs-stream. As @vsivsi asked: are there any plans to use the new v2+ mongo driver in gridfs-stream? I will give it a go, but I think it may be a bit above my pay grade :)

Any help/comments will be appreciated.

Thank you.

Regards, Riaan

vsivsi commented 9 years ago

I'm happy to pitch in and help with this. I'll do some testing with the new driver tonight and report back what I find.

riaan53 commented 9 years ago

Thank you! I will do the same.

vsivsi commented 9 years ago

Initial testing with mongodb@2.0.14 within meteor-file-collection suggests it works fine.

I am loading it directly from npm in the meteor package definition and passing an instance through to gridfs-locking-stream, which in turn passes it on to gridfs-stream.

I'll do a bit more testing on this to make sure it seems stable with very large files etc.

So the question is, what needs to change in gridfs-stream to fully enable node 0.10 streams when used with the mongodb 2.x node driver?

riaan53 commented 9 years ago

Sounds good!

My own testing did not go so well, but when I tried it with your method I had some promising results.

The first thing I tried was just to run the gridfs-stream package tests with the new driver. I'm still unable to get the new mongodb driver working in gridfs-stream for the tests; I'm getting some errors that I'm working through. All I did was change the mongodb dev dependency in package.json to "mongodb": "~2.0" and run the tests.

I also tried what you did by just changing it in the meteor package definition file - and it works like that! I did some quick tests and can immediately see a difference. See my results below, using just the normal pipe method (I get the same results with a pause/resume-on-drain implementation). The only problem I'm getting now is that I'm receiving 'end' too soon.

I have uploaded a 51,381,683-byte mp4 movie with meteor file-collection and I'm playing it back. This is done with my PR adding byte-serving capabilities.

Code:

      # `stream` is the gridfs read stream; `res` is the node http response
      if stream
         total = 0
         res.writeHead statusCode, headers
         stream.on 'data', (data) ->
           total += data.length
           console.log "streaming data chunk size: ", data.length
           console.log "streaming data total size: ", total

         res.on "drain", ->
           console.log 'Stream hungry - resume'

         stream.pipe(res)
            .on 'close', () ->
               res.end()
            .on 'error', (err) ->
               res.writeHead(500)
               res.end(err)

      else
         res.writeHead(410)
         res.end()

Results with the new mongodb driver (mongodb: '2.0.14'):

I20150123-08:23:55.543(2)? start read: 0
I20150123-08:23:55.602(2)? end read: 51381682
I20150123-08:23:55.602(2)? streaming data chunk size:  2097152
I20150123-08:23:55.602(2)? streaming data total size:  2097152
I20150123-08:23:55.603(2)? streaming data chunk size:  2097152
I20150123-08:23:55.603(2)? streaming data total size:  4194304
I20150123-08:23:55.621(2)? Stream hungry - resume
I20150123-08:23:55.621(2)? streaming data chunk size:  2097152
I20150123-08:23:55.621(2)? streaming data total size:  6291456
I20150123-08:23:59.839(2)? Stream hungry - resume
I20150123-08:23:59.840(2)? streaming data chunk size:  2097152
I20150123-08:23:59.840(2)? streaming data total size:  8388608
I20150123-08:24:05.944(2)? Stream hungry - resume
I20150123-08:24:05.945(2)? streaming data chunk size:  2097152
I20150123-08:24:05.945(2)? streaming data total size:  10485760
I20150123-08:24:13.260(2)? Stream hungry - resume
I20150123-08:24:13.261(2)? streaming data chunk size:  2097152
I20150123-08:24:13.261(2)? streaming data total size:  12582912
I20150123-08:24:19.839(2)? Stream hungry - resume
I20150123-08:24:19.839(2)? streaming data chunk size:  2097152
I20150123-08:24:19.839(2)? streaming data total size:  14680064
I20150123-08:24:25.073(2)? Stream hungry - resume
I20150123-08:24:25.074(2)? streaming data chunk size:  2097152
I20150123-08:24:25.074(2)? streaming data total size:  16777216
I20150123-08:24:31.787(2)? Stream hungry - resume
I20150123-08:24:31.788(2)? streaming data chunk size:  2097152
I20150123-08:24:31.788(2)? streaming data total size:  18874368
I20150123-08:24:39.590(2)? Stream hungry - resume
I20150123-08:24:39.596(2)? streaming data chunk size:  2097152
I20150123-08:24:39.596(2)? streaming data total size:  20971520
I20150123-08:24:48.855(2)? Stream hungry - resume
I20150123-08:24:48.856(2)? streaming data chunk size:  2097152
I20150123-08:24:48.856(2)? streaming data total size:  23068672
I20150123-08:24:56.169(2)? Stream hungry - resume
I20150123-08:24:56.169(2)? streaming data chunk size:  2097152
I20150123-08:24:56.169(2)? streaming data total size:  25165824
I20150123-08:25:00.469(2)? Stream hungry - resume
I20150123-08:25:00.470(2)? streaming data chunk size:  2097152
I20150123-08:25:00.470(2)? streaming data total size:  27262976
I20150123-08:25:07.383(2)? Stream hungry - resume
I20150123-08:25:07.383(2)? streaming data chunk size:  2097152
I20150123-08:25:07.384(2)? streaming data total size:  29360128
I20150123-08:25:14.698(2)? Stream hungry - resume
I20150123-08:25:14.698(2)? streaming data chunk size:  2097152
I20150123-08:25:14.698(2)? streaming data total size:  31457280
W20150123-08:25:20.463(2)? (STDERR) Warning: gridfs-locking-stream received duplicate end/close events for file _id: 3f62b8c686e8938452c020a4

Results with the old mongodb driver (mongodb: '1.4.19'):

I20150123-08:28:42.408(2)? start read: 0
I20150123-08:28:42.414(2)? end read: 51381682
I20150123-08:28:42.428(2)? streaming data chunk size:  2097152
I20150123-08:28:42.429(2)? streaming data total size:  2097152
I20150123-08:28:42.437(2)? streaming data chunk size:  2097152
I20150123-08:28:42.437(2)? streaming data total size:  4194304
I20150123-08:28:42.441(2)? streaming data chunk size:  2097152
I20150123-08:28:42.441(2)? streaming data total size:  6291456
I20150123-08:28:42.448(2)? streaming data chunk size:  2097152
I20150123-08:28:42.448(2)? streaming data total size:  8388608
I20150123-08:28:42.453(2)? streaming data chunk size:  2097152
I20150123-08:28:42.453(2)? streaming data total size:  10485760
I20150123-08:28:42.459(2)? streaming data chunk size:  2097152
I20150123-08:28:42.459(2)? streaming data total size:  12582912
I20150123-08:28:42.464(2)? streaming data chunk size:  2097152
I20150123-08:28:42.464(2)? streaming data total size:  14680064
I20150123-08:28:42.470(2)? streaming data chunk size:  2097152
I20150123-08:28:42.470(2)? streaming data total size:  16777216
I20150123-08:28:42.475(2)? streaming data chunk size:  2097152
I20150123-08:28:42.475(2)? streaming data total size:  18874368
I20150123-08:28:42.479(2)? streaming data chunk size:  2097152
I20150123-08:28:42.480(2)? streaming data total size:  20971520
I20150123-08:28:42.484(2)? streaming data chunk size:  2097152
I20150123-08:28:42.484(2)? streaming data total size:  23068672
I20150123-08:28:42.491(2)? streaming data chunk size:  2097152
I20150123-08:28:42.491(2)? streaming data total size:  25165824
I20150123-08:28:42.503(2)? streaming data chunk size:  2097152
I20150123-08:28:42.504(2)? streaming data total size:  27262976
I20150123-08:28:42.507(2)? streaming data chunk size:  2097152
I20150123-08:28:42.507(2)? streaming data total size:  29360128
I20150123-08:28:42.513(2)? streaming data chunk size:  2097152
I20150123-08:28:42.513(2)? streaming data total size:  31457280
I20150123-08:28:42.517(2)? streaming data chunk size:  2097152
I20150123-08:28:42.518(2)? streaming data total size:  33554432
I20150123-08:28:42.522(2)? streaming data chunk size:  2097152
I20150123-08:28:42.522(2)? streaming data total size:  35651584
I20150123-08:28:42.528(2)? streaming data chunk size:  2097152
I20150123-08:28:42.528(2)? streaming data total size:  37748736
I20150123-08:28:42.532(2)? streaming data chunk size:  2097152
I20150123-08:28:42.532(2)? streaming data total size:  39845888
I20150123-08:28:42.538(2)? streaming data chunk size:  2097152
I20150123-08:28:42.539(2)? streaming data total size:  41943040
I20150123-08:28:42.548(2)? streaming data chunk size:  2097152
I20150123-08:28:42.549(2)? streaming data total size:  44040192
I20150123-08:28:42.556(2)? streaming data chunk size:  2097152
I20150123-08:28:42.556(2)? streaming data total size:  46137344
I20150123-08:28:42.561(2)? streaming data chunk size:  2097152
I20150123-08:28:42.562(2)? streaming data total size:  48234496
I20150123-08:28:42.568(2)? streaming data chunk size:  2097152
I20150123-08:28:42.568(2)? streaming data total size:  50331648
I20150123-08:28:42.576(2)? streaming data chunk size:  1050035
I20150123-08:28:42.577(2)? streaming data total size:  51381683

You can clearly see the difference: the new driver throttles depending on the write stream's capacity, while the old driver just spits out all the data at once.

So, two questions:

  1. Why am I receiving the 'end' event too soon with the new driver?
  2. Why am I unable to run the new driver directly in gridfs-stream's tests, while it works in the meteor package?

Will continue with this tonight.

Regards, Riaan

vsivsi commented 9 years ago

I've done preliminary work to get a version of gridfs-stream fully working with the new mongodb 2.x driver. It now passes all of the gridfs-stream unit tests. The code is on a branch of my fork of gridfs-stream here: https://github.com/vsivsi/gridfs-stream/tree/mongodb_2.x

It wasn't too much work: some small behavior changes (e.g. no "w+" mode, due to concurrency bugs in gridFS I reported a while back), no BSONPure object, and errors are now thrown when attempting to open a nonexistent file for reading.

I'm sure it's not perfect, but I think I got all of the low-hanging fruit covered. The next task is to look hard at all of the flow-control logic in the ReadStream implementation and make sure it still makes sense in the context of node v0.10 streams.

Enjoy.

riaan53 commented 9 years ago

Hi @vsivsi ,

Thank you! :)

I will start with some comparison testing of read/write streams with the new gridfs-stream on mongodb v2.x versus the mongodb v2.x gridfs on its own, to see if they behave the same.

Then I will start looking at the flow control in gridfs-stream to see if it complies with node v0.10 streams.

Will update soon with my results.

riaan53 commented 9 years ago

Hi @vsivsi ,

Sorry for taking so long to reply. I am busy with some prototypes and will post my findings soon.

The first problem in gridfs-stream with the new node-style streams is that GridReadStream and GridWriteStream inherit from Stream; that needs to change to Readable and Writable, implementing the _write and _read methods as in http://nodejs.org/api/stream.html#stream_api_for_stream_implementors. But I have another idea: just create a http://nodejs.org/api/stream.html#stream_class_stream_passthrough stream (sketched below), so that gridfs-stream does not re-implement a stream on top of a stream and instead passes data straight through to the mongodb gridfs streams. This would give us all the goodness of gridfs-stream and the new node-style streams without reinventing the wheel. It might have to become a new package altogether, as I'm not sure the changes would be pulled in?
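
A minimal sketch of that PassThrough idea (names are assumptions; it takes the 2.x driver's gridstore.stream() to be a proper node v0.10 stream, as discussed above):

    var util = require('util');
    var PassThrough = require('stream').PassThrough;

    function GridWriteStream(gridStore) {
      PassThrough.call(this);
      // Everything written to this wrapper flows straight into the driver's
      // native stream, which handles backpressure itself.
      this.pipe(gridStore.stream());
    }
    util.inherits(GridWriteStream, PassThrough);

    function GridReadStream(gridStore) {
      PassThrough.call(this);
      // The native stream is the source; the wrapper only exists to keep the
      // gridfs-stream API (events, lazy open, etc.) in place.
      gridStore.stream().pipe(this);
    }
    util.inherits(GridReadStream, PassThrough);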

Any comments welcome, and will post my results/findings soon.

Regards, Riaan

vsivsi commented 9 years ago

@riaan53 No worries, I've got plenty of other things to work on. :smile: I would like to push a new update to file-collection by the end of this week if possible, though.

One comment on gridfs-stream: with the new mongodb 2.x driver functionality, the "read" side of gridfs-stream now seems mostly unnecessary. However, the mongodb gridFS driver doesn't implement any write-stream functionality, so that part of gridfs-stream is still useful. If we need to fork because the gridfs-stream guys aren't ready to support the mongodb 2.x functionality yet, we can just pull the modified code into gridfs-locking-stream and remove its dependency on gridfs-stream.

Reggino commented 9 years ago

@riaan53 Of course you're free to start a new package, but I'd be happy to pull in changes for a version that is compliant with 0.10 streams, even if that requires fundamental changes in the way things work. We can always bump the major version.

riaan53 commented 9 years ago

@Reggino I would prefer to pull it into this package if possible, and I will probably need your help as well :)

riaan53 commented 9 years ago

@vsivsi @Reggino - I will commit a workable solution in a day or so. It won't be 100% complete with all the edge cases yet, but I just want some early feedback in case I'm approaching this incorrectly.

riaan53 commented 9 years ago

I'm basically ready to commit tonight - I just want to test the solution in https://github.com/vsivsi/meteor-file-collection with gridfs-locking-stream.

What is the best way to do this? @vsivsi, can I first create a PR against the fork you made of gridfs-stream, and then, once we are happy with everything, a PR against a new branch here?

@Reggino - Will you create a branch for us to do the PR against?

There have been some major changes and a lot of testing is still needed, but initial testing shows good flow in and out of gridfs for 10 GB files with less than 250 MB of memory used - thanks to the new node streams. I will explain everything in the PR tonight. This fixes https://github.com/aheckmann/gridfs-stream/issues/58 as well.

Regards, Riaan

Reggino commented 9 years ago

Sounds good :+1:

I've just added a branch called 1.0.0 for this purpose.

vsivsi commented 9 years ago

@riaan53 No need to maintain a bunch of forks here. You can just check in to your fork at https://github.com/riaan53/gridfs-stream and create a PR from there. You're into this way deeper than I am at this point, so do whatever is easiest for yourself.

So long as you haven't changed the gridfs-stream API, gridfs-locking-stream should "just work", with the caveat that now that I've uttered those words, I'm probably doomed to eat them.

riaan53 commented 9 years ago

@vsivsi ok, cool! I have done an initial commit (not yet ready for a PR! it still needs testing) to my fork at https://github.com/riaan53/gridfs-stream/tree/mongodb_2.x. All tests pass; I have removed 3 now-invalid tests and added 2 more. I'm still busy testing it with file-collection, but there's something not working nicely :( It does not look like it's re-assembling the chunks from the resumable upload or something - all the chunks are there. Still trying to work through it. All file-collection tests also pass. It would be great if you could review/try it out and let me know what is causing this - I assume there will be some issues :) Remember to remove the BSONPure object in gridfs-locking-stream's index.js (two places), and to update the mongodb dev dependency for the tests.

Basically, what I have changed in gridfs-stream is to make the writestream and readstream inherit from http://nodejs.org/api/stream.html#stream_class_stream_passthrough and then just pipe into the open gridstore.stream() for the writestream, and from gridstore.stream() for the readstream. See the streams at http://mongodb.github.io/node-mongodb-native/2.0/api-docs/

The public API should be the same, as the passing tests show.

Got two questions:

  1. w+ is gone in the new mongodb driver - will this affect anything in file-collection?
  2. In meteor-file-collection, when I stream a movie (playback) the read lock will time out after 90 seconds, destroy the readstream and then end the movie. How should I handle this? Renew the read lock on each new data chunk, maybe? But that will still cause problems when someone pauses the movie (readstream). Maybe have an option not to destroy the stream on lock timeout?

Below you can see an example of the new node-style stream flow control (pipe) for write and read in gridfs-stream with my changes.

Writestream:

OK! Sending chuck size: 65536
Total: 65536
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 131072
writestream open (The store is only open now, previous data in stream buffers so its ok. Writestream saturated so it wont send anymore if store takes longer to open)
OK! Sending chuck size: 65536
Total: 196608
gridstore writestream hungry send more data!
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 262144
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 327680
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 393216
OK! Sending chuck size: 65536
Total: 458752
gridstore writestream hungry send more data!
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 524288
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 589824
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 655360
OK! Sending chuck size: 65536
Total: 720896
gridstore writestream hungry send more data!
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 786432
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 851968
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 917504
OK! Sending chuck size: 65536
Total: 983040
gridstore writestream hungry send more data!
gridstore writestream hungry send more data!
OK! Sending chuck size: 65536
Total: 1048576
gridstore writestream hungry send more data!
writestream finish
writestream end
writestream close with file data
{ _id: 54ccae66be8eeb451e6db993,
  filename: 'test.tmp',
  contentType: 'binary/octet-stream',
  length: 1048576,
  chunkSize: 261120,
  uploadDate: Sat Jan 31 2015 12:28:54 GMT+0200 (SAST),
  aliases: undefined,
  metadata: undefined,
  md5: 'b6d81b360a5672d80c27430f39153e2c' }

Readstream:

readstream open from app
OK! Sending chuck size: 261120
Total: 261120
fs.createWriteStream hungry send more data!
OK! Sending chuck size: 261120
Total: 522240
fs.createWriteStream hungry send more data!
OK! Sending chuck size: 261120
Total: 783360
fs.createWriteStream hungry send more data!
OK! Sending chuck size: 261120
Total: 1044480
fs.createWriteStream hungry send more data!
OK! Sending chuck size: 4096
Total: 1048576
readstream end from app
readstream close from app
vsivsi commented 9 years ago

@riaan53 Congrats on what you have done so far - it sounds like it's close!

A's to your Q's:

  1. Yes, a couple of the server-side calls support w+ mode; they should be changed to prohibit it and throw if its use is attempted.
  2. You can configure default lock behavior for your FileCollection when you create it. You can set locks to never expire (not recommended), or to something really conservative like 24 hours... See: https://github.com/vsivsi/meteor-file-collection#create-a-new-filecollection-object---server-and-client

Ultimately, you need to decide what the correct behavior is for your application. E.g. what should happen if one user is streaming a file (for a looooong time) and another user/admin tries to remove that same file? The locking scheme will prevent the removal while the file is being read, but you need to figure out what "being read" means, either by setting an appropriately long lockExpiration time (e.g. 6 hours), or by setting it to something shorter (like 15 minutes) and then renewing it each time there is user activity (like reading another chunk). Basically: if the user presses pause and walks away, how long should the file remain ready to stream before requiring them to reinitialize the streaming session?
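
(For example, something along these lines; the option names here are an assumption based on the linked docs, not verified:)

    // Hypothetical: create the collection with a long lock expiration so a
    // paused movie doesn't lose its read lock mid-stream.
    var fc = new FileCollection('movies', {
      lockOptions: {
        lockExpiration: 6 * 60 * 60   // hypothetical option name; seconds
      }
    });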

As for the resumable upload chunk issue: are you seeing the temporary chunks get moved to the correct data file? There is one somewhat hacky step in that process that may no longer work with the mongodb 2.x native driver's 'w+' change (and/or the changes you've made to gridfs-stream). It is here: https://github.com/vsivsi/meteor-file-collection/blob/master/resumable_server.coffee#L78

What this call does is open the newly uploaded file for writing (after all of the data chunks have been moved into place and the file size changed), and then write a "" (zero-length string) to the write stream before closing it. This may seem strange, but what it accomplished was to make the mongodb node.js driver trigger the MongoDB server-side recalculation of the md5 sum over all of the chunks of the now-complete file.

That behavior may have changed in the new driver (it probably did), and if so, I'll need to dig into the mongodb driver code to figure out how to trigger the MongoDB server to complete that calculation by sending it the correct low-level command directly.

vsivsi commented 9 years ago

Here's the "non-hacky" way to trigger the server-side md5 sum recalculation on the chunks of a gridfs file:

Use the db.command() call: http://mongodb.github.io/node-mongodb-native/2.0/api/Admin.html#command

To issue a 'filemd5' admin command: http://docs.mongodb.org/manual/reference/command/filemd5/#dbcmd.filemd5

And write the result to the file record, as is done in the gridFS driver here: https://github.com/mongodb/node-mongodb-native/blob/2.0/lib/gridfs/grid_store.js#L959
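
Putting those three pieces together, a sketch (assuming the default 'fs' root, a connected `db`, and a `fileId`/`callback` from the surrounding code):

    // Ask the MongoDB server to recompute the md5 over all chunks of the
    // file, then store the result on the files record, mirroring what the
    // gridFS driver does internally.
    db.command({ filemd5: fileId, root: 'fs' }, function (err, results) {
      if (err) return callback(err);
      db.collection('fs.files').updateOne(
        { _id: fileId },
        { $set: { md5: results.md5 } },
        callback
      );
    });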

Reggino commented 9 years ago

Looks awesome, congrats with this stuff!

Please review line 122 here: https://github.com/riaan53/gridfs-stream/blob/mongodb_2.x/lib/readstream.js#L122. When changing this to gsReadStream, all unit tests seem to succeed :+1:

riaan53 commented 9 years ago

@Reggino thank you for that! I don't know why it makes a difference, but it does - I thought `this` was the gsReadStream in that context.

@vsivsi I can confirm that that was the issue with the file upload - I will change the md5 recalculation to the "non-hacky" way. I see now from your excellent documentation that you have full control over the locks - thank you for that! :)

Overall it works great in file-collection, but there's just one last issue I'm working through: readstreams end too soon when watching a movie (they end before the last chunk); downloading full files works fine. It's only at the reduced playback rate that something isn't 100%. Will update you later!

riaan53 commented 9 years ago

Hi @Reggino @vsivsi

Hope you are doing well. I'm trying a new approach with the readstream: something similar to http://nodejs.org/api/stream.html#stream_readable_push_chunk_encoding. I will let you know the outcome soon (tonight/tomorrow) after some thorough testing :)

The problem with my previous approach (I think...) is that the stream implementation from gridstore pushes too large a chunk (I can't seem to set it?) into the new passthrough stream each time, which oversaturates the stream immediately and disturbs the flow. I'm now just using the read method from gridstore and implementing the _read method of the new node Readable streams.

The passthrough-stream approach for the writestream implementation works nicely, because we are just piping it through from one end to the other, and the source also uses smaller chunks.
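
A sketch of that _read approach, assuming the 2.x driver's gridStore.read(length, callback) plus its length/position properties (names illustrative):

    var util = require('util');
    var Readable = require('stream').Readable;

    function GridReadStream(gridStore) {
      Readable.call(this);
      this._store = gridStore;
    }
    util.inherits(GridReadStream, Readable);

    // _read is called by the stream machinery only when the consumer wants
    // data, so a paused consumer simply stops it from firing - no more
    // oversaturating the buffers with whole gridfs chunks.
    GridReadStream.prototype._read = function (size) {
      var self = this;
      var store = self._store;
      var remaining = store.length - store.position;
      if (remaining <= 0) return self.push(null);   // EOF
      store.read(Math.min(size, remaining), function (err, data) {
        if (err) return self.emit('error', err);
        self.push(data);
      });
    };
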

Regards, Riaan

Reggino commented 9 years ago

Sounds like a proper way to do it! I'll review the actual code whenever you are ready :-)

riaan53 commented 9 years ago

Thank you! :)

vsivsi commented 9 years ago

Sounds like great work @riaan53, thanks again for taking this on.

vsivsi commented 9 years ago

Hi @riaan53, does the gridstore readstream implementation push an entire gridfs chunk at a time? Because that is settable. Ideally, for efficiency, the stream implementations would be designed to deal in units of the gridfs chunkSize. That way a single record of the chunks collection is hit for each read/write, and it can easily be configured on a per-file basis when the file is created.
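
(For instance, gridfs-stream already exposes this when creating a file; the value here is illustrative:)

    // Create the file with a 255 KB chunkSize so each stream read/write maps
    // onto one record of the chunks collection.
    var ws = gfs.createWriteStream({ filename: 'movie.mp4', chunkSize: 255 * 1024 });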

riaan53 commented 9 years ago

Hi @vsivsi @Reggino ,

@vsivsi you are 100% correct: it uses the chunkSize as the streaming chunk size.

I have tried a few ways of approaching this: passthrough streams, using the native gridstore stream implementation for write and read, and implementing a new stream with _read and _write.

The one that worked best, speed- and reliability-wise, was the new stream implementation. So the writestream now uses http://nodejs.org/api/stream.html#stream_class_stream_writable_1 and the readstream http://nodejs.org/api/stream.html#stream_readable_read_size_1 (see the sketch below).
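
The write side then reduces to a small _write, roughly like this (a sketch, assuming the driver's gridStore.write(buffer, callback)):

    var util = require('util');
    var Writable = require('stream').Writable;

    function GridWriteStream(gridStore) {
      Writable.call(this);
      this._store = gridStore;
    }
    util.inherits(GridWriteStream, Writable);

    // The stream machinery calls _write once per chunk and waits for `done`,
    // so GridStore's own write pace is what provides the backpressure.
    GridWriteStream.prototype._write = function (chunk, encoding, done) {
      this._store.write(chunk, done);
    };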

I'm still busy testing with meteor-file-collection, but can you please check if there is something I missed or that you want me to change? I really don't mind - I want this to work 100%! https://github.com/riaan53/gridfs-stream/tree/mongodb_2.x

Here is a little demo (temporary link, for future visitors) of the end result: upload (drag and drop) a web-safe video that your browser supports (mp4/webm in Chrome, mp4 in Safari with the correct codecs) and then use the seek bar as you wish. With every new seek a range request to the gridstore is made, if it wasn't cached already. Our own mini YouTube on mongodb :)

It still needs some testing, plus the md5 recalculation in meteor-file-collection.

Just a note: there is nothing wrong with the new mongodb streams in the version 2 driver, and they show some impressive speeds when uploading the same file twice to the same collection and filename. I think they check the md5 for the chunk before uploading, but I haven't looked into it. Other than that, our implementation is comparable in speed and resources used.

Thanks!

Regards, Riaan

riaan53 commented 9 years ago

Sorry, I see the demo app link is down - I will put it back up tonight.

riaan53 commented 9 years ago

You can view the demo here: (link removed - please ask if you want the demo back up again)

If you upload a new video you might have to refresh the page - the upload-done check for this demo is client-side only, and some server processing is needed to re-assemble the chunks after an upload. So if the range is requested too soon it will be invalid, but hey, it's only a demo :)

Let me know what you think.

Reggino commented 9 years ago

Whoa! Nice! :+1: Respect. Really cool stuff and a nice, snappy demo.

When I was looking at your code the last time I had the same idea about using that native gridstore stream, but when I tried to get predefined-size chunks from that stream I went from issue to issue. I think this is a really nice approach and the code looks good.

If your benchmark code is available for publishing, that would be nice too... :-D.

Anyway, I'd be happy to merge in a PR.

Very nice, congrats!

vsivsi commented 9 years ago

Agreed, it looks great! Let me know when you're ready to start merging. Three projects will need to come into sync on this, and once it's committed somewhere here, I can do more integration testing of my own for gridfs-locking-stream and meteor-file-collection.

vsivsi commented 9 years ago

Oh, and thanks, now I have that damn TS song stuck in my head... :8ball:

riaan53 commented 9 years ago

@vsivsi hehe sorry about that :)

Thank you @vsivsi and @Reggino !

PR is in and you can find the comparison results with the native mongodb streams here: https://github.com/riaan53/gridfs-stream-vs-native

I think it would be better to run the comparison in isolation, in separate scripts from a clean db each time, if you want a really accurate result. This was just done quickly on my PC while downloading and surfing... :)

@vsivsi I need to make 2 small changes to https://github.com/vsivsi/meteor-file-collection/pull/27 - I will do it tomorrow night, but you can test with the current one in the meantime. And I still need to do the md5 recalculation.