aheckmann / gridfs-stream

Easily stream files to and from MongoDB
MIT License
615 stars 120 forks source link

gridfs-stream read stream is significantly slower than the driver's GridStore.stream #44

Closed zamnuts closed 9 years ago

zamnuts commented 10 years ago

This issue is only about gridfs-stream's createReadStream method; I haven't tested writing since there is no equivalent driver method available, and writing is faster in my experience than reading with this module (10mb/s sustained write vs <5mb/s diminishing read), although Grid.{put,get} is still faster on all accounts, but doesn't offer near the flexibility of Node.js's stream API.

My findings show that the driver's GridStore.stream method is some 40-80 times faster than gridfs-stream's createReadStream method. I'm testing this with a ~360mb ISO file that I have stored in a test database. I'm hoping someone could shed some light on this performance issue: gridfs-stream's createReadStream is slow.

The script I'm using to benchmark performs 6 "downloads" from the database in series, alternating between the driver and gridfs-stream (3 each). I have ran this test repeatedly and found similar results each time. The node application and the mongodb servers are running on the same machine (my localhost); mongodb is actually a small cluster with 1 mongos, 1 config and 3 shards (no replset), although the fs collections for this test are not sharded. I haven't tested this on any serious clusters or environment yet, just locally. Forgive the slop in the following code, I just threw this together right quick.

var async = require('async'), // https://github.com/caolan/async
    mongodb = require('mongodb'), // https://github.com/mongodb/node-mongodb-native/
    gridstream = require('gridfs-stream'), // https://github.com/aheckmann/gridfs-stream/
    fs = require('fs');

mongodb.MongoClient.connect('mongodb://localhost:27017/test',{},function(err,db) {

    var gridfsStream = function gridfsStream(_id,pathOut,callback) {
        var start = process.hrtime();
        new gridstream(db,mongodb)
            .createReadStream({_id:_id})
            .pipe(fs.createWriteStream(pathOut))
            .once('close',function gfsDone() {
                var diff = process.hrtime(start),
                    time = diff[0]*1e9+diff[1];
                console.log('gridfsStream complete to "%s", total time was %dns.',pathOut,time);
                callback(null,{type:'gridfsStream',nanosec:time,path:pathOut});
            });
    };

    var driverStream = function driverStream(_id,pathOut,callback) {
        var start = process.hrtime();
        new mongodb.GridStore(db,mongodb.ObjectID(_id),'r').open(function(err,gfs) {
            gfs
                .stream()
                .pipe(fs.createWriteStream(pathOut))
                .once('close',function driverDone() {
                    var diff = process.hrtime(start),
                        time = diff[0]*1e9+diff[1];
                    console.log('driverStream complete to "%s", total time was %dns.',pathOut,time);
                    callback(null,{type:'driverStream',nanosec:time,path:pathOut});
                });
        });
    };

    console.log('starting...');
    var totalTime = process.hrtime();
    async.series([
        driverStream.bind(this,'53e4e99ed07aee1c2c4f84b4','drv1.iso'),
        gridfsStream.bind(this,'53e4e99ed07aee1c2c4f84b4','gfs2.iso'),
        driverStream.bind(this,'53e4e99ed07aee1c2c4f84b4','drv3.iso'),
        gridfsStream.bind(this,'53e4e99ed07aee1c2c4f84b4','gfs4.iso'),
        driverStream.bind(this,'53e4e99ed07aee1c2c4f84b4','drv5.iso'),
        gridfsStream.bind(this,'53e4e99ed07aee1c2c4f84b4','gfs6.iso'),
    ],function asyncComplete(err,res) {
        var diff = process.hrtime(totalTime),
            time = diff[0]*1e9+diff[1],
            drvAvg = avg(res,'driverStream'),
            gfsAvg = avg(res,'gridfsStream');
        console.log('done (%dns), drv avg is %dns, gfs avg is %dns.',time,drvAvg,gfsAvg);
        db.close(function(){process.exit();});
    });

    function avg(arr,type) {
        var tmp = arr.map(function(o) {
            return o.type === type ? o.nanosec : null;
        }).filter(Boolean);
        return tmp.reduce(function(p,c,i) {
            return p + c;
        },0) / tmp.length;
    }

});

The output of this script is something like:

C:\dev\tmp\gfsbench>node gfsbench.js
starting...
driverStream complete to "drv1.iso", total time was 3798770184ns.
gridfsStream complete to "gfs2.iso", total time was 294048819355ns.
driverStream complete to "drv3.iso", total time was 4075356774ns.
gridfsStream complete to "gfs4.iso", total time was 49156390473ns.
driverStream complete to "drv5.iso", total time was 3657257146ns.
gridfsStream complete to "gfs6.iso", total time was 276763432028ns.
done (631507755476ns), drv avg is 3843794701ns, gfs avg is 206656213952ns.

And just to be sure, all the md5 checksums look ok:

C:\dev\tmp>fciv gfsbench -md5
//
// File Checksum Integrity Verifier version 2.05.
//
96de4f38a2f07da51831153549c8bd0c gfsbench\drv1.iso
96de4f38a2f07da51831153549c8bd0c gfsbench\drv3.iso
96de4f38a2f07da51831153549c8bd0c gfsbench\drv5.iso
96de4f38a2f07da51831153549c8bd0c gfsbench\gfs2.iso
96de4f38a2f07da51831153549c8bd0c gfsbench\gfs4.iso
96de4f38a2f07da51831153549c8bd0c gfsbench\gfs6.iso

System/Environment:

Reggino commented 10 years ago

@zamnuts thanks for this.

Could you please update to version 0.5.2 and verify if the issue still exists?

yocontra commented 10 years ago

@Reggino 0.5.2 hasnt been published yet FYI

zamnuts commented 9 years ago

@Reggino the issue has been resolved.

I concur with @contra that 0.5.1 is still the published version on NPM.

I pulled down 0.5.3 (from git) and ran the test again against a ~300mb ISO. The read performance between gridfs-stream and GridStore within the driver are comparable:

C:\dev\tmp\gfsbench>node gfsbench.js
starting... grifs-stream version: 0.5.3
driverStream complete to "drv1.iso", total time was 3181498935ns.
gridfsStream complete to "gfs2.iso", total time was 3105269272ns.
driverStream complete to "drv3.iso", total time was 3153099400ns.
gridfsStream complete to "gfs4.iso", total time was 3124445812ns.
driverStream complete to "drv5.iso", total time was 3365673537ns.
gridfsStream complete to "gfs6.iso", total time was 3063945377ns.
done (19002900088ns), drv avg is 3233423957ns, gfs avg is 3097886820ns.

MD5 checksums for validity:

C:\dev\tmp>fciv gfsbench -md5
//
// File Checksum Integrity Verifier version 2.05.
//
65cf8a3b7342eb8493f5cd00eb444ae3 gfsbench\drv1.iso
65cf8a3b7342eb8493f5cd00eb444ae3 gfsbench\drv3.iso
65cf8a3b7342eb8493f5cd00eb444ae3 gfsbench\drv5.iso
65cf8a3b7342eb8493f5cd00eb444ae3 gfsbench\gfs2.iso
65cf8a3b7342eb8493f5cd00eb444ae3 gfsbench\gfs4.iso
65cf8a3b7342eb8493f5cd00eb444ae3 gfsbench\gfs6.iso

Note that I did not regression test, and only tested read performance.

Reggino commented 9 years ago

Nice! Thanks

@aheckmann Could you please publish 0.5.3 to npm

Reggino commented 9 years ago

0.5.3 is out!