Tokutek / mongo

TokuMX is a high-performance, concurrent, compressing, drop-in replacement engine for MongoDB | Issue tracker: https://tokutek.atlassian.net/browse/MX/ |
http://www.tokutek.com/products/tokumx-for-mongodb/

investigate MMS compatibility, maybe shove more info into MMS #510

Open leifwalsh opened 11 years ago

leifwalsh commented 11 years ago

Under "hardware", I see nothing, but I'm not running munin

leifwalsh commented 11 years ago

"last ping" looks good; "daily ping" is empty, and I don't know what that is

leifwalsh commented 11 years ago

haven't tried profile data or logs. I assume profiling would work as well as normal profiling does (which I haven't tried yet), and logging should be fine; probably not worth looking into yet

leifwalsh commented 11 years ago

actually "db storage" appears to show something useful

leifwalsh commented 11 years ago

also need to try this with a replica set and a sharded cluster

leifwalsh commented 11 years ago

"db stats" tab looks mostly fine

leifwalsh commented 11 years ago

looks like all this info comes from 'serverStatus', so we should just add whatever we want to display to that
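To illustrate, an MMS-style agent drives its charts by plucking well-known dotted paths out of the serverStatus document, so anything added to that document can be surfaced. A minimal sketch, with an abridged hypothetical document standing in for a real one (a real agent would obtain it via pymongo's `db.command("serverStatus")`):

```python
# Sketch of how an MMS-style agent reads chart data out of a serverStatus
# document. The sample document below is abridged and hypothetical; a real
# agent would fetch it with db.command("serverStatus") via pymongo.

server_status = {
    "opcounters": {"insert": 120, "query": 450, "update": 88},
    "connections": {"current": 10, "available": 809},
    "mem": {"resident": 512, "virtual": 2048},  # MiB
}

def pluck(doc, dotted):
    """Walk a dotted path like 'opcounters.query' through nested dicts."""
    for part in dotted.split("."):
        doc = doc[part]
    return doc

print(pluck(server_status, "opcounters.query"))    # → 450
print(pluck(server_status, "connections.current")) # → 10
```

Any engine-specific counters added under their own serverStatus section would be reachable the same way.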

michaeldauria commented 11 years ago

"page faults" normally shows when mongo has to go to disk to get the data it needs to fulfill a query

leifwalsh commented 11 years ago

For us this would be ft fetches (but we can break it down a little further than that).

-- 

Cheers, Leif


michaeldauria commented 11 years ago

"db storage" is just how much disk space is used; I am sure you have this info.

All the other comments make sense to me.

leifwalsh commented 11 years ago

Yeah, I think I saw db storage working eventually; it just looked blank at first because I had just started up MMS

-- 

Cheers, Leif


byzhang commented 11 years ago

Sometimes MMS is very slow to refresh when TokuMX is under about 5k ops (though the load on the server is not that large). Not sure whether it's caused by the MMS agent or by TokuMX.

leifwalsh commented 11 years ago

How often does MMS typically refresh? How slow is it on such a TokuMX instance?


Cheers, Leif

zkasheff commented 11 years ago

An unrelated note that should go here: we've noticed a small discrepancy between the internal performance stats reported by TokuMX and those reported by MongoDB; specifically, the 'start' and 'end' values of the 'oplog' section of an MMS agent ping:

...,
"oplog": {
    "start": {"$date": "[ISO timestamp]"},
    "rsStats": { ... },
    "end": {"$date": "[ISO timestamp]"}
},
...

The discrepancy is that MongoDB reports these values as BSON Timestamps instead of BSON Dates. For reference, the start/end values are populated by these two Python lines of the freely available MMS agent (blockingStats.py:224):

oplogStats["start"] = localConn[oplog].find( limit=1, sort=[ ( "$natural", pymongo.ASCENDING ) ], fields={ 'ts': 1 } )[0]["ts"]
oplogStats["end"] = localConn[oplog].find( limit=1, sort=[ ( "$natural", pymongo.DESCENDING ) ], fields={ 'ts': 1 } )[0]["ts"]

leifwalsh commented 11 years ago

For that, I think we should either change the way we display these types to "seem like" what MMS wants, while trying to maintain the information we're actually presenting (which is just an advisory estimate anyway, right?), or modify the MMS agent to interpret these values differently. For option 2, maybe we can package our own agent and make it still work with mms.mongodb.org, or maybe we need MongoDB Inc.'s help and changes in the webserver? Do you know what MMS does with these values? Would it still be meaningful for us to try to provide something for them?
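For option 1, the conversion itself is cheap: a BSON Timestamp carries a seconds-since-epoch value plus an increment, and a BSON Date is just a point in time, so emitting a Date means dropping the increment and converting the seconds part. A sketch with a plain (seconds, increment) pair standing in for bson.Timestamp, so nothing beyond the standard library is assumed:

```python
# Sketch: a BSON Timestamp is (seconds-since-epoch, increment); a BSON Date
# is a wall-clock instant. Making the oplog 'ts' "seem like" what MMS wants
# amounts to discarding the increment. A plain pair of ints stands in for
# bson.Timestamp here so the sketch is self-contained.
from datetime import datetime, timezone

def timestamp_to_date(ts_seconds, ts_increment):
    """Turn a (seconds, increment) BSON-Timestamp pair into a UTC datetime."""
    return datetime.fromtimestamp(ts_seconds, tz=timezone.utc)

d = timestamp_to_date(1379000000, 5)
print(d.isoformat())  # a mid-September 2013 instant, increment discarded
```

The information lost is only the per-second ordering increment, which should be fine if MMS only uses these values to chart the oplog window.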


Cheers, Leif

byzhang commented 11 years ago

It tries to refresh every minute, but sometimes it takes a couple of minutes.


leifwalsh commented 11 years ago

Ok. I don't think I saw this when I tried it with cortisol (which generates a very high load). Would you be able to share your workload?

Cheers, Leif


byzhang commented 11 years ago

5k updates per second, ~10 connections.

ankurcha commented 10 years ago

It might be worth documenting the metrics we should pay attention to. We use diamond and graphite to monitor TokuMX. Knowing what to watch out for would help the operational side and let us identify issues by building equivalent dashboards inside graphite.

leifwalsh commented 10 years ago

@byzhang we believe there were some issues with MMS doing queries for the beginning of the oplog that would take a long time if there were a bunch of deletes from trimming. In 1.4, using a partitioned oplog instead of a trimmer should have fixed this problem completely.

@ankurcha the best stuff to monitor is in db.serverStatus() and is documented in the User's Guide, but we'll work on improving the documentation of how to monitor TokuMX effectively and update this ticket.
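As a rough illustration of the diamond/graphite approach mentioned above, a collector could flatten serverStatus() counters into graphite's plaintext protocol ("path value timestamp" lines). The metric prefix and the sample document here are hypothetical; a real collector would fetch db.serverStatus() with pymongo and send these lines to graphite's line receiver (TCP port 2003 by default):

```python
# Sketch: render serverStatus-style counters as graphite plaintext lines
# ("<path> <value> <timestamp>"). The "tokumx" prefix and sample document
# are hypothetical; a real collector would fetch db.serverStatus() via
# pymongo and write these lines to graphite over a TCP socket.

def graphite_lines(doc, timestamp, prefix="tokumx"):
    """Flatten nested numeric counters into dotted graphite metric lines."""
    lines = []
    for key, value in sorted(doc.items()):
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            lines.extend(graphite_lines(value, timestamp, path))
        elif isinstance(value, (int, float)):
            lines.append(f"{path} {value} {timestamp}")
    return lines

status = {"opcounters": {"insert": 120, "query": 450}}
for line in graphite_lines(status, 1379000000):
    print(line)
# prints:
# tokumx.opcounters.insert 120 1379000000
# tokumx.opcounters.query 450 1379000000
```

Whichever counters the User's Guide ends up recommending would slot into this scheme unchanged, since only numeric leaves are emitted.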