Tokutek / mongo

TokuMX is a high-performance, concurrent, compressing, drop-in replacement engine for MongoDB | Issue tracker: https://tokutek.atlassian.net/browse/MX/ |
http://www.tokutek.com/products/tokumx-for-mongodb/

TokuMX database failure after multiple long running MapReduce tasks. #1211

Closed ofeldt closed 9 years ago

ofeldt commented 9 years ago

I have multiple long-running MapReduce tasks over multiple collections aggregating results in dynamically created collections for time-series data.

After finishing development of those JavaScript map and reduce functions (~5 per use case), I had them running successfully without errors for 5 to 10 days.

About 10 days later, I noticed the MapReduce process was producing empty collections and looked at the log files. The log showed a failure when renaming the temporary collection (I use nonAtomic: true) to its final name. The error given was:

[conn17] mr failed, removing collection :: caused by :: 10076 rename failed: { errmsg: "exception: E11000 duplicate key error.", code: 11000, ok: 0.0 }

This looked like an ordinary duplicate-key error on a unique index. My result collection did not have a unique key, and just to rule out index problems, I dropped all indexes on the input collection and recreated them.

After restarting the server I still wasn't able to get those MapReduce tasks running again. I then dropped all the other dynamically created collections and rebooted the server again.

After that, the MapReduce worked fine for ~1 day, then randomly failed again on "rename", this time with this error:
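For context, here is a minimal sketch of what a MapReduce invocation of this kind might look like. The report does not show the actual command, so the collection names, the `hour` field, and the choice of the `merge` output mode are all assumptions for illustration (in MongoDB, `nonAtomic` applies to the `merge` and `reduce` output modes). The helper `buildMapReduceCommand` is hypothetical and only assembles the command document.

```javascript
// Sketch only: collection names, the `hour` field, and the `merge`
// output mode are assumptions, not taken from the original report.
function buildMapReduceCommand(inputCollection, outputCollection, mapFn, reduceFn) {
  return {
    mapReduce: inputCollection,
    map: mapFn,
    reduce: reduceFn,
    // Results go to a named collection; nonAtomic: true is the option
    // mentioned in the report.
    out: { merge: outputCollection, nonAtomic: true },
  };
}

// Hypothetical time-series aggregation: count documents per hour bucket.
function mapByHour() {
  // `emit` is supplied by the server when this function runs there.
  emit(this.hour, 1);
}

function sumCounts(key, values) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

const command = buildMapReduceCommand("events", "events_hourly", mapByHour, sumCounts);
```

In the mongo shell, a document like `command` would be passed to `db.runCommand(command)`.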

[conn901] mr failed, removing collection :: caused by :: 10076 rename failed: { errmsg: "exception: assertion /data/release_build-linux-c-opt/build/src/mongo/db/storage/dictionary.cpp:70", code: 0, ok: 0.0 }

I logged into the mongo shell on the database and tried to list dbs, but got the same error:

> show dbs
listDatabases failed:{ "errmsg" : "exception: assertion /data/release_build-linux-c-opt/build/src/mongo/db/storage/dictionary.cpp:70", "code" : 0, "ok" : 0} at /data/release_build-linux-c-opt/build/src/mongo/shell/mongo.js:46

Though I could no longer list dbs, I could still "use" one, so I switched to the one with my dynamically created collections and started to drop those. While dropping I got the following related error, this time with a stack trace:

[conn4] parser-development.system.namespaces Assertion failure existing == descriptor /data/release_build-linux-c-opt/build/src/mongo/db/storage/dictionary.cpp 70
0xb3b123 0x9e6337 0x9a13b1 0x9a4290 0x856c94 0x857411 0x941954 0x942211 0x94eb10 0x94eccc 0x8fb2ab 0x8fb4ed 0x968d3c 0x989099 0x978d59 0x97a938 0x97b1ba 0x8e7b15 0x8ea69b 0x90970c
/usr/local/bin/mongod(_ZN5mongo15printStackTraceERSo+0x23) [0xb3b123]
/usr/local/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xb7) [0x9e6337]
/usr/local/bin/mongod(_ZN5mongo7storage10Dictionary4openERKNS_10DescriptorEbb+0x201) [0x9a13b1]
/usr/local/bin/mongod(_ZN5mongo7storage10DictionaryC1ERKSsRKNS_7BSONObjERKNS_10DescriptorEbb+0x60) [0x9a4290]
/usr/local/bin/mongod(_ZN5mongo16IndexDetailsBase4openEb+0xe4) [0x856c94]
/usr/local/bin/mongod(_ZN5mongo16IndexDetailsBase4makeERKNS_7BSONObjEb+0x531) [0x857411]
/usr/local/bin/mongod(_ZN5mongo14CollectionBaseC2ERKNS_7BSONObjEPb+0x3a4) [0x941954]
/usr/local/bin/mongod(_ZN5mongo17IndexedCollectionC2ERKNS_7BSONObjEPb+0x11) [0x942211]
/usr/local/bin/mongod(_ZN5mongo10CollectionC2ERKNS_7BSONObjEb+0xa10) [0x94eb10]
/usr/local/bin/mongod(_ZN5mongo10Collection4makeERKNS_7BSONObjEb+0x4c) [0x94eccc]
/usr/local/bin/mongod(_ZN5mongo13CollectionMap7open_nsERKNS_10StringDataEb+0x2fb) [0x8fb2ab]
/usr/local/bin/mongod(_ZN5mongo13CollectionMap13getCollectionERKNS_10StringDataE+0xad) [0x8fb4ed]
/usr/local/bin/mongod(_ZN5mongo8Database8diskSizeERmS1_+0x6c) [0x968d3c]
/usr/local/bin/mongod(_ZN5mongo16CmdListDatabases3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x5c9) [0x989099]
/usr/local/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x39) [0x978d59]
/usr/local/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xd18) [0x97a938]
/usr/local/bin/mongod(_ZN5mongo12_runCommandsEPKcRKNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x3aa) [0x97b1ba]
/usr/local/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x35) [0x8e7b15]
/usr/local/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x6fb) [0x8ea69b] 
/usr/local/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x6bc) [0x90970c]

I'm unable to drop the remaining collections. I'm unable to run MapReduces. I'm unable to list dbs.

I tried to move the database files to a different location to forcefully remove the collections, but IRC support said I shouldn't try that. As things stand, I might need to drop the entire database, which holds a large amount of collected test data (something I would like to avoid).

Help is appreciated. Thanks.
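Because the rename failures above were only noticed days after they started, a small log scan can help surface them sooner. The following is a rough sketch, assuming the log lines keep the shape of the two "mr failed" excerpts quoted in this report. `findMapReduceFailures` is a hypothetical helper, not part of TokuMX or MongoDB.

```javascript
// Hypothetical helper: scans log text for lines shaped like
// "[connN] mr failed, removing collection :: caused by :: <code> <message>".
function findMapReduceFailures(logText) {
  // Not anchored at line start, so a leading timestamp is tolerated.
  const pattern = /\[(conn\d+)\] mr failed, removing collection :: caused by :: (\d+) (.*)$/;
  const failures = [];
  for (const line of logText.split("\n")) {
    const match = pattern.exec(line.trim());
    if (match) {
      failures.push({
        connection: match[1],
        code: Number(match[2]),
        message: match[3],
      });
    }
  }
  return failures;
}

// Example input built from the first error quoted in this report.
const sampleLog = [
  '[conn17] mr failed, removing collection :: caused by :: 10076 rename failed: { errmsg: "exception: E11000 duplicate key error.", code: 11000, ok: 0.0 }',
  "[conn18] some unrelated log line",
].join("\n");

const failures = findMapReduceFailures(sampleLog);
```

Running such a scan periodically against the mongod log would flag a failed rename soon after it happens.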

tmcallaghan commented 9 years ago

Please move this request over to our Google Group at https://groups.google.com/forum/#!forum/tokumx-user

jmgamboa commented 9 years ago

Hi, I am also running into this issue. So far I have just been renaming the map-reduced collection, which works temporarily. Please let me know if you have a proper solution.