couchbaselabs / cbft

*THIS PROJECT HAS MOVED* from couchbaselabs TO: https://github.com/couchbase/cbft -- no further development will be done here on couchbaselabs/cbft

pIndex files generated on disk are close to 10x the size of the bucket in memory #165

Open abhi-bit opened 9 years ago

abhi-bit commented 9 years ago
➜  data  du -sch webnutshell_21809c13557138e3_* | grep total
592M    total
➜  data  /opt/couchbase/bin/cbstats 0:11210 -b webnutshell all | grep -w mem_used
 mem_used:                           66951456
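
For anyone wanting to repeat the comparison on their own node, here is a minimal sketch of the same measurement in Python. The helper names `total_bytes` and `index_to_mem_ratio` are made up for illustration and are not part of cbft. It sums apparent file sizes, so the result can differ slightly from what `du` reports, and it assumes the `592M` above means 592 MiB.

```python
import glob
import os


def total_bytes(pattern):
    """Sum apparent file sizes under every path matching the glob pattern."""
    total = 0
    for path in glob.glob(pattern):
        if os.path.isfile(path):
            total += os.path.getsize(path)
            continue
        # Directories: walk them and add up every regular file inside.
        for root, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
    return total


def index_to_mem_ratio(index_bytes, mem_used_bytes):
    """Return how many times larger the on-disk index is than mem_used."""
    return index_bytes / mem_used_bytes


if __name__ == "__main__":
    # Figures from the output above: du total of 592M (assumed MiB)
    # and mem_used of 66951456 bytes.
    print(round(index_to_mem_ratio(592 * 1024 * 1024, 66951456), 2))  # 9.27
```

With a real data directory, `total_bytes("webnutshell_21809c13557138e3_*")` would stand in for the `du` total.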

There are 3 types of documents in this bucket (webnutshell); I've redacted some confidential customer info:

cluster_blob: https://gist.github.com/abhi-bit/6bbbcac3ff75d20b0e00
node_blob: https://gist.github.com/abhi-bit/a8892159fd684c510fb6
customer_blob: https://gist.github.com/abhi-bit/62882eae79602bcda77c

Also, I've noticed that the ratio of cbft index file size to bucket mem_used grows as the bucket's dataset grows. In an earlier deployment, a bucket using ~1GB in memory produced indexes totalling 190GB on disk. I kicked off indexing against that bucket a couple of days ago and will share numbers once indexing completes.

abhi-bit commented 9 years ago

After I switched to goleveldb, index size on disk dropped to 2x-3x bucket mem_used. Indexing is also very noticeably faster with goleveldb than with the default boltdb option. It might make sense to make goleveldb the default kvstore. (Note: I haven't tested anything else besides boltdb.)

mschoch commented 9 years ago

OK, I don't see any arrays. How big is the index? Is it possible to share it with me somehow?

abhi-bit commented 9 years ago

The BoltDB-based indexes were 592MB in size and the LevelDB-based ones are 134MB (bucket mem_used is 64MB). Are you asking for the raw index files from disk, or the bucket data?
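
As a quick sanity check on those figures, the ratios work out as sketched below. This only uses the rounded numbers quoted in this thread (592MB, 134MB, 64MB); the variable names are illustrative.

```python
# Rounded sizes quoted in this thread, in MB.
BUCKET_MEM_USED_MB = 64
index_size_mb = {"boltdb": 592, "goleveldb": 134}

# Index size relative to bucket mem_used, per kvstore.
ratios = {store: size / BUCKET_MEM_USED_MB for store, size in index_size_mb.items()}
for store, ratio in ratios.items():
    print(f"{store}: {ratio:.2f}x bucket mem_used")

# How much smaller the goleveldb index is than the boltdb one.
shrink = index_size_mb["boltdb"] / index_size_mb["goleveldb"]
print(f"goleveldb index is {shrink:.1f}x smaller than boltdb")
```

That gives roughly 9.25x for boltdb and 2.09x for goleveldb, consistent with the "close to 10x" and "2x-3x" descriptions above.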

mschoch commented 9 years ago

Well, with the Bleve index I should be able to reproduce the error and figure out which field in which document it was trying to highlight. I understand it contains some customer-sensitive data, so if there is some secure way for me to download it, that would be ideal.

abhi-bit commented 9 years ago

Passed details over mail

steveyen commented 9 years ago

Figured this might be a good bug to cross-link as it has some (admittedly old) advice: https://github.com/couchbaselabs/cbft/issues/11