Support "collStats" command for Spark compatibility

FoundationDB / fdb-document-layer

A document data model on FoundationDB, implementing MongoDB® wire protocol

Apache License 2.0

The collStats command is used by many tools and frameworks, such as MongoExpress and Spark. Some Spark jobs won't even start unless collStats returns certain stats. The collStats response format is as follows.

{
  "ns" : <string>,
  "count" : <number>,
  "size" : <number>,
  "avgObjSize" : <number>,
  "storageSize" : <number>,
  "capped" : <boolean>,
  "max" : <number>,
  "maxSize" :  <number>,
  "wiredTiger" : {   
  },
  "nindexes" : <number>,         // number of indexes
  "totalIndexSize" : <number>,   // total index size in bytes
  "indexSizes" : {                // size of specific indexes in bytes
          "_id_" : <number>,
          "username" : <number>
  },
  // ...
  "ok" : <number>
}
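
A reply carrying only a subset of these fields could be assembled roughly as in the sketch below. This is an illustration only: the function name `build_coll_stats`, its parameters, and the choice of which fields to include are assumptions, not something this issue or the Document Layer defines. It assumes the count, data size, and per-index sizes are tracked elsewhere and passed in.

```python
def build_coll_stats(db, collection, count, size_bytes, index_sizes):
    """Assemble a reduced collStats-style reply (hypothetical helper).

    index_sizes maps an index name to its size in bytes.
    """
    return {
        "ns": f"{db}.{collection}",
        "count": count,
        "size": size_bytes,
        # Average object size is size / count; integer division is a
        # simplification, and an empty collection is reported as 0 here.
        "avgObjSize": size_bytes // count if count else 0,
        "nindexes": len(index_sizes),
        "totalIndexSize": sum(index_sizes.values()),
        "indexSizes": dict(index_sizes),
        "ok": 1,
    }


if __name__ == "__main__":
    print(build_coll_stats("test", "users", 4, 400, {"_id_": 100, "username": 50}))
```

Fields such as "capped", "max", "maxSize", and "wiredTiger" are left out of the sketch; whether a given client tolerates their absence would need to be checked against that client.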

We don't have to implement all the fields as part of this issue.

count - This is important for Spark jobs to work reasonably well. We can use atomic operations to maintain the count. A trivial implementation would maintain a single counter, which would create a hot key. Considering that write hot keys are not that bad, and that atomic operations don't add read conflict ranges (so concurrent increments don't conflict with each other), this could be a reasonable immediate solution.
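
The single-counter idea can be sketched as follows. This is not the Document Layer's code: `FakeStore`, `COUNT_KEY`, and the helper functions are hypothetical, and the store is an in-memory stand-in so the example runs without a cluster. The part taken from FoundationDB is the operand encoding: its ADD mutation works on little-endian integers, so the delta is packed with `struct.pack("<q", ...)`.

```python
import struct


class FakeStore:
    """In-memory stand-in for a key-value store with an atomic ADD mutation."""

    def __init__(self):
        self._data = {}

    def atomic_add(self, key, operand):
        # Interpret both the stored value and the operand as little-endian
        # signed 64-bit integers. A missing key counts as zero.
        # Overflow handling is ignored in this sketch.
        current = struct.unpack("<q", self._data.get(key, b"\x00" * 8))[0]
        delta = struct.unpack("<q", operand)[0]
        self._data[key] = struct.pack("<q", current + delta)

    def get(self, key):
        return self._data.get(key)


# Hypothetical key; the real key layout is not specified in this issue.
COUNT_KEY = b"hypothetical/collection/count"


def on_insert(store, n=1):
    # With the FoundationDB Python bindings this would be roughly
    # tr.add(COUNT_KEY, struct.pack("<q", n)) inside the insert transaction.
    store.atomic_add(COUNT_KEY, struct.pack("<q", n))


def on_delete(store, n=1):
    # A negative delta decrements the same counter.
    store.atomic_add(COUNT_KEY, struct.pack("<q", -n))


def read_count(store):
    raw = store.get(COUNT_KEY)
    return struct.unpack("<q", raw)[0] if raw is not None else 0


if __name__ == "__main__":
    store = FakeStore()
    on_insert(store)
    on_insert(store, 3)
    on_delete(store)
    print(read_count(store))
```

Because writers only send a delta and never read the counter, the only reader of the key is collStats itself, which is what keeps the single key tolerable under write load.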

To add more detail on the proposed solution for the count:

One solution would be to use a special index to track the number of documents in a collection. This is elegant but requires more code changes.
The other solution would be to add another field in the metadata directory, just as we track the metadata version. This is simple but hackish, since it breaks through all the abstractions and won't be transactional.

I decided to go with the first approach. As for the potential hot key, I think it's not as bad as metadataVersion, which is read on every query. One reason is that atomic operations are designed to be hammered with write traffic. The other is that (I assume) collStats will not be read that often.

Support "collStats" command for Spark compatibility #45