FoundationDB / fdb-document-layer

A document data model on FoundationDB, implementing MongoDB® wire protocol
Apache License 2.0

Support "collStats" command for Spark compatibility #45

Open apkar opened 5 years ago

apkar commented 5 years ago

The collStats command is used by many tools and frameworks, such as MongoExpress and Spark. Some Spark jobs won't even start unless collStats returns certain stats. The collStats response has the following format:

{
  "ns" : <string>,
  "count" : <number>,
  "size" : <number>,
  "avgObjSize" : <number>,
  "storageSize" : <number>,
  "capped" : <boolean>,
  "max" : <number>,
  "maxSize" : <number>,
  "wiredTiger" : {
  },
  "nindexes" : <number>,         // number of indexes
  "totalIndexSize" : <number>,   // total index size in bytes
  "indexSizes" : {               // size of specific indexes in bytes
    "_id_" : <number>,
    "username" : <number>
  },
  // ...
  "ok" : <number>
}

We don't have to implement all of these fields as part of this issue.

count - This is important for Spark jobs to work reasonably well. We can use atomic operations to maintain the count. A trivial implementation would maintain a single counter per collection, which would create a hot key. Since write hot keys are relatively tolerable, and atomic operations don't create read conflict ranges (so concurrent increments don't conflict with each other), this could be a reasonable immediate solution.
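To make the single-counter idea concrete, here is a minimal sketch in Python. It is not the Document Layer's implementation: `FakeStore` is an in-memory stand-in for a key-value store with an ADD-style atomic mutation, and `COUNT_KEY`, `on_insert`, `on_delete`, `read_count` and `coll_stats` are hypothetical names chosen for illustration. The one detail taken from FoundationDB is that the ADD mutation operates on little-endian integers and treats a missing key as zero. With the real Python bindings, the increment would be something like `tr.add(key, struct.pack("<q", 1))`.

```python
import struct

# Hypothetical key holding the document count of one collection.
COUNT_KEY = b"collstats/count"


class FakeStore:
    """In-memory stand-in for a store that supports an atomic ADD mutation."""

    def __init__(self):
        self._data = {}

    def atomic_add(self, key, operand):
        # Values are 64-bit little-endian integers; a missing key counts as zero.
        # The caller never reads the old value, which is why the real
        # operation adds no read conflict range.
        current = self._data.get(key, struct.pack("<q", 0))
        total = struct.unpack("<q", current)[0] + struct.unpack("<q", operand)[0]
        self._data[key] = struct.pack("<q", total)

    def get(self, key):
        return self._data.get(key)


def on_insert(store, n=1):
    """Bump the counter when n documents are inserted."""
    store.atomic_add(COUNT_KEY, struct.pack("<q", n))


def on_delete(store, n=1):
    """Decrement the counter when n documents are removed."""
    store.atomic_add(COUNT_KEY, struct.pack("<q", -n))


def read_count(store):
    """Read the counter; only collStats needs to do this."""
    raw = store.get(COUNT_KEY)
    return 0 if raw is None else struct.unpack("<q", raw)[0]


def coll_stats(store, ns):
    """Build a partial collStats reply containing only the fields we track."""
    return {"ns": ns, "count": read_count(store), "ok": 1}
```

Writers only ever issue blind adds, and the counter is read only when collStats is served, so the hot key sees heavy write traffic but few reads.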

dongxinEric commented 5 years ago

To add more details of the proposed solution for the count:

I decided to go with the first approach. As for the potential hot key, I think it's not as bad as the metadataVersion key, which is read on every query. One reason is that atomic operations are designed to be hammered with write traffic. The other is that (I assume) collStats will not be read that often.