SamVerschueren / dynongo

MongoDB like syntax for DynamoDB
MIT License

Count is wrong if the data set is huge. #60

Open jspreddy opened 6 years ago

jspreddy commented 6 years ago

I have around 300,000 records in my table. I want to get the count of records in my table. Let's stick to no filters for now. I tried three methods:

Method 1:

Transaction.find().count().exec()
    .then((result) => {
      console.log(result);
    });

// 2683

The result of this query is just a number. Expectation: I expect to get the correct count, which is around 300,000. Actual: I get a partial count as an integer value: 2683

Method 2:

Transaction.find().count().raw().exec()
    .then((result) => {
      console.log(result);
    });

// 2683

Expectation: I would expect this to return the raw object along with the ScannedCount, Count, LastEvaluatedKey, ... Actual: I get an integer value: 2683

Method 3:

The only way I see of doing this is fetching the raw result set as shown in the Pagination example, iterating through it to fetch the whole data set, and summing up the Count attribute, like so:

let count = 0;
Transaction.find().raw().exec()
    .then((result) => {
      count += result.Count;
      return Transaction.find().startFrom(result.LastEvaluatedKey).raw().exec();
    })
    .then((result) => {
      // and so on...
    });

Repeating this until there is no LastEvaluatedKey and summing up the counts gives me an accurate count.
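For reference, the full loop can be sketched with an injected page function, so the accumulation logic itself is testable without a live table. The `scanPage` name and the stubbed pages below are hypothetical; in production it would wrap a raw DynamoDB scan (ideally with `Select: 'COUNT'`, so pages carry only counts and no item data):

```javascript
// Sketch: sum the Count of every page until LastEvaluatedKey runs out.
// `scanPage` is a hypothetical injected function:
//   (startKey) => Promise of { Count, LastEvaluatedKey }
// It is injected so the loop can be run against a stub.
async function countAll(scanPage) {
  let total = 0;
  let startKey; // undefined on the first page
  do {
    const page = await scanPage(startKey);
    total += page.Count;
    startKey = page.LastEvaluatedKey;
  } while (startKey);
  return total;
}

// Stubbed two-page result set: 2683 + 1500 records.
const pages = [
  { Count: 2683, LastEvaluatedKey: { id: 'abc' } },
  { Count: 1500 } // no LastEvaluatedKey: this is the last page
];
countAll((key) => Promise.resolve(key ? pages[1] : pages[0]))
  .then((total) => console.log(total)); // 4183
```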

BUT this is not optimal: it fetches the entire data set when I only need a count.

Maybe fix either Method 1 or Method 2?

Gerst20051 commented 5 years ago

I wanted to write a fairly detailed explanation in case people don't understand what is happening.

You can get an estimated count of the items in the table by using describe. This could be implemented in this library with an interface like Table.describe().ItemCount. This value is updated approximately every six hours.
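A rough sketch of such a helper on top of the AWS SDK v2 DescribeTable call (the name `approximateItemCount` is made up; the client is injected so the helper can be exercised with a stub):

```javascript
// Hypothetical helper: read the approximate item count from DescribeTable.
// `client` is expected to expose describeTable({ TableName }) returning an
// object with .promise(), like the AWS SDK v2 DynamoDB client; it is
// injected so the helper can also be called with a stub.
function approximateItemCount(client, tableName) {
  return client.describeTable({ TableName: tableName })
    .promise()
    .then((data) => data.Table.ItemCount); // refreshed roughly every six hours
}

// With the real SDK this would look like:
//   const AWS = require('aws-sdk');
//   approximateItemCount(new AWS.DynamoDB(), 'Transaction').then(console.log);
```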

The only way to know the exact count of items in a DynamoDB table is to scan it. This is a limitation of DynamoDB, not of this library. The .count() method simply returns the data it gets from DynamoDB, and a single scan or query returns at most 1 MB of data.

Beyond that point you need to loop on LastEvaluatedKey, and that loop can start to fail if you don't pace it to match the table's read capacity units: exceed the provisioned capacity and DynamoDB throws either a ThrottlingException or a ProvisionedThroughputExceededException. One read capacity unit allows one strongly consistent read (of up to 4 KB) per second, so with the read capacity set to 50 units it would take on the order of 100 minutes just to count a table with 300,000 records.

https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_Scan.html

"If the total number of scanned items exceeds the maximum data set size limit of 1 MB, the scan stops and results are returned to the user as a LastEvaluatedKey value to continue the scan in a subsequent operation."

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html

"One read capacity unit represents one strongly consistent read per second, or two eventually consistent reads per second."

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Programming.Errors.html#Programming.Errors.RetryAndBackoff
https://aws.amazon.com/premiumsupport/knowledge-center/throttled-ddb/

https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_DescribeTable.html
https://docs.aws.amazon.com/cli/latest/reference/dynamodb/describe-table.html

SamVerschueren commented 5 years ago

I agree with @Gerst20051, this doesn't seem like something we can fix easily. The number you get is the number of items scanned in that single query, not in the entire table.

We could implement something like a table.count().exec() that performs a full table scan, but that would be a very expensive operation (depending on the size of the table).

If you really need the current value in real time, you could attach a DynamoDB Stream to the table and keep a counter in a RecordCount table, incrementing/decrementing it based on the events put on the stream.

It's a little bit more work to set up, but if you use the table name as the partition key, you can get the number of records in milliseconds, even for tables with millions of records.
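The stream-based counter could look roughly like this. The table and attribute names and the Lambda wiring are assumptions, not part of dynongo; the delta computation is split out as a pure function so it can be tested without AWS:

```javascript
// Sketch of a DynamoDB Streams consumer that maintains a live record count.
// Pure part: turn a batch of stream records into a +/- delta.
function countDelta(records) {
  return records.reduce((delta, record) => {
    if (record.eventName === 'INSERT') return delta + 1;
    if (record.eventName === 'REMOVE') return delta - 1;
    return delta; // MODIFY events do not change the count
  }, 0);
}

// Hypothetical Lambda handler: apply the delta atomically to a RecordCount
// table keyed by the source table's name (ADD creates the attribute if it
// is missing). `documentClient` would be an AWS.DynamoDB.DocumentClient.
function makeHandler(documentClient) {
  return (event) =>
    documentClient.update({
      TableName: 'RecordCount',           // assumed counter table
      Key: { tableName: 'Transaction' },  // partition key = source table name
      UpdateExpression: 'ADD recordCount :delta',
      ExpressionAttributeValues: { ':delta': countDelta(event.Records) }
    }).promise();
}
```

Reading the count is then a single GetItem on the RecordCount table, which is why it stays fast regardless of the source table's size.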