cloudfoundry-attic / cf-abacus

CF usage metering and aggregation
Apache License 2.0

Inconsistent data owing to non-transactional writes to the database #469

Open rajkiranrbala opened 7 years ago

rajkiranrbala commented 7 years ago

The applications in the pipeline that reduce usage generate multiple output docs. One of these is the duplicate detection doc; the others are the accumulated ones. With multiple DB partitions, the partition to which each doc is written is determined by the doc's id. If one of these writes fails, the docs that were written to the DB leave the entity in an inconsistent state.

For example, if we start the pipeline with the following configuration

export SAMPLING=86400000
export SLACK=5D
export DB_PARTITIONS=4
npm start

and submit the usage for November 30th on December 3rd

{
  "start": 1480464000000,
  "end": 1480464000000,
  "organization_id": "us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27",
  "space_id": "aaeae239-f3f8-483c-9dd0-de5d41c38b6a",
  "consumer_id": "app:bbeae239-f3f8-483c-9dd0-de6781c38bab",
  "resource_id": "object-storage",
  "plan_id": "basic",
  "resource_instance_id": "0b39fa70-a65f-4183-bae8-385633ca5c87",
  "measured_usage": [
    {
      "measure": "storage",
      "quantity": 1073741824
    },
    {
      "measure": "light_api_calls",
      "quantity": 1000
    },
    {
      "measure": "heavy_api_calls",
      "quantity": 100
    }
  ]
}

the aggregator will produce 3 documents with the following ids (a sketch of how each id maps to its database follows the list):

  1. k/us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27/t/0001480723200000 (written to database abacus-aggregator-aggregated-usage-2-201612)

  2. k/us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27/aaeae239-f3f8-483c-9dd0-de5d41c38b6a/app:bbeae239-f3f8-483c-9dd0-de6781c38bab/t/0001480723200000 (written to database abacus-aggregator-aggregated-usage-3-201612)

  3. k/us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27/0b39fa70-a65f-4183-bae8-385633ca5c87/app:bbeae239-f3f8-483c-9dd0-de6781c38bab/basic/basic-object-storage/object-rating-plan/object-pricing-basic/t/0001480464000000/0001480464000000 (written to database abacus-aggregator-aggregated-usage-3-201611)
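A minimal sketch of how each id above maps to its database name, assuming a hash-mod-DB_PARTITIONS scheme over the key part of the id and a monthly bucket derived from the time part. The helpers and the toy hash are illustrative, not the abacus-partition API, so the partition numbers they compute won't necessarily match the real ones above:

// 'k/<key>/t/<time>...' -> key part and first time component
const key = (id) => id.match(/^k\/(.*)\/t\//)[1];
const time = (id) => parseInt(id.match(/\/t\/(\d+)/)[1], 10);

// Toy string hash picking a partition in [0, partitions)
const partition = (k, partitions) => {
  let h = 0;
  for (const c of k) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h % partitions;
};

// e.g. for doc 1 above with 4 partitions ->
// 'abacus-aggregator-aggregated-usage-<p>-201612'
const dbname = (id, partitions) => {
  const d = new Date(time(id));
  const month = d.getUTCFullYear() * 100 + d.getUTCMonth() + 1;
  return `abacus-aggregator-aggregated-usage-${partition(key(id), partitions)}-${month}`;
};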

If CouchDB is used as the backend database, the writes happen in 3 different HTTP requests. If some writes succeed and others don't, the result is inconsistent accumulated values, as sketched below.
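A minimal sketch of the non-atomic write, assuming Node 18+ (global fetch), a plain CouchDB HTTP API behind a COUCHDB_URL placeholder, and doc bodies trimmed to their ids; the role of each doc is inferred from the key shapes above:

const put = async (db, doc) => {
  const res = await fetch(
    `${process.env.COUCHDB_URL}/${db}/${encodeURIComponent(doc._id)}`,
    {
      method: 'PUT',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(doc)
    });
  if (!res.ok) throw new Error(`PUT to ${db} failed with ${res.status}`);
};

const org = 'us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27';

(async () => {
  // One independent HTTP request per document from the example above
  const results = await Promise.allSettled([
    put('abacus-aggregator-aggregated-usage-2-201612',
      { _id: `k/${org}/t/0001480723200000` }),
    put('abacus-aggregator-aggregated-usage-3-201612',
      { _id: `k/${org}/aaeae239-f3f8-483c-9dd0-de5d41c38b6a/app:bbeae239-f3f8-483c-9dd0-de6781c38bab/t/0001480723200000` }),
    put('abacus-aggregator-aggregated-usage-3-201611',
      { _id: `k/${org}/0b39fa70-a65f-4183-bae8-385633ca5c87/app:bbeae239-f3f8-483c-9dd0-de6781c38bab/basic/basic-object-storage/object-rating-plan/object-pricing-basic/t/0001480464000000/0001480464000000` })
  ]);

  // A mixed outcome leaves the databases inconsistent: there is no
  // transaction, so the successful writes cannot be rolled back
  if (results.some((r) => r.status === 'fulfilled') &&
      results.some((r) => r.status === 'rejected'))
    console.error('partial write: aggregated usage is now inconsistent');
})();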

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/135472301

The labels on this github issue will be updated when the story is started.

hsiliev commented 7 years ago

Do we want to maintain a consistent set of docs in a transactional manner, or just make sure that reporting does not fetch something that's inconsistent? Is it just reporting, or the sink extensions as well?

What are the problems that such an inconsistency causes? I can imagine an incorrect report, but are there other side effects as well?

rajkiranrbala commented 7 years ago

@hsiliev The inconsistency will result in incorrectly aggregated values. The side effects depend on the resource.

Let's say the accumulator posted an accumulated document, for a runtime event, to the aggregator (a toy model after the list illustrates both cases):

  1. If the duplicate detection doc was not written while the org and consumer aggregated documents were written, the caller gets back an error code, say 500, and the accumulator retries reporting the accumulated usage. This usage has already been aggregated at the org and consumer levels, yet it will not be rejected because the duplicate detection document is missing, so it will be aggregated at the org and consumer levels again. If a runtime start event was submitted, this doubles the consumption; if it's a runtime stop event, it shows a decreased consumption value (sometimes negative).

  2. If the duplicate detection write succeeded and the org or consumer aggregation fails, the caller will also retry. This time the document will be rejected as a duplicate, even though it was never aggregated at the org or consumer levels.
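A self-contained toy model of the two cases above (illustrative only, not the Abacus code): three in-memory maps stand in for the partitioned databases, and the failures set injects a write error for a given partition:

const partitions = [new Map(), new Map(), new Map()];

const write = (p, id, doc, failures) => {
  if (failures.has(p)) throw new Error(`partition ${p} unavailable`);
  partitions[p].set(id, doc);
};

const orgTotal = () => (partitions[1].get('org') || { total: 0 }).total;

const submit = (usage, failures = new Set()) => {
  // duplicate detection: reject if the marker doc already exists
  if (partitions[0].has(usage.id)) return 409;
  const errors = [];
  const attempt = (p, id, doc) => {
    try { write(p, id, doc, failures); } catch (e) { errors.push(e); }
  };
  attempt(0, usage.id, { seen: true });                      // dedupe marker
  attempt(1, 'org', { total: orgTotal() + usage.quantity }); // org aggregate
  attempt(2, usage.id, { quantity: usage.quantity });        // consumer doc
  return errors.length > 0 ? 500 : 201;
};

// Case 1: the dedupe write fails but the aggregations succeed; the retry
// passes the duplicate check and the usage is counted twice
submit({ id: 'u1', quantity: 100 }, new Set([0])); // -> 500
submit({ id: 'u1', quantity: 100 });               // -> 201
console.log(orgTotal());                           // 200, not 100

// Case 2: the dedupe write succeeds but the aggregations fail; the retry
// is rejected as a duplicate although nothing was aggregated
submit({ id: 'u2', quantity: 50 }, new Set([1, 2])); // -> 500
submit({ id: 'u2', quantity: 50 });                  // -> 409, usage lost
console.log(orgTotal());                             // still 200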

hsiliev commented 7 years ago

Handling transactions with Couch or Mongo does not seem like a good idea. What if we split the aggregator into consumer and org micro-services?

If consumer aggregation fails we won't have org aggregation, but this will be more consistent than today's behaviour. In the worst scenario it should be the same as case 1 above, but it can also happen that we face problems only at the consumer level and not at the org level (or vice versa).

The drawback is of course a longer pipeline, which needs to be async and probably needs a replay running regularly.
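A rough sketch of what the split could look like, with in-memory maps standing in for the databases; each stage does exactly one local write and forwards downstream only after it succeeds, and a regular replay resubmits docs whose forward failed. All names are hypothetical:

const consumerDb = new Map();
const orgDb = new Map();
const pending = []; // docs whose downstream forward failed

// org micro-service: a single local write, nothing downstream
const aggregateOrg = async (doc) => {
  orgDb.set(doc.id, doc);
};

// consumer micro-service: local write first, forward only once persisted
const aggregateConsumer = async (doc) => {
  consumerDb.set(doc.id, doc);
  try {
    await aggregateOrg(doc);
  } catch (e) {
    pending.push(doc); // picked up by the replay below
  }
};

// replay running regularly, as suggested above
setInterval(() => {
  for (const doc of pending.splice(0))
    aggregateOrg(doc).catch(() => pending.push(doc));
}, 60000);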

nmaslarski commented 7 years ago

Hello, we have managed to reproduce the database inconsistency on MongoDB by sending the same usage (same timestamp, etc.) a few times asynchronously. Usually one of the requests returns 201 Created while the others return status code 409, and some of the 409 responses contain the error E11000 duplicate key error index from the db.

After this error we get inconsistent behavior. The doc with the new usage is usually written to the db correctly, but when getting the org usage we sometimes get the correct usage and sometimes HTTP/1.1 500 Internal Server Error. We've also observed double aggregation when sending a lot of parallel requests.
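A minimal sketch of this reproduction, assuming Node 18+ and an ABACUS_USAGE_URL placeholder for the usage submission endpoint of the local deployment; the payload is the usage doc from the first comment, trimmed to one measure:

const usage = {
  start: 1480464000000,
  end: 1480464000000,
  organization_id: 'us-south:a3d7fe4d-3cb1-4cc3-a831-ffe98e20cf27',
  space_id: 'aaeae239-f3f8-483c-9dd0-de5d41c38b6a',
  consumer_id: 'app:bbeae239-f3f8-483c-9dd0-de6781c38bab',
  resource_id: 'object-storage',
  plan_id: 'basic',
  resource_instance_id: '0b39fa70-a65f-4183-bae8-385633ca5c87',
  measured_usage: [{ measure: 'light_api_calls', quantity: 1000 }]
};

// Submit the identical doc and report the resulting status code
const submit = () =>
  fetch(process.env.ABACUS_USAGE_URL, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(usage)
  }).then((res) => res.status, () => 'network error');

(async () => {
  // 10 identical submissions in parallel; with the bug we typically see
  // one 201 and a mix of 409s (some carrying mongodb's E11000 error)
  console.log(await Promise.all(Array.from({ length: 10 }, submit)));
})();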