Suor / django-cacheops

A slick ORM cache with automatic granular event-driven invalidation.

Weird stale cache issue that I cannot explain #392

Closed · pySilver closed this issue 1 year ago

pySilver commented 3 years ago

I'm on cacheops 5.1 and I'm seeing a constant issue with one aggregation query that is not invalidated automatically for some reason (at least it's unclear what the cause is):

mapping_stats = (
    Merchant.objects.annotate(
        pending_mappings=Count(
            "product_types",
            filter=Q(
                product_types__status=CategoryMapping.STATUS_PENDING,
                product_types__available=True,
            ),
        ),
        pending_products=Sum(
            "product_types__last_import_products",
            filter=Q(
                product_types__status=CategoryMapping.STATUS_PENDING,
                product_types__available=True,
            ),
        ),
    )
    .filter(pending_mappings__gt=0)
    .aggregate(
        total_pending_mappings=Sum("pending_mappings"),
        total_pending_products=Sum("pending_products"),
        total_pending_merchants=Count("pk"),
    )
)

Settings for this Merchant model: "merchants.Merchant": {"ops": "all"}. Other cacheops settings:

CACHEOPS_DEGRADE_ON_FAILURE = env.bool("CACHEOPS_DEGRADE_ON_FAILURE", False)
CACHEOPS_LRU = env.bool("CACHEOPS_LRU", False)
CACHEOPS_PREFIX = lambda q: env.str("CACHEOPS_PREFIX", default="")
CACHEOPS_DEFAULTS = {"timeout": 60 * 60 * 24}  # 24 hours

Cacheops is configured against a Redis cluster (3 sentinels, 1 master & 1 follower).

Gunicorn setup: gunicorn config.wsgi:application --bind 0.0.0.0:5000 --chdir=/app --worker-class sync --worker-tmp-dir /dev/shm --log-file=- --forwarded-allow-ips="*"

The issue: in production this query serves outdated aggregation data from the web process until the TTL expires. If I run ./manage.py invalidate all the problem goes away. If I SSH into any production app server, I can confirm that the query returns the correct response:

In [1]: mapping_stats = (
   ...:             Merchant.objects.annotate(
   ...:                 pending_mappings=Count(
   ...:                     "product_types",
   ...:                     filter=Q(
   ...:                         product_types__status=CategoryMapping.STATUS_PENDING,
   ...:                         product_types__available=True,
   ...:                     ),
   ...:                 ),
   ...:                 pending_products=Sum(
   ...:                     "product_types__last_import_products",
   ...:                     filter=Q(
   ...:                         product_types__status=CategoryMapping.STATUS_PENDING,
   ...:                         product_types__available=True,
   ...:                     ),
   ...:                 ),
   ...:             )
   ...:             .filter(pending_mappings__gt=0)
   ...:             .aggregate(
   ...:                 total_pending_mappings=Sum("pending_mappings"),
   ...:                 total_pending_products=Sum("pending_products"),
   ...:                 total_pending_merchants=Count("pk"),
   ...:             )
   ...:         )

In [2]: mapping_stats
Out[2]:
{'total_pending_mappings': None,
 'total_pending_products': None,
 'total_pending_merchants': 0}

I'm totally sure there is no other cache layer.

Summary: data served by the web worker process is stale, while a direct query returns correct results. For some reason invalidate all solves the issue every time. I'll test .nocache() to verify that the problem does not arise when cacheops is skipped.
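
For reference, a minimal sketch of that comparison, assuming the .nocache() queryset method cacheops provides also covers the .aggregate() call; the import path and the pending_mapping_totals helper name are mine, and the aggregation is abbreviated to a single total:

from django.db.models import Count, Q, Sum

from merchants.models import CategoryMapping, Merchant  # assumed import path

def pending_mapping_totals(base_qs):
    # Abbreviated version of the aggregation above (one total instead of three),
    # just to compare the cached and uncached paths.
    return (
        base_qs.annotate(
            pending_mappings=Count(
                "product_types",
                filter=Q(
                    product_types__status=CategoryMapping.STATUS_PENDING,
                    product_types__available=True,
                ),
            )
        )
        .filter(pending_mappings__gt=0)
        .aggregate(total_pending_mappings=Sum("pending_mappings"))
    )

cached = pending_mapping_totals(Merchant.objects.all())
fresh = pending_mapping_totals(Merchant.objects.all().nocache())
print(cached, fresh, cached == fresh)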

Suor commented 3 years ago

You can try to use @cached_as() and explicitly set invalidation factors there:

@cached_as(Merchant, ProductType.objects.filter(status=...PENDING), ...)
def get_mapping_stats():
    return Merchant...aggregate()
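
A more concrete version of that sketch, with the ellipses filled in from the query in this issue (the related model appears to be CategoryMapping judging by the filters, the import path is assumed, and the aggregate body is trimmed to a single total), might look like:

from cacheops import cached_as
from django.db.models import Count, Q, Sum

from merchants.models import CategoryMapping, Merchant  # assumed import path

# Any save/delete on Merchant, or on a pending & available CategoryMapping,
# should invalidate this cached result.
@cached_as(
    Merchant,
    CategoryMapping.objects.filter(
        status=CategoryMapping.STATUS_PENDING, available=True
    ),
)
def get_mapping_stats():
    return (
        Merchant.objects.annotate(
            pending_mappings=Count(
                "product_types",
                filter=Q(
                    product_types__status=CategoryMapping.STATUS_PENDING,
                    product_types__available=True,
                ),
            )
        )
        .filter(pending_mappings__gt=0)
        .aggregate(total_pending_mappings=Sum("pending_mappings"))
    )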

If that works, then the issue is somewhere in aggregate handling. If not, it's probably somewhere in the setup: Redis master/slave, LRU or whatever.

It's weird, however, that a direct query outputs a correct result. Are you sure you don't cache and reuse the queryset itself?

P.S. You can manually check what you have in Redis by inspecting queryset._cache_key() and looking in Redis under that key. P.P.S. Can you also post the value of queryset._cond_dnfs? It might show whether invalidation factors are inspected correctly.
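
For example, something along these lines (plain redis-py with a placeholder URL; point it at whichever master cacheops actually writes to) would show whether the key exists and when it expires:

import redis

from merchants.models import Merchant  # assumed import path

r = redis.Redis.from_url("redis://cacheops-master:6379/1")  # placeholder URL

qs = Merchant.objects.all()  # or the annotated queryset from this issue
key = qs._cache_key()

print(key)
print("exists:", r.exists(key))
print("ttl:", r.ttl(key))  # -2 = key is gone, -1 = key has no TTL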

pySilver commented 3 years ago

@Suor I'm not observing the issue at the moment, but it happens often so I'll have some feedback very soon.

In order to get _cond_dnfs I had to remove the aggregation part; here are the results:

In [12]: mapping_stats._cond_dnfs
Out[12]: {'merchants_categorymapping': [{}], 'merchants_merchant': [{}]}

But I assume it's a different query since it doesn't have the GROUP BY part. Is there any way to find _cond_dnfs & the cache key for the aggregation query?

What bothers me the most is the fact that a direct query returns correct results. It looks like the running process retrieves stale results for some reason, and cacheops is not the issue here.

I did a little more investigation and found out that during the last event when the issue was observed there was a short master server downtime, so the sentinels started routing connections to the follower. My theory at the moment is this:

  1. master goes down
  2. sentinel elects a new master and reroutes connections to it
  3. the old master comes back up and regains control of the cluster
  4. the Python app cached its connection, so it still reads from the "new" master, which is now back to being a follower
  5. data between master and follower is synchronized, and maybe an incorrect state is introduced here (older data overwrites newer, or something like that)
  6. the problem persists until the TTL expires or ./manage.py invalidate all is called (which is practically the same thing)

I need to read more about data sync after downtime in redis now.
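
Side note: cacheops also has a Sentinel-aware configuration (CACHEOPS_SENTINEL in its README), which should let the client ask the sentinels for the current master instead of pinning a cached connection. A sketch with placeholder hosts, service name and db:

CACHEOPS_SENTINEL = {
    # Placeholder hosts/ports/service name, not this deployment's values.
    "locations": [("sentinel-1", 26379), ("sentinel-2", 26379), ("sentinel-3", 26379)],
    "service_name": "mymaster",
    "socket_timeout": 0.1,
    "db": 1,
}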

Suor commented 3 years ago

To get cond dnfs for aggregate you can use the same code as cacheops uses in QueryMixin.aggregate():

# Apply all aggregates the same way the original .aggregate() does, but do not perform the SQL
kwargs = kwargs.copy()
for arg in args:
    kwargs[arg.default_alias] = arg

qs = self._clone()
for (alias, aggregate_expr) in kwargs.items():
    qs.query.add_annotation(aggregate_expr, alias, is_summary=True)

print(qs._cond_dnfs)
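
Adapted to the queryset from this issue, that would be roughly the following, where base_qs stands for the annotated, filtered queryset quoted at the top (everything before the .aggregate(...) call):

from django.db.models import Count, Sum

aggregates = {
    "total_pending_mappings": Sum("pending_mappings"),
    "total_pending_products": Sum("pending_products"),
    "total_pending_merchants": Count("pk"),
}

qs = base_qs._clone()  # base_qs = the annotate(...).filter(...) queryset above
for alias, aggregate_expr in aggregates.items():
    qs.query.add_annotation(aggregate_expr, alias, is_summary=True)

print(qs._cond_dnfs)
print(qs._cache_key())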

Does Redis really make the old master a master again, and not a slave of the promoted one? Anyway, if an invalidation request was lost due to network issues, or invalidation was performed on a slave (which Python still thinks is the master due to connection caching), then you get into the state you described - stale data. However, a new query hitting the cache should also show you stale results in that case.

Another thing you might consider: using Redis maxmemory might cause stale data unless you set CACHEOPS_LRU = True as described in the README. If you are using maxmemory and didn't set maxmemory-policy volatile-ttl nor CACHEOPS_LRU = True, then it's probable, not simply possible.
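
In settings terms that is roughly one of the two combinations below; the maxmemory-policy lines belong in redis.conf rather than Django settings, and pairing CACHEOPS_LRU = True with allkeys-lru is my reading of the README:

# Option A (default cacheops behaviour): rely on TTLs and make redis evict
# only keys that have one; in redis.conf:
#   maxmemory-policy volatile-ttl
CACHEOPS_LRU = False  # this is the default

# Option B: rely on LRU eviction instead of TTLs; in redis.conf:
#   maxmemory-policy allkeys-lru
# CACHEOPS_LRU = True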