Suor / django-cacheops

A slick ORM cache with automatic granular event-driven invalidation.
BSD 3-Clause "New" or "Revised" License

cache wrongly set to empty list for get query (when accessing a foreignkey field) #474

Closed · saranglakare closed this issue 7 months ago

saranglakare commented 7 months ago

I am observing a random get(..) query returning DoesNotExist for some objects (old objects that always existed in the DB). This is happening when I try to access a field that is a ForeignKey to a model (Vehicle in code below).

On digging deep, I found that Django essentially runs a filter query on the parent, then a count() on it to do a get when I try to access the foreign key field.

So I tried to simply call the filter query from shell and found a surprising result (where 409 is the ID of the object on which I am facing this issue):

In [752]: Vehicle.objects.filter(id=409)
Out[752]: <QuerySet []>

In [753]: Vehicle.objects.filter(id=409).nocache()
Out[753]: <QuerySet [<Vehicle: Vehicle object (409)>]>

On further analysis, I found that Django sets an upper limit of 21 on the filter query in both get() and __repr__ (in recent Django versions, get() fetches at most MAX_GET_RESULTS = 21 rows, and __repr__ fetches REPR_OUTPUT_SIZE + 1 = 21 — which is why both the field access and the filter query above return the same empty list). The query cached by cacheops is thus Vehicle.objects.filter(id=409)[:21]. Indeed, if I check the cached value for this query, I see:

In [748]: qs1c = Vehicle.objects.filter(id=409)
In [749]: qs1c.query.set_limits(high=21)
In [750]: rr.get(qs1c._cache_key())
Out[750]: b'\x80\x04]\x94.'

(The cached value is a pickled empty list)
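As a quick sanity check, those raw bytes can be decoded with nothing but the standard library, which confirms what cacheops stored under that key:

```python
import pickle

# The raw bytes found under the query's cache key in Redis:
cached = b'\x80\x04]\x94.'

# Unpickling shows it is simply a pickled empty list -- the
# serialized form of an empty queryset result.
assert pickle.loads(cached) == []
assert isinstance(pickle.loads(cached), list)
```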

So my problem is: why and when is the cache set to []?

I found that the cache is set in the _fetch_all() function in cacheops/query:

self._result_cache = list(self._iterable_class(self))
self._cache_results(cache_key, self._result_cache)

The only possibility seems to be list(self._iterable_class(self)) returning empty. How can that ever happen though?

I still don't know the condition under which this happens - it's random, but it's only happening for that one foreign field access. I haven't observed any other errors with any other model so far.

Can anyone guide how can I further pin-point when this is happening?

My setup: django-cacheops 6.2, Django 3.2, Python 3.11

Suor commented 7 months ago

This query was probably executed at some point and returned an empty list, so that value was cached. You need to look at where you are executing it — maybe in tests, maybe against another database, etc.

saranglakare commented 7 months ago

This is in production, and 50k+ objects are accessed daily from the same table, yet the problem occurs only for a few. The data is definitely there in the DB, so why would the query ever return an empty list? Even if the query times out or hits another error, I don't see any catch in _fetch_all(), so I am guessing the cache won't be set incorrectly?

I am using 24 hrs timeout on the cache, so as soon as cache is cleared, the field access works again.

Suor commented 7 months ago

Then maybe it's a key clash

saranglakare commented 7 months ago

Can we really have hundreds of key clashes daily? I doubt that. And if that's the case, then cacheops can't be used in production at all, whereas we have been using it for many years now! This is the first time we have noticed an issue.

In any case, I checked a second object with same issue and the keys do not match:

In [763]: qs1c._cache_key() == qs3._cache_key()
Out[763]: False

saranglakare commented 7 months ago

Any further ideas @Suor ? If there's a production issue then many may be facing it without even knowing. We accidentally found out!

Suor commented 7 months ago

It doesn't need to be hundreds of clashes; it could be one between two popular queries. I've never seen this though, or even heard of such an occasion.

As I said the more probable cause is either:

  1. Clash by error - say you make the same query against a different database (query cache keys do not include the database alias by default), or in a different environment, i.e. tests or something, maybe with a mocked response.
  2. A situation where you receive wrong response to database query occasionally. This might happen when connection pools go awry. E.g I've seen this with pgbouncer, usually happens when you don't close and remove connection from a pool on error.
  3. A database exception swallowed in the middle, causing an empty response, or a database cursor read before Django had a chance at it, leaving Django with 0 rows.

Suor commented 7 months ago

Also, a rather common cause for such an issue is caching a lagged response from a slave.

saranglakare commented 7 months ago

Thanks for the pointers @Suor , here's my thoughts on them:

  1. *"Clash by error - say you make the same query against a different database"*: all queries are on a single DB.
  2. *"A situation where you receive a wrong response to a database query occasionally"*: OK, this is a possibility; however, outside of this model we have not seen any issue anywhere so far. Also, I have always seen exceptions raised whenever there's an issue with a pool connection. The Django ORM never returns wrong results - at least that is my experience using it for 10+ years now; I have never encountered unreliability. Pools and connections are managed by the Django ORM; we are not doing anything special or custom at all.
  3. *"A database exception swallowed in the middle, causing an empty response, or a database cursor read before Django had a chance at it"*: any pointers on this? AFAIK a database exception would be re-thrown and never return an empty response. I will search for Django returning 0 rows, but like I said above, I have never experienced or heard of the Django ORM giving wrong results. I am keen to debug this option though.
  4. *"Slave lagged response"*: we are not using any slave. Also, these old rows are in the DB, so they were 100% there when the queries updated the cache.

To further debug the issue, I am thinking of vendoring the cacheops code into our codebase and logging a warning whenever an empty query response is received. This way, we will know when [] is being written to the cache.
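The instrumentation described above could be sketched as a generic wrapper around whatever function performs the cache write (the actual _cache_results signature inside cacheops may differ; this is a hypothetical sketch, not cacheops' real API):

```python
import logging
import traceback

log = logging.getLogger("cacheops_debug")

def log_empty_writes(cache_write):
    """Wrap a cache-writing callable and emit a warning (with a
    stack trace) whenever an empty result list is about to be
    cached, so the offending call site can be identified."""
    def wrapper(cache_key, results, *args, **kwargs):
        if not results:
            log.warning(
                "caching EMPTY result for key %s\n%s",
                cache_key,
                "".join(traceback.format_stack(limit=8)),
            )
        return cache_write(cache_key, results, *args, **kwargs)
    return wrapper
```

Applied to the cache-write path, this leaves normal writes untouched and only logs when [] is about to be stored.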

Do you think this makes sense? Any further thoughts? (Really really appreciate your time!). Thanks!

Suor commented 7 months ago

Instrumenting the code should help - if you know the specific cache key causing trouble, you might log a stack trace whenever it is written. There are some signals which might help, though you might need more elaborate ones for your case; this side of cacheops could be extended.

saranglakare commented 7 months ago

Thank you very much @Suor ! I did add logging to log whenever cacheops was writing [] to the cache. That helped me find out exactly where it was happening. It turns out your guess (1: multiple databases) was absolutely correct! The actual wrong cache write was happening from a query in a totally different codebase. In that codebase, a second database was forcefully being used. Since this was not in the router, I had missed it.

I am wondering if we can make cacheops even more robust to take care of these scenarios. I see that you already have the db_agnostic option in the configuration; however, the default is True. I recommend it be set to False. The only additional code I see is in cache_key(), where it calls md.update(self.db) to incorporate the db name in the key. Are there any negative repercussions of setting db_agnostic to False?
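The effect of mixing the db alias into the key can be illustrated with a toy model of the key derivation (a simplified sketch, not cacheops' actual cache_key() implementation, which hashes more than just the SQL):

```python
import hashlib

def cache_key(sql, db, db_agnostic=True):
    # Toy model of the key derivation: hash the SQL, and mix in
    # the database alias only when db_agnostic is False.
    md = hashlib.md5(sql.encode())
    if not db_agnostic:
        md.update(db.encode())
    return md.hexdigest()

sql = "SELECT * FROM vehicle WHERE id = 409 LIMIT 21"

# With db_agnostic=True (the default) the same SQL run against two
# databases collides on a single key ...
assert cache_key(sql, "default") == cache_key(sql, "archive")

# ... while db_agnostic=False keeps the two results separate.
assert cache_key(sql, "default", db_agnostic=False) != \
       cache_key(sql, "archive", db_agnostic=False)
```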

I forgot to explicitly mention: setting db_agnostic=False on the Vehicle model works perfectly! The cache keys are different when queried over the 2 different DBs, and hence the results remain accurate!

PS: if anyone faces this issue in the future, the first thing to try would be to set db_agnostic to False for the particular model.
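For reference, the per-model setting lives in the CACHEOPS dict in Django settings; something along these lines, where 'myapp.vehicle' is a placeholder for your actual app label and model name:

```python
# settings.py (fragment, assuming the app is called "myapp")
CACHEOPS = {
    # Include the database alias in the cache key for this model,
    # so queries against different DBs never share a cache entry.
    'myapp.vehicle': {
        'ops': 'all',
        'timeout': 60 * 60 * 24,  # the 24 h timeout mentioned above
        'db_agnostic': False,
    },
}
```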

Suor commented 7 months ago

The reason is to support the most common master-slave scenario by default. The second most common scenario, sharding, should also work with no problem, simply because queries to different databases will be different.

So it leaves some exotic scenarios where one wants to use db_agnostic=True.

I agree though that setting it to true by default would have been a safer choice. On the other hand, you are the first one to come up with such an issue, so it doesn't look like this is really causing trouble.

saranglakare commented 7 months ago

Did you mix up True and False in your message? It's a bit tricky to read :-)

I would further suggest: why even have that option? The db should always be included in the cache key, I think. Under what condition would someone want db_agnostic to be True? Why would queries to 2 different databases need to be cached under the same key?

Yes, I am the only one reporting this - but it is an important issue that can cause problems yet go undetected for a long time (like it did for us!). So better to be safe, I think.

Suor commented 7 months ago

Because when I read from slave I don't want cache to be different from when I read from master.

Why would you have different data for databases with same schema? Usually either data should be the same or data should be non-overlapping.

saranglakare commented 7 months ago

Ahh ok.

Why would you have different data for databases with same schema?

We have 2 DBs: one with active data and a second with archived data. The problem was a foreign key in the archived data not pointing to a valid row in the archive DB (the parent table was in the active DB only). So whenever archived data was loaded first, the cache would be set to []. However, if active data was loaded first, the cache would be set correctly and in fact helped show correct data when queried from the archived DB!

So I guess there's no need to change anything in cacheops! I am closing the issue. Thanks a ton @Suor for the active participation & helping me get to the bottom of this quickly.

Suor commented 7 months ago

You are welcome. Glad this is resolved.