This query was probably executed at some point and returned an empty list, and that value was cached. You need to take a look at where you are executing this - maybe in tests, maybe against another database, etc.
This is in production, and 50k+ objects are accessed daily from the same table; the problem shows up only for a few of them. The data is definitely there in the DB, so why would the query ever return an empty list? Even if the query times out or raises some other error, I don't see any catch in `_fetch_all()`, so I am guessing the cache won't be set incorrectly?
I am using a 24 hr timeout on the cache, so as soon as the cache is cleared, the field access works again.
Then maybe it's a key clash
Can we really have hundreds of key clashes daily? I doubt that, and if that's the case, then this can't be used in production at all, whereas we have been using `cacheops` for many years now! This is the first time we have noticed an issue.
In any case, I checked a second object with the same issue and the keys do not match:
```
In [763]: qs1c._cache_key() == qs3._cache_key()
Out[763]: False
```
Any further ideas @Suor? If there's a production issue then many may be facing it without even knowing - we only found out by accident!
It doesn't need to be hundreds of clashes, it could be one between two popular queries. I've never seen this though, or even heard of such an occasion.
As I said, the more probable cause is either:
- a clash by error - say you make the same query against a different database, or
- a situation where you receive a wrong response to a database query occasionally: a database exception swallowed in the middle, causing an empty response, or a database cursor read before Django had a chance at it, Django getting 0 rows.
Also a rather common cause for such an issue could be caching a lagged slave response.
Thanks for the pointers @Suor, here are my thoughts on them:

> a clash by error - say you make the same query against a different database

All queries are on a single DB.

> a situation where you receive a wrong response to a database query occasionally

OK, this is a possibility; however, outside of this model we have not seen any issue anywhere so far. Also, I have always seen exceptions raised whenever there is an issue with a pooled connection. The Django ORM never returns wrong results - at least that is my experience using it for 10+ years now; I have never encountered unreliability. Pools and connections are managed by the Django ORM, we are not doing anything special or custom at all.

> a database exception swallowed in the middle, causing an empty response, or a database cursor read before Django had a chance at it, Django getting 0 rows

Any pointers on this? AFAIK a database exception would be re-thrown and would never return an empty response. I will look into Django returning 0 rows, but as I said above, I have never experienced or heard of the Django ORM giving wrong results. I am keen to debug this option though.

> caching a lagged slave response

We are not using any slave. Also, these are old rows in the DB, so they were 100% there when the queries updated the cache.

To further debug the issue: I am thinking of including the `cacheops` code in our codebase & logging a warning whenever an empty query response is received. This way, we will know when `[]` is being written to the cache.

Do you think this makes sense? Any further thoughts? (Really, really appreciate your time!) Thanks!
Instrumenting the code should help - if you know the specific cache key causing trouble, you might log with stack traces when that key is being written. There are some signals which might help. You might need more elaborate ones for your case though; this side of cacheops could be extended.
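For reference, a minimal sketch of such instrumentation, assuming a monkey-patch of Django's `QuerySet._fetch_all` (this patch, the logger name, and the hard-coded model name are illustrative additions, not part of cacheops):

```python
# Wrap Django's QuerySet._fetch_all and log a stack trace whenever a query on
# the suspect model materializes to an empty list, i.e. the value that would
# end up being cached.
import logging
import traceback

from django.db.models.query import QuerySet

log = logging.getLogger("cacheops.debug")
_original_fetch_all = QuerySet._fetch_all


def _fetch_all_logged(self):
    _original_fetch_all(self)
    if self.model.__name__ == "Vehicle" and self._result_cache == []:
        log.warning(
            "Empty result fetched for %s\nSQL: %s\nDB alias: %s\n%s",
            self.model.__name__,
            self.query,
            self.db,
            "".join(traceback.format_stack(limit=15)),
        )


QuerySet._fetch_all = _fetch_all_logged
```

Depending on whether this patch is applied before or after cacheops installs its own, the log fires just before or just after the cache write, but either way it points at the call site producing the empty result.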
Thank you very much @Suor! I did add logging to record whenever cacheops was writing `[]` to the cache. That helped me find out exactly where it was happening. It turns out your guess (1: multiple databases) was absolutely correct! The actual wrong cache write was happening from a query in a totally different codebase. In that codebase, a second database was forcibly being used. Since this was not in the router, I had missed it.
I am wondering if we can make `cacheops` even more robust to take care of these scenarios? I see that you already have the `db_agnostic` option to use in the configuration. However, the default is `True`. I recommend this should be set to `False`. The only additional code I see is in `cache_key()`, where it calls `md.update(self.db)` to incorporate the db name in the key. Are there any negative repercussions of using `db_agnostic` as `False`?
I forgot to explicitly mention: setting `db_agnostic=False` on the `Vehicle` model works perfectly! The cache keys are different when queried over the 2 different DBs and hence the results remain accurate!
PS: if anyone faces this issue in the future, the first thing to try would be to set `db_agnostic` to `False` for the particular model.
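For anyone searching later, a hedged example of what that per-model setting looks like (the app label `fleet` and the 24 h timeout are assumptions for illustration, not the reporter's actual settings):

```python
# settings.py - cacheops profile sketch; db_agnostic=False makes the database
# alias part of the cache key for Vehicle queries, so two databases no longer
# share cached results for the same SQL.
CACHEOPS = {
    "fleet.vehicle": {"ops": "all", "timeout": 60 * 60 * 24, "db_agnostic": False},
    "*.*": {"ops": (), "timeout": 60 * 60},
}
```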
The reason is to support the most common master-slave scenario by default. The second most common scenario - sharding - should also work, no problem, simply because queries to different databases will be different. So it leaves some exotic scenarios where one wants to use `db_agnostic=True`.
I agree though that setting it to true by default would have been a safer choice. On the other hand, you are the first one to come up with such an issue, so it doesn't look like this is really causing trouble.
Did you mix up True and False in your message? It's a bit tricky to read :-)
I would further suggest: why even have that option at all? The db should always be included in the cache key, I think. Under what condition would someone want `db_agnostic` to be `True`? Why would queries to 2 different databases need to be cached under the same key?
Yes, I am the only one reporting this - but it is an important issue that can cause problems yet go undetected for a long time (like it happened for us!). So better to be safe, I think.
Because when I read from a slave I don't want the cache to be different from when I read from the master.
Why would you have different data in databases with the same schema? Usually either the data should be the same or it should be non-overlapping.
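To make that master/slave case concrete, here is a minimal sketch of the kind of read-replica routing the `db_agnostic=True` default is designed for (the aliases `default` and `replica` and the module path are hypothetical):

```python
# routers.py - reads may hit either database, writes go to the primary; since
# both hold the same data, sharing one cache key per query (db_agnostic=True)
# is exactly the desired behaviour.
import random


class ReplicaRouter:
    def db_for_read(self, model, **hints):
        # Spread reads across the primary and the replica.
        return random.choice(["default", "replica"])

    def db_for_write(self, model, **hints):
        # All writes go to the primary.
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        return True


# settings.py
# DATABASE_ROUTERS = ["myproject.routers.ReplicaRouter"]
```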
Ahh ok.

> Why would you have different data in databases with the same schema?

We have 2 DBs: one with active data and a second with archived data. The problem was with a foreign key in the archived data not pointing to a valid row in the archive DB (the parent table is in the active DB only). So whenever the archived data was loaded first, the cache would be set to `[]`. However, if the active data was loaded first, the cache would be set correctly and in fact helped show correct data when queried from the archived DB!
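To spell out the failure mode (the `archive` alias below is hypothetical; `_cache_key()` is the same internal used earlier in this thread): with the default `db_agnostic=True`, both querysets hash to the same key, so whichever database answers first fills the cache for both.

```python
# Same SQL, different databases - but one shared cache entry with db_agnostic=True.
qs_active = Vehicle.objects.using("default").filter(id=409)[:21]
qs_archive = Vehicle.objects.using("archive").filter(id=409)[:21]

print(qs_active._cache_key() == qs_archive._cache_key())
# True with db_agnostic=True (the collision); False once db_agnostic=False
```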
So I guess there's no need to change anything in `cacheops`! I am closing the issue. Thanks a ton @Suor for the active participation & helping me get to the bottom of this quickly.
You are welcome. Glad this is resolved.
I am observing a random `get(..)` query returning `DoesNotExist` for some objects (old objects that have always existed in the DB). This is happening when I try to access a field that is a `ForeignKey` to a model (`Vehicle` in the code below).

On digging deeper, I found that Django essentially runs a `filter` query on the parent, then a `count()` on it, to do a `get` when I try to access the foreign key field.

So I tried to simply call the filter query from the shell and found a surprising result (where `409` is the ID of the object on which I am facing this issue): the filter returned an empty list, even though the row exists in the DB.

On further analysis, I found that Django sets an upper limit of 21 on the filter query in both `get` and `__repr__` (this is why both the field access and the above filter query return the same empty list). The query cached by cacheops is thus `Vehicle.objects.filter(id=409)[:21]`. Indeed, if I check the cache value for this query, I see that the cached value is empty.
So my problem is: why and when is the cache set to `[]`?

I found that the cache is set in the `_fetch_all()` function in `cacheops/query`. The only possibility seems to be `list(self._iterable_class(self))` returning empty - how can that ever happen though?

I still don't know the condition under which this happens - it's random, but it's only happening for that one foreign key field access. I haven't observed any errors with any other model so far.

Can anyone guide me on how to further pinpoint when this is happening?
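One low-effort observation point, assuming cacheops' documented `cache_read` signal (sender is the model class, `hit` says whether the cache answered), is to log hits and misses for the affected model so you can see when the bad entry starts being served:

```python
import logging

from cacheops.signals import cache_read

log = logging.getLogger("cacheops.debug")


def log_vehicle_cache_read(sender, func, hit, **kwargs):
    # Only log cache reads for the model showing the stale empty results.
    if getattr(sender, "__name__", "") == "Vehicle":
        log.info("cacheops %s for %s", "HIT" if hit else "MISS", sender.__name__)


cache_read.connect(log_vehicle_cache_read, weak=False)
```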
My setup: `django-cacheops 6.2`, `Django 3.2`, `Python 3.11`.