Closed bartaelterman closed 9 years ago
Generally speaking, I prefer to do caching/denormalizing only as a third-stage optimization: first do some basic profiling (to get an idea of where the bottleneck really is), then try less invasive optimizations, like indexes or better use of the ORM to issue fewer SQL queries.
But it's true that in this case the caching doesn't seem too painful, and there's no gross misuse of the ORM at first glance. Another middle-ground option would be to compute the result of cleaned_text_without_stopwords() at save time and store it in a field. But I don't know how much that would help before trying.
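The save-time option could be sketched roughly like this. It's a plain-Python sketch, not the project's actual code: in the real app this would be a model field plus an overridden `Model.save()`, and the stopword list below is just a placeholder.

```python
STOPWORDS = {"the", "a", "an", "and", "of"}  # placeholder list, not the real one


def cleaned_text_without_stopwords(text):
    # Stand-in for the real cleaning function.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)


class Article:
    """Sketch of an article that caches its cleaned text at save time."""

    def __init__(self, text):
        self.text = text
        self.cleaned_text = ""

    def save(self):
        # Compute once when the article is saved, instead of on
        # every aggregation request.
        self.cleaned_text = cleaned_text_without_stopwords(self.text)
```

The trade-off is the usual one for denormalization: reads get cheap, but the stored field must be recomputed whenever the cleaning rules (e.g. the stopword list) change.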
Do you want to implement this ticket or do you want me to do it?
Cheers!
Good points! I will leave the first and second stage optimization to you, so you can append this ticket to your to-do list.
Progress: memoization has been applied to cleaned_text_without_stopwords(). On my machine, the performance gain is 4x, which is good but probably not enough. Next steps:
I can do some profiling to see where the bottleneck is.
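The memoization mentioned above can be sketched with `functools.lru_cache`, assuming cleaned_text_without_stopwords() is (or wraps) a pure function of the raw text. The stopword list here is again a placeholder, not the project's real list.

```python
from functools import lru_cache

STOPWORDS = {"the", "a", "an", "and", "of"}  # placeholder list


@lru_cache(maxsize=None)
def cleaned_text_without_stopwords(text):
    # The first call with a given text computes the result; subsequent
    # calls with the same text are served from the cache.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)
```

Note that this only helps within one process lifetime; the cache is lost on restart, which is why storing the result (per article or per day) is the next step.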
Follow-up, hopefully more readable this time.
What do you think?
I'm working further on this (just to ensure no parallel work).
Hi Nico, OK, sounds great. I'm not really sure how the Django cache works exactly, but I assume it's pretty clever.
Nevertheless, for 2014 we have about 21,000 articles, and for 2015 about 14,000 so far. I think the per-year query will be pretty common; at least, the front end uses it on initial page load, so it's definitely not an edge case. Caching values at the day level would drop the number of items to aggregate from more than 20,000 to 365. Actually, you could already consider our EPU index score per day (the table `epu_index_epuindexscore`) a cache per day. We could have the `calculate_daily_epu` command add the word counts per day to this table as well, and maybe rename it to `cache_per_day` to make that clear. Indeed, the endpoint would then need to be updated too, to use this table rather than the articles table.
Yeah, I took exactly this approach, except I created new commands/tables instead of reusing the existing ones. I wrote one command to calculate for a specific day, and another for all articles in the database (to be used at launch, or if we ever change the stopwords list or the EPU cutoff).
Seems to work pretty well; I'll work on the view now.
If my previous comment is accepted, it would make sense to store the word counts as JSON. Summing the word counts of different articles could then be done like this:
```python
>>> from collections import Counter
>>> c = Counter({'a': 2, 'hi': 4, 'there': 3})
>>> d = Counter({'a': 9, 'ho': 2, 'there': 2})
>>> c + d
Counter({'a': 11, 'there': 5, 'hi': 4, 'ho': 2})
```
Or maybe postgres has some native way to aggregate json data.
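If the per-day counts end up stored as JSON strings, summing them in Python is a small extension of the Counter example above. This is only a sketch; how the JSON reaches this function (a queryset, a raw query) is left open.

```python
import json
from collections import Counter


def sum_json_word_counts(json_strings):
    """Sum word counts stored as JSON objects of the form {word: count}."""
    total = Counter()
    for s in json_strings:
        # Counter.update() with a mapping adds the counts rather than
        # replacing them, which is exactly the aggregation we need.
        total.update(json.loads(s))
    return total
```

Postgres can also operate on JSON server-side, but with at most 365 rows per year to merge, doing it in Python is likely fast enough.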
My comment crossed yours. I'll see what you come up with ;-)
Should we close this ticket now? Sorry for not following your guidelines, it was already half implemented when your message was posted!
Jep, close it. The honour is yours... :-)
I loaded a local database on my machine with all data from 2000 to 2013 and started a local web service. I redirected the front end to the local endpoints. Everything works fine, but the word cloud renders too slowly (about half a minute). We should cache this data further. For example, the finest resolution at which we will need this data is per day, so we could calculate the most frequent terms for each day at the moment we calculate the index score for the previous day.
@niconoe would you agree?
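The per-day precomputation suggested above could look roughly like this. It's a sketch under the assumption that the day's articles are already cleaned (stopwords stripped); in the real project it would run inside the daily command, right after the index score is calculated, and the result would be stored per day.

```python
from collections import Counter


def top_terms_for_day(cleaned_texts, n=50):
    """Return the n most frequent terms across one day's cleaned articles."""
    counts = Counter()
    for text in cleaned_texts:
        counts.update(text.split())
    return counts.most_common(n)
```

The word cloud endpoint would then only merge at most 365 small precomputed results per year, instead of tokenizing more than 20,000 articles on every request.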