Closed bartaelterman closed 9 years ago
Generally speaking, I prefer to do caching/denormalizing only as a third-stage optimization: first do some basic profiling (to get an idea of where the bottleneck really is), then try less invasive optimizations, like indexes or better use of the ORM to issue fewer SQL queries.
But it's true that in this case the caching doesn't seem too painful, and there's no gross misuse of the ORM at first glance. Another middle-ground option would be to compute the result of cleaned_text_without_stopwords() at save time and store it in a field. But I don't know how much that would help before trying.
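The save-time option could be sketched roughly like this. It's a plain-Python sketch, not the project's actual code: in the real app this would be a model field plus an overridden `Model.save()`, and the stopword list below is just a placeholder.

```python
STOPWORDS = {"the", "a", "an", "and", "of"}  # placeholder list, not the real one


def cleaned_text_without_stopwords(text):
    # Stand-in for the real cleaning function.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)


class Article:
    """Sketch of an article that caches its cleaned text at save time."""

    def __init__(self, text):
        self.text = text
        self.cleaned_text = ""

    def save(self):
        # Compute once when the article is saved, instead of on
        # every aggregation request.
        self.cleaned_text = cleaned_text_without_stopwords(self.text)
```

The trade-off is the usual one for denormalization: reads get cheap, but the stored field must be recomputed whenever the cleaning rules (e.g. the stopword list) change.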
Do you want to implement this ticket or do you want me to do it?
Cheers!
Good points! I will leave the first and second stage optimization to you, so you can append this ticket to your to-do list.
Progress: memoization has been applied to cleaned_text_without_stopwords(). On my machine, the performance gain is 4x, which is good but probably not enough. Next steps:
I can do some profiling to see where the bottleneck is.
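The memoization mentioned above can be sketched with `functools.lru_cache`, assuming cleaned_text_without_stopwords() is (or wraps) a pure function of the raw text. The stopword list here is again a placeholder, not the project's real list.

```python
from functools import lru_cache

STOPWORDS = {"the", "a", "an", "and", "of"}  # placeholder list


@lru_cache(maxsize=None)
def cleaned_text_without_stopwords(text):
    # The first call with a given text computes the result; subsequent
    # calls with the same text are served from the cache.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)
```

Note that this only helps within one process lifetime; the cache is lost on restart, which is why storing the result (per article or per day) is the next step.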
Follow-up, hopefully more readable this time.
What do you think?
I'm working further on this (just to ensure no parallel work).
Hi Nico, OK, sounds great. I'm not really sure how the Django cache works exactly, but I assume it's pretty clever.
Nevertheless, for 2014 we have about 21,000 articles, and for 2015 about 14,000 so far. I think the per-year query will be pretty common; at least, the front end uses it on initial page load, so it's definitely not an edge case. Caching values at the day level would drop the number of items to aggregate from more than 20,000 to 365. Actually, you could already consider our EPU index score per day (the table `epu_index_epuindexscore`) a cache per day. We could have the `calculate_daily_epu` command add the word counts per day to this table as well, and maybe rename it to `cache_per_day` to make that clear. Indeed, the endpoint would then need to be updated too, to use this table rather than the articles table.
Yeah, I took exactly this approach, except I created new commands/tables instead of reusing the existing ones. I wrote one command to calculate for a specific day, and another for all articles in the database (to be used at launch, or if we ever change the stopwords list or the EPU cutoff).
Seems to work pretty well; I'll work on the view now.
If my previous comment is accepted, it would make sense to store the word counts as JSON. Summing the word counts of different articles could then be done like this:
```python
>>> from collections import Counter
>>> c = Counter({'a': 2, 'hi': 4, 'there': 3})
>>> d = Counter({'a': 9, 'ho': 2, 'there': 2})
>>> c + d
Counter({'a': 11, 'there': 5, 'hi': 4, 'ho': 2})
```
Or maybe postgres has some native way to aggregate json data.
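If the per-day counts end up stored as JSON strings, summing them in Python is a small extension of the Counter example above. This is only a sketch; how the JSON reaches this function (a queryset, a raw query) is left open.

```python
import json
from collections import Counter


def sum_json_word_counts(json_strings):
    """Sum word counts stored as JSON objects of the form {word: count}."""
    total = Counter()
    for s in json_strings:
        # Counter.update() with a mapping adds the counts rather than
        # replacing them, which is exactly the aggregation we need.
        total.update(json.loads(s))
    return total
```

Postgres can also operate on JSON server-side, but with at most 365 rows per year to merge, doing it in Python is likely fast enough.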
My comment crossed yours. I'll see what you come up with ;-)
Should we close this ticket now? Sorry for not following your guidelines, it was already half implemented when your message was posted!
Jep, close it. The honour is yours... :-)
I loaded a local database on my machine with all data from 2000 to 2013 and started a local web service. I redirected the front end to the local endpoints. Everything works fine, but the word cloud renders too slowly (about half a minute). We should cache this data further. For example, the finest resolution at which we will need this data is per day, so we could calculate the most frequent terms for each day at the moment we calculate the index score for the previous day.
@niconoe would you agree?
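The per-day precomputation suggested above could look roughly like this. It's a sketch under the assumption that the day's articles are already cleaned (stopwords stripped); in the real project it would run inside the daily command, right after the index score is calculated, and the result would be stored per day.

```python
from collections import Counter


def top_terms_for_day(cleaned_texts, n=50):
    """Return the n most frequent terms across one day's cleaned articles."""
    counts = Counter()
    for text in cleaned_texts:
        counts.update(text.split())
    return counts.most_common(n)
```

The word cloud endpoint would then only merge at most 365 small precomputed results per year, instead of tokenizing more than 20,000 articles on every request.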