Datafable / epu-index

EPU index
http://www.applieddatamining.com/cms/?q=content/economic-policy-uncertainty-index

Term frequency endpoint is too slow #72

Closed bartaelterman closed 9 years ago

bartaelterman commented 9 years ago

I loaded a local database on my machine with all data from 2000 to 2013 and started a local web service. I redirected the front end to the local endpoints. Everything works fine, but the word cloud renders too slowly (it takes about half a minute). We should cache this data further: the finest resolution at which we will need this data is per day, so we could calculate the most frequent terms for each day at the moment we calculate the index score for the previous day.
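A rough sketch of that idea (the names here are mine, assuming each article exposes its cleaned, stopword-free text):

from collections import Counter

def most_frequent_terms(cleaned_texts, n=50):
    # cleaned_texts: iterable of cleaned, stopword-free texts for one day
    counts = Counter()
    for text in cleaned_texts:
        counts.update(text.split())
    return counts.most_common(n)

# most_frequent_terms(['economy policy', 'policy uncertainty'])
# -> [('policy', 2), ('economy', 1), ('uncertainty', 1)]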

@niconoe would you agree?

niconoe commented 9 years ago

Generally speaking, I prefer to do caching/denormalizing only as a third-stage optimization, after first doing basic profiling (to get an idea of where the bottleneck really is) and then some less invasive optimizations, such as indexes or better use of the ORM to issue fewer SQL queries.

But it's true that in this case the caching doesn't seem too painful, and there's no gross misuse of the ORM at first look. A middle-ground option would be to compute the result of cleaned_text_without_stopwords() at save time and store it in a field, along the lines of the sketch below. But I don't know how much that would help before trying.
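Something like this (the field names are hypothetical and the cleaning logic is a placeholder, assuming a Django Article model):

from django.db import models

class Article(models.Model):
    text = models.TextField()
    # denormalized cache of the expensive cleaning step
    cleaned_text = models.TextField(blank=True)

    def cleaned_text_without_stopwords(self):
        # the real (expensive) cleaning logic would live here
        return self.text  # placeholder

    def save(self, *args, **kwargs):
        self.cleaned_text = self.cleaned_text_without_stopwords()
        super(Article, self).save(*args, **kwargs)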

Do you want to implement this ticket or do you want me to do it?

Cheers!

bartaelterman commented 9 years ago

Good points! I will leave the first- and second-stage optimizations to you, so you can append this ticket to your to-do list.

niconoe commented 9 years ago

Progress: memoization has been applied to cleaned_text_without_stopwords() (a minimal sketch of the idea follows below). On my machine, the performance gain is 4x, which is good but probably not enough; next steps to follow.
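Roughly the idea (a sketch, not the actual patch; _clean and the cache shape are stand-ins):

_CLEANED_CACHE = {}

def _clean(text):
    # stand-in for the real stopword-stripping logic
    return text.lower()

def cleaned_text_without_stopwords(article):
    # memoized per article primary key, so repeated calls are free
    if article.pk not in _CLEANED_CACHE:
        _CLEANED_CACHE[article.pk] = _clean(article.text)
    return _CLEANED_CACHE[article.pk]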

bartaelterman commented 9 years ago

I can do some profiling to see where the bottleneck is.
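For instance, with the standard library's cProfile (a sketch; term_frequencies here is only a stand-in for the slow endpoint logic):

import cProfile
import pstats
from collections import Counter

def term_frequencies():
    # stand-in for the slow endpoint logic
    return Counter('word%d' % (i % 100) for i in range(100000))

cProfile.run('term_frequencies()', 'endpoint.prof')
pstats.Stats('endpoint.prof').sort_stats('cumulative').print_stats(10)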

niconoe commented 9 years ago

Follow-up, hopefully more readable this time.

What do you think?

niconoe commented 9 years ago

I'm working further on this (just to ensure no parallel work).

bartaelterman commented 9 years ago

Hi Nico, OK, sounds great. I'm not really sure how the Django cache works exactly, but I assume it's pretty clever.

Nevertheless, for 2014 we have about 21,000 articles, and for 2015 about 14,000 so far. I think the per-year query will be pretty common; at least, the front end uses it upon initial page load, so it is definitely not an edge case. Caching values at the day level would drop the number of items to aggregate from over 20,000 to 365. Actually, you could consider our EPU index score per day (the table epu_index_epuindexscore) as a cache per day. We could have the calculate_daily_epu command add the word counts per day to this table as well; maybe rename it to cache_per_day to make that clear. The endpoint would then need to be updated to use this table rather than the articles table.
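Roughly like this (the field names are my guesses, not the actual schema; the JSON field is the proposed addition):

from django.db import models

class EpuIndexScore(models.Model):
    # backs the epu_index_epuindexscore table
    date = models.DateField(unique=True)
    epu = models.FloatField()
    # proposed addition: the day's word counts, serialized as JSON text
    word_counts = models.TextField(default='{}')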

niconoe commented 9 years ago

Yeah, I took exactly this approach, except I created new commands/tables instead of reusing the existing ones. I wrote a command to calculate the counts for a specific day, and another for all articles in the database (to be used at launch, or if we ever change the stopwords list or the EPU cutoff).

It seems to work pretty well; I'll work on the view now.
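For what it's worth, a rough sketch of the per-day command's shape (assuming Django 1.8-style management commands; the helpers are stubs, not the actual code):

import json
from collections import Counter
from django.core.management.base import BaseCommand

def day_texts(day):
    # stub: would query the cleaned texts of that day's articles
    return []

def store_counts(day, payload):
    # stub: would write the JSON payload to the per-day cache table
    pass

class Command(BaseCommand):
    help = 'Calculate and cache word counts for one day of articles.'

    def add_arguments(self, parser):
        parser.add_argument('day')  # e.g. 2015-06-01

    def handle(self, *args, **options):
        counts = Counter()
        for text in day_texts(options['day']):
            counts.update(text.split())
        store_counts(options['day'], json.dumps(counts))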

bartaelterman commented 9 years ago

If my previous comment is accepted, it would make sense to store the word counts as JSON. Summing the word counts of different articles could then be done like this:

>>> from collections import Counter
>>> c = Counter({'a': 2, 'hi': 4, 'there': 3})
>>> d = Counter({'a': 9, 'ho': 2, 'there': 2})
>>> c + d

Counter({'a': 11, 'there': 5, 'hi': 4, 'ho': 2})

Or maybe Postgres has some native way to aggregate JSON data.
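And if the counts are stored as JSON text, summing a set of rows stays short (a sketch; the rows here are made up):

import json
from collections import Counter

rows = ['{"a": 2, "hi": 4, "there": 3}', '{"a": 9, "ho": 2, "there": 2}']
total = sum((Counter(json.loads(r)) for r in rows), Counter())
# Counter({'a': 11, 'there': 5, 'hi': 4, 'ho': 2})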

bartaelterman commented 9 years ago

My comment crossed yours. I'll see what you come up with ;-)

niconoe commented 9 years ago

Should we close this ticket now? Sorry for not following your guidelines; it was already half implemented when your message was posted!

bartaelterman commented 9 years ago

Yep, close it. The honour is yours... :-)