digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0

Performance / Analysis page #360

Closed marciofoz closed 5 years ago

marciofoz commented 5 years ago

Hi! Our environment is as follows:

- dmi-tcat running in a Docker container (Fedora)
- dedicated database server (MariaDB 10.1)
- bins: 42
- database size: 53G (in a few months!)
- all dmi-tcat tables are optimized
- 45 million tweets

Problem: when switching from the admin page to the analysis page, it takes a very long time before the page is shown. I don't know which tasks dmi-tcat runs at that moment, but I see the load on my database server increase.

Solution: I had to increase all the timeout options in php.ini and nginx to avoid the 504 gateway error (timeout).

Question: could this behavior be caused by calculating the total number of tweets to display? (In my case: 45.252.040 tweets archived so far, and counting.)

Suggestion: could the implementation of this page be changed so that no pre-processing is done in the database?

Best regards.

marciofoz commented 5 years ago

Hi!
Any news on this issue? Thanks.

dentoir commented 5 years ago

Hi @marciofoz

With regard to the analysis page, all metrics are calculated live there (number of tweets per bin, number of distinct users in the selected date range, etc.). The approaches to speeding this up are 1) improved query design, 2) a faster storage engine (now supported in the latest TCAT), or 3) some kind of caching of metrics. Caching does mean more pre-processing, though, and it is perhaps a bit cosmetic: any specific query (e.g. looking for a text inside your bin) would render your cache useless. A practical suggestion would be to split your bins into multiple smaller bins prior to analysis.
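To make the caching trade-off concrete, here is a minimal sketch in Python (TCAT itself is PHP, and none of this is TCAT code). It assumes a hypothetical `MetricCache` class and a hypothetical `count_tweets` function standing in for a slow `COUNT(*)` over a bin. Results are cached per (bin, start date, end date), so repeating the same selection is cheap, while any change to the selection misses the cache and runs the query again.

```python
import time
from typing import Callable, Dict, Hashable, Tuple


class MetricCache:
    """Hypothetical in-memory cache with a time-to-live, keyed by query parameters."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store: Dict[Hashable, Tuple[float, int]] = {}
        self.misses = 0  # number of times the expensive query actually ran

    def get(self, key: Hashable, compute: Callable[[], int]) -> int:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the database entirely
        self.misses += 1
        value = compute()
        self._store[key] = (now, value)
        return value


# Toy data standing in for one bin's tweet table (assumption, not TCAT's schema).
TWEETS = {"bin_a": ["2019-01-01", "2019-01-02", "2019-02-01"]}


def count_tweets(bin_name: str, start: str, end: str) -> int:
    """Stand-in for a slow COUNT(*) restricted to a date range."""
    return sum(start <= day <= end for day in TWEETS[bin_name])


cache = MetricCache(ttl_seconds=300)

key = ("bin_a", "2019-01-01", "2019-01-31")
first = cache.get(key, lambda: count_tweets(*key))    # runs the query
second = cache.get(key, lambda: count_tweets(*key))   # served from the cache

other_key = ("bin_a", "2019-01-01", "2019-02-28")
third = cache.get(other_key, lambda: count_tweets(*other_key))  # new key, query runs again
```

The sketch also shows the limitation mentioned above: a cache like this only helps when the exact same selection is requested again, and the cached totals can be up to `ttl_seconds` stale while tweets are still being captured.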

Having a budget to hire a web 2.0 developer who can design a responsive Ajax-based UI for TCAT which calculates a bunch of stuff in the background would also be great.

psegovias commented 5 years ago

@dentoir Hi, what do you mean by "2) faster storage engine (now supported in latest TCAT)"? Which storage engine does TCAT now support?

dentoir commented 5 years ago

Fresh installations of TCAT on Debian and Ubuntu now install and enable the TokuDB database engine by default. This significantly boosts query performance, though it is certainly no magic bullet for bins larger than 50G. I think additional improvements will have to be sought in query design (in analysis/index.php and analysis/common/functions.php) and in smart caching.