digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0

Performance problems on a new installation #353

Closed Alotsi closed 5 years ago

Alotsi commented 5 years ago

Hi, I have recently installed TCAT on a new server (16 GB Memory / 320 GB SSD Disk / Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (6 Cores) / Ubuntu 18.04.2 x64) and imported data from another server (12 GB Memory / 1 TB SSD Disk / Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (4 cores) / Ubuntu 14.04.5 LTS). While the installation and the import went smoothly, I am running into problems when I use the analysis module on the new server. For example, the old server produces an associational profile of approx. two million tweets in under 3 minutes. When I tried the same with the imported data on the new server, I waited for over an hour and still didn't get any result.

What do you think could be the problem?

Thank you!

dentoir commented 5 years ago

Hi @Alotsi

Sorry for the late reply. That is strange. The only thing I can think of is that the MySQL configuration may not be optimized. Does a file called /etc/mysql/conf.d/tcat-autoconfigured.cnf exist?

Alotsi commented 5 years ago

Thanks, @dentoir! Yes, we have this file, and it contains the following:

[mysqld]
show_compatibility_56=ON
sql-mode="NO_AUTO_VALUE_ON_ZERO,ALLOW_INVALID_DATES"
key_buffer_size = 5346M
query_cache_limit = 128M
query_cache_size = 2005M
tmp_table_size = 1G
max_heap_table_size = 1G
max_connections = 80

dentoir commented 5 years ago

Looks perfectly normal. How large is the bin you are analyzing? Perhaps you could try to log into your MySQL server and execute the following SQL commands:

OPTIMIZE TABLE binname_tweets, binname_hashtags, binname_mentions, binname_urls, binname_media, binname_places, binname_withheld

Replace 'binname' with the name of your bin. The query may take quite some time on very large bins, but it should make sure your tables are optimized.
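To avoid typos when substituting the bin name, the statement can also be generated. The following is only a minimal sketch: the helper name `optimize_statement`, the example bin name `mybin`, and the name check are illustrative assumptions and not part of TCAT. The table suffixes are the ones from the statement above.

```python
# Table suffixes copied from the OPTIMIZE TABLE statement in this thread.
TABLE_SUFFIXES = ["tweets", "hashtags", "mentions", "urls", "media", "places", "withheld"]


def optimize_statement(bin_name: str) -> str:
    """Build one OPTIMIZE TABLE statement covering all tables of a bin.

    Hypothetical helper, not part of TCAT.
    """
    # Assumption: bin names consist of letters, digits and underscores only.
    if not bin_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected bin name: {bin_name!r}")
    # Backtick-quote each table name so MySQL treats it as an identifier.
    tables = ", ".join(f"`{bin_name}_{suffix}`" for suffix in TABLE_SUFFIXES)
    return f"OPTIMIZE TABLE {tables};"


# 'mybin' is a placeholder; use the name of your own bin.
print(optimize_statement("mybin"))
```

The printed statement can be pasted into the MySQL client after selecting the TCAT database.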

Alotsi commented 5 years ago

Thanks @dentoir, the bin is rather large, around 80M tweets. How long can something like that take?

dentoir commented 5 years ago

Hi @Alotsi

Sorry for the late reply, but I find it difficult to give a generic answer to that question. New installations of TCAT can use a more modern storage engine, which, at least in my experience, improves analysis times across the board.
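To see which storage engine the imported tables actually use, one option is to query MySQL's `information_schema.TABLES`. The sketch below only builds that query. The helper name `engine_query` and the bin name `mybin` are illustrative assumptions, and the table suffixes are the ones from the OPTIMIZE TABLE statement earlier in this thread.

```python
# Table suffixes copied from the OPTIMIZE TABLE statement earlier in this thread.
TABLE_SUFFIXES = ["tweets", "hashtags", "mentions", "urls", "media", "places", "withheld"]


def engine_query(bin_name: str) -> str:
    """Build a query listing the storage engine of each table of a bin.

    Hypothetical helper, not part of TCAT.
    """
    # Assumption: bin names consist of letters, digits and underscores only.
    if not bin_name.replace("_", "").isalnum():
        raise ValueError(f"unexpected bin name: {bin_name!r}")
    names = ", ".join(f"'{bin_name}_{suffix}'" for suffix in TABLE_SUFFIXES)
    # DATABASE() restricts the result to the currently selected schema.
    # TABLE_ROWS can be an estimate, depending on the storage engine.
    return (
        "SELECT TABLE_NAME, ENGINE, TABLE_ROWS "
        "FROM information_schema.TABLES "
        f"WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME IN ({names});"
    )


# 'mybin' is a placeholder; use the name of your own bin.
print(engine_query("mybin"))
```

Running the printed query on both the old and the new server shows whether the two installations store the bin with the same engine.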