cea-hpc / robinhood

Robinhood Policy Engine : a versatile tool to monitor filesystem contents and schedule actions on filesystem entries.
http://robinhood.sf.net
Other
178 stars 62 forks source link

changelog reader ACCT_STAT insert hangs #90

Open mstud opened 6 years ago

mstud commented 6 years ago

Hi, I'm using robinhood 3.1.0 (but this also happened with older versions) and have the problem that with engine = InnoDB the changelog reader always hangs on and INSERT into ACCT_STAT. This is the query: INSERT INTO ACCT_STAT(uid,gid,type,size,blocks, count,sz0,sz1,sz32,sz1K,sz32K,sz1M,sz32M,sz1G,sz32G,sz1T) VALUES (NEW.uid,NEW.gid,NEW.type,NEW.size,NEW.blocks,1,NEW.size=0, NAME_CONST('val',3)=0, NAME_CONST('val',3)=1, NAME_CONST('val',3)=2, NAME_CONST('val',3)=3, NAME_CONST('val',3)=4, NAME_CONST('val',3)=5, NAME_CONST('val',3)=6, NAME_CONST('val',3)=7, NAME_CONST('val',3)>=8) ON DUPLICATE KEY UPDATE size=CAST(size as SIGNED)+CAST(NEW.size as SIGNED),blocks=CAST(blocks as SIGNED)+CAST(NEW.blocks as SIGNED), count=count+1,sz0=sz0+(NEW.size=0), sz1=sz1+( NAME_CONST('val',3)=0), sz32=sz32+( NAME_CONST('val',3)=1), sz1K=sz1K+( NAME_CONST('val',3)=2), sz32K=sz32K+( NAME_CONST('val',3)=3), sz1M=sz1M+( NAME_CONST('val',3)=4), sz32M=sz32M+( NAME_CONST('val',3)=5), sz1G=sz1G+( NAME_CONST('val',3)=6), sz32G=sz32G+( NAME_CONST('val',3)=7), sz1T=sz1T+( NAME_CONST('val',3)>=8) It just does not terminate, even if there is no load at all on the database. We are using mariadb 10.1.30. The funny thing is, that it works perfectly fine when I just do a full scan (--scan --once), but it only happens in the changelog reader (--read-log). Also, this only happens with engine = InnoDB. If I fall back on MyISAM, it works fine as well. But since InnoDB is the recommendation and we are having performance issues anyway, I wanted to try InnoDB.

Any idea what may be causing this? It's not really a new issue, I remember having tried this already a year ago with older MySQL and robinhood versions, same problem. Let me know if you need any more information.

gabrieleiannetti commented 6 years ago

Hi,

how do you know exactly that the insert query on the ACCT_STAT table is hanging?

Did you call e.g. show full processlist;?

If yes, what is the exact output of it, when it hangs? Eventually there are other SQL statements running at the same time? I can't imaging that just a call on the ACCT_STAT table is the cause of the issue.

Hmm, probably there is an issue related to InnoDB since you wrote that with MyISAM there is no performance issue at this.

Can you provide InnoDB specific settings e.g. with show variables like 'innodb%';?

How long did the full scan take for how many entries?

What is the size of the RAM on the database host?

What kind of discs are you using for the database storage?

Regards Gabriele

mstud commented 6 years ago

Hi,

the output was from "show full processlist;", yes. There were no other active queries at the time. I find this strange myself. You can find my innodb settings below, but I also tried it with the MariaDB defaults, to no avail.

Yes, with MyISAM it is working perfectly fine, it's just InnoDB.

Full scan took around 3.76 hours for 21551937 entries. The db host has 128GB of RAM. I am currently using NVMe storage for the database but have also tried a local spinning disk, that doesn't make a difference for my problem though.


+---------------------------------------------+-------------------------+
| Variable_name                               | Value                   |
+---------------------------------------------+-------------------------+
| innodb_adaptive_flushing                    | ON                      |
| innodb_adaptive_flushing_lwm                | 10.000000               |
| innodb_adaptive_hash_index                  | ON                      |
| innodb_adaptive_hash_index_partitions       | 1                       |
| innodb_adaptive_max_sleep_delay             | 150000                  |
| innodb_additional_mem_pool_size             | 8388608                 |
| innodb_api_bk_commit_interval               | 5                       |
| innodb_api_disable_rowlock                  | OFF                     |
| innodb_api_enable_binlog                    | OFF                     |
| innodb_api_enable_mdl                       | OFF                     |
| innodb_api_trx_level                        | 0                       |
| innodb_autoextend_increment                 | 64                      |
| innodb_autoinc_lock_mode                    | 2                       |
| innodb_background_scrub_data_check_interval | 3600                    |
| innodb_background_scrub_data_compressed     | OFF                     |
| innodb_background_scrub_data_interval       | 604800                  |
| innodb_background_scrub_data_uncompressed   | OFF                     |
| innodb_buf_dump_status_frequency            | 0                       |
| innodb_buffer_pool_dump_at_shutdown         | OFF                     |
| innodb_buffer_pool_dump_now                 | OFF                     |
| innodb_buffer_pool_dump_pct                 | 100                     |
| innodb_buffer_pool_filename                 | ib_buffer_pool          |
| innodb_buffer_pool_instances                | 64                      |
| innodb_buffer_pool_load_abort               | OFF                     |
| innodb_buffer_pool_load_at_startup          | OFF                     |
| innodb_buffer_pool_load_now                 | OFF                     |
| innodb_buffer_pool_populate                 | OFF                     |
| innodb_buffer_pool_size                     | 85899345920             |
| innodb_change_buffer_max_size               | 25                      |
| innodb_change_buffering                     | all                     |
| innodb_checksum_algorithm                   | INNODB                  |
| innodb_checksums                            | ON                      |
| innodb_cleaner_lsn_age_factor               | HIGH_CHECKPOINT         |
| innodb_cmp_per_index_enabled                | OFF                     |
| innodb_commit_concurrency                   | 0                       |
| innodb_compression_algorithm                | zlib                    |
| innodb_compression_failure_threshold_pct    | 5                       |
| innodb_compression_level                    | 6                       |
| innodb_compression_pad_pct_max              | 50                      |
| innodb_concurrency_tickets                  | 5000                    |
| innodb_corrupt_table_action                 | assert                  |
| innodb_data_file_path                       | ibdata1:100M:autoextend |
| innodb_data_home_dir                        |                         |
| innodb_default_encryption_key_id            | 1                       |
| innodb_defragment                           | OFF                     |
| innodb_defragment_fill_factor               | 0.900000                |
| innodb_defragment_fill_factor_n_recs        | 20                      |
| innodb_defragment_frequency                 | 40                      |
| innodb_defragment_n_pages                   | 7                       |
| innodb_defragment_stats_accuracy            | 0                       |
| innodb_disable_sort_file_cache              | OFF                     |
| innodb_disallow_writes                      | OFF                     |
| innodb_doublewrite                          | ON                      |
| innodb_empty_free_list_algorithm            | BACKOFF                 |
| innodb_encrypt_log                          | OFF                     |
| innodb_encrypt_tables                       | OFF                     |
| innodb_encryption_rotate_key_age            | 1                       |
| innodb_encryption_rotation_iops             | 100                     |
| innodb_encryption_threads                   | 0                       |
| innodb_fake_changes                         | OFF                     |
| innodb_fast_shutdown                        | 1                       |
| innodb_fatal_semaphore_wait_threshold       | 600                     |
| innodb_file_format                          | Antelope                |
| innodb_file_format_check                    | ON                      |
| innodb_file_format_max                      | Antelope                |
| innodb_file_per_table                       | ON                      |
| innodb_flush_log_at_timeout                 | 1                       |
| innodb_flush_log_at_trx_commit              | 2                       |
| innodb_flush_method                         | O_DIRECT                |
| innodb_flush_neighbors                      | 1                       |
| innodb_flushing_avg_loops                   | 30                      |
| innodb_force_load_corrupted                 | OFF                     |
| innodb_force_primary_key                    | OFF                     |
| innodb_force_recovery                       | 0                       |
| innodb_foreground_preflush                  | EXPONENTIAL_BACKOFF     |
| innodb_ft_aux_table                         |                         |
| innodb_ft_cache_size                        | 8000000                 |
| innodb_ft_enable_diag_print                 | OFF                     |
| innodb_ft_enable_stopword                   | ON                      |
| innodb_ft_max_token_size                    | 84                      |
| innodb_ft_min_token_size                    | 3                       |
| innodb_ft_num_word_optimize                 | 2000                    |
| innodb_ft_result_cache_limit                | 2000000000              |
| innodb_ft_server_stopword_table             |                         |
| innodb_ft_sort_pll_degree                   | 2                       |
| innodb_ft_total_cache_size                  | 640000000               |
| innodb_ft_user_stopword_table               |                         |
| innodb_idle_flush_pct                       | 100                     |
| innodb_immediate_scrub_data_uncompressed    | OFF                     |
| innodb_instrument_semaphores                | OFF                     |
| innodb_io_capacity                          | 100000                  |
| innodb_io_capacity_max                      | 200000                  |
| innodb_kill_idle_transaction                | 0                       |
| innodb_large_prefix                         | OFF                     |
| innodb_lock_schedule_algorithm              | fcfs                    |
| innodb_lock_wait_timeout                    | 600                     |
| innodb_locking_fake_changes                 | ON                      |
| innodb_locks_unsafe_for_binlog              | OFF                     |
| innodb_log_arch_dir                         | ./                      |
| innodb_log_arch_expire_sec                  | 0                       |
| innodb_log_archive                          | OFF                     |
| innodb_log_block_size                       | 512                     |
| innodb_log_buffer_size                      | 268435456               |
| innodb_log_checksum_algorithm               | INNODB                  |
| innodb_log_compressed_pages                 | ON                      |
| innodb_log_file_size                        | 524288000               |
| innodb_log_files_in_group                   | 2                       |
| innodb_log_group_home_dir                   | ./                      |
| innodb_lru_scan_depth                       | 1024                    |
| innodb_max_bitmap_file_size                 | 104857600               |
| innodb_max_changed_pages                    | 1000000                 |
| innodb_max_dirty_pages_pct                  | 15.000000               |
| innodb_max_dirty_pages_pct_lwm              | 0.001000                |
| innodb_max_purge_lag                        | 0                       |
| innodb_max_purge_lag_delay                  | 0                       |
| innodb_mirrored_log_groups                  | 1                       |
| innodb_monitor_disable                      |                         |
| innodb_monitor_enable                       |                         |
| innodb_monitor_reset                        |                         |
| innodb_monitor_reset_all                    |                         |
| innodb_mtflush_threads                      | 8                       |
| innodb_old_blocks_pct                       | 37                      |
| innodb_old_blocks_time                      | 1000                    |
| innodb_online_alter_log_max_size            | 134217728               |
| innodb_open_files                           | 1800                    |
| innodb_optimize_fulltext_only               | OFF                     |
| innodb_page_size                            | 16384                   |
| innodb_prefix_index_cluster_optimization    | OFF                     |
| innodb_print_all_deadlocks                  | OFF                     |
| innodb_purge_batch_size                     | 300                     |
| innodb_purge_threads                        | 1                       |
| innodb_random_read_ahead                    | OFF                     |
| innodb_read_ahead_threshold                 | 56                      |
| innodb_read_io_threads                      | 4                       |
| innodb_read_only                            | OFF                     |
| innodb_replication_delay                    | 0                       |
| innodb_rollback_on_timeout                  | OFF                     |
| innodb_rollback_segments                    | 128                     |
| innodb_sched_priority_cleaner               | 19                      |
| innodb_scrub_log                            | OFF                     |
| innodb_scrub_log_speed                      | 256                     |
| innodb_show_locks_held                      | 10                      |
| innodb_show_verbose_locks                   | 0                       |
| innodb_simulate_comp_failures               | 0                       |
| innodb_sort_buffer_size                     | 1048576                 |
| innodb_spin_wait_delay                      | 6                       |
| innodb_stats_auto_recalc                    | ON                      |
| innodb_stats_include_delete_marked          | OFF                     |
| innodb_stats_method                         | nulls_equal             |
| innodb_stats_modified_counter               | 0                       |
| innodb_stats_on_metadata                    | OFF                     |
| innodb_stats_persistent                     | ON                      |
| innodb_stats_persistent_sample_pages        | 20                      |
| innodb_stats_sample_pages                   | 8                       |
| innodb_stats_traditional                    | ON                      |
| innodb_stats_transient_sample_pages         | 8                       |
| innodb_status_output                        | OFF                     |
| innodb_status_output_locks                  | OFF                     |
| innodb_strict_mode                          | OFF                     |
| innodb_support_xa                           | ON                      |
| innodb_sync_array_size                      | 1                       |
| innodb_sync_spin_loops                      | 30                      |
| innodb_table_locks                          | ON                      |
| innodb_thread_concurrency                   | 40                      |
| innodb_thread_sleep_delay                   | 0                       |
| innodb_tmpdir                               |                         |
| innodb_track_changed_pages                  | OFF                     |
| innodb_undo_directory                       | .                       |
| innodb_undo_logs                            | 128                     |
| innodb_undo_tablespaces                     | 0                       |
| innodb_use_atomic_writes                    | OFF                     |
| innodb_use_fallocate                        | OFF                     |
| innodb_use_global_flush_log_at_trx_commit   | ON                      |
| innodb_use_mtflush                          | OFF                     |
| innodb_use_native_aio                       | ON                      |
| innodb_use_stacktrace                       | OFF                     |
| innodb_use_sys_malloc                       | ON                      |
| innodb_use_trim                             | OFF                     |
| innodb_version                              | 5.6.36-82.2             |
| innodb_write_io_threads                     | 4                       |
+---------------------------------------------+-------------------------+
gabrieleiannetti commented 6 years ago

Hi,

one important note for you I missed in my first post...

If I fall back on MyISAM, it works fine as well. But since InnoDB is the recommendation and we are having performance issues anyway, I wanted to try InnoDB.

If you have performance issues in general with InnoDB it is not getting better, since it has higher hardware requirements and is slower IMHO then MyISAM. What do you exactly mean with 'we are having performance issues anyway'?

Can you figure out if the database server has swapped or the RAM is sufficient when running with InnoDB?

About the full scan you have mentioned... Did you do that with accounting information enabled e.g. is the ACCT_STAT table populated with data after the full scan took place?

What is the state of the query when you call show full processlist and the above mentioned query is hanging?

Since we are using InnoDB in Version 5.5.52 I am not quite familiar with all the parameters you have provided. But as far as I can see, there is not much suspicious looking to me. Anyway I would recommend you to use mostly the default values for the parameters to have a good starting point e.g. setting innodb_compression_algorithm = None (you can test and tune the parameters later). I am not sure about this, but the but the innodb_buffer_pool_size with 85GB size could be the issue.

Not sure, we need more investigation about this here, but the innodb_buffer_pool_size with 85GB size could be the issue. First I would set the set the innodb_buffer_pool_instances to 8 or even 1.

Do you know exactly the disc spaces required for the innodb tables?

You should also check if the Buffer Pool has no free pages left: SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free';

Even better would be to try to understand how your Inno DB Buffer Pool is performing...

Here are few interesting and helpful resources to get started...

After checking the current state I would do a clean full scan of you Lustre filesystem installation with an empty Buffer Pool. My naive assumption would be, that maybe your Buffer Pool is full and gets struggling when even more data is inserted and rows are updated by the processing of Lustre changelogs?

You could also check the logs from Robinhood during full scan and changelog processing regarding speed and 'EntryProcessor Pipeline Stats' which is documented on the Robinhood documentation... probably you can figure out that the database performance is the bottleneck.

Hope that helps!

Gabriele Hi,

one important note for you I missed in my first post...

If I fall back on MyISAM, it works fine as well. But since InnoDB is the recommendation and we are having performance issues anyway, I wanted to try InnoDB.

If you have performance issues in general with InnoDB it is not getting better, since it has higher hardware requirements and is slower IMHO then MyISAM. What do you exactly mean with 'we are having performance issues anyway'?

Can you figure out if the database server has swapped or the RAM is sufficient when running with InnoDB?

About the full scan you have mentioned... Did you do that with accounting information enabled e.g. is the ACCT_STAT table populated with data after the full scan took place?

What is the state of the query when you call show full processlist and the above mentioned query is hanging?

Since we are using InnoDB in Version 5.5.52 I am not quite familiar with all the parameters you have provided. But as far as I can see, there is not much suspicious looking to me. Anyway I would recommend you to use mostly the default values for the parameters to have a good starting point e.g. setting innodb_compression_algorithm = None (you can test and tune the parameters later). I am not sure about this, but the but the innodb_buffer_pool_size with 85GB size could be the issue.

Not sure, we need more investigation about this here, but the innodb_buffer_pool_size with 85GB size could be the issue. First I would set the set the innodb_buffer_pool_instances to 8 or even 1.

Do you know exactly the disc spaces required for the innodb tables?

You should also check if the Buffer Pool has no free pages left: SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_free';

Even better would be to try to understand how your Inno DB Buffer Pool is performing...

Here are few interesting and helpful resources to get started...

After checking the current state I would do a clean full scan of you Lustre filesystem installation with an empty Buffer Pool. My naive assumption would be, that maybe your Buffer Pool is full and gets struggling when even more data is inserted and rows are updated by the processing of Lustre changelogs?

You could also check the logs from Robinhood during full scan and changelog processing regarding speed and 'EntryProcessor Pipeline Stats' which is documented in https://github.com/cea-hpc/robinhood/wiki/pipeline_tuning - probably you can figure out that the database performance is the bottleneck.

Hope that helps!

Gabriele