elastic / curator

Curator: Tending your Elasticsearch indices

name-based age filter is not filtering indices properly #1727

Closed giom-l closed 6 hours ago

giom-l commented 1 month ago

We recently updated our Elastic cluster from 7.17 to 8.15.1 and we also moved from Curator 5.8.4 to 8.0.16.

After a few runs, we spotted an issue that we had not seen before: some indices that do not match the provided timestring are deleted.

Let me add some examples to make this clearer.

Use case 1

Index: index-prod (no timestring at all)

Curator config:

---
actions:
  monthly:
    action: delete_indices
    description: 'Delete old monthly indices'
    options:
      continue_if_exception: false
      ignore_empty_list: true
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m'
        unit: months
        unit_count: 2

Expected result: index-prod should not be touched since it does not match the timestring.

Actual result: index-prod is deleted.

Use case 2

This use case is explained by documentation I had not read carefully enough, as stated here.

I just kept it collapsed so that the following discussion does not become strange with missing references.

Index: `index-prod-2024.09` (timestring with year and month)

Curator config:

```yaml
---
actions:
  monthly:
    action: delete_indices
    description: 'Delete old daily indices'
    options:
      continue_if_exception: false
      ignore_empty_list: true
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m.%d'
        unit: days
        unit_count: 5
```

Expected result: `index-prod-2024.09` should not be touched since it does not match the timestring.

Actual result: `index-prod-2024.09` is deleted.

Logs showing the issue

curator.log

Findings

After digging a little bit, I think I found the culprit (but not really sure about potential side effects). When retrieving an index, indexlist populates self.index_info with zero_values

However, in _get_name_based_ages, we handled the timestring as :

But what happens if the epoch is not an int (e.g. None, because the timestring does not match anything in the index name)? Since the age/name property was initialized to 0, it stays at 0.

And since 0 < point of reference, the index gets deleted.
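
To make the reasoning above concrete, here is a minimal, self-contained sketch. It is not Curator's code: `epoch_from_name`, `select_older`, and the hard-coded regex are hypothetical stand-ins that only mimic the flow described above (zero-initialise the age, overwrite it only when an int epoch was parsed, then compare against the point of reference).

```python
import re
from datetime import datetime, timezone

# Hypothetical stand-ins for the '%Y.%m' timestring; not Curator's own code.
TIMESTRING = "%Y.%m"
TIMESTRING_RE = re.compile(r"\d{4}\.\d{2}")


def epoch_from_name(name):
    """Return the epoch parsed from the index name, or None if nothing matches."""
    match = TIMESTRING_RE.search(name)
    if match is None:
        return None
    parsed = datetime.strptime(match.group(0), TIMESTRING)
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


def select_older(names, point_of_reference):
    """Mimic the described flow: zero-initialise, overwrite only on an int epoch."""
    index_info = {name: {"age": {"name": 0}} for name in names}
    for name in names:
        epoch = epoch_from_name(name)
        if isinstance(epoch, int):
            index_info[name]["age"]["name"] = epoch
    # direction: older -> select indices whose age is below the point of reference
    return [n for n in names if index_info[n]["age"]["name"] < point_of_reference]


cutoff = int(datetime(2024, 8, 1, tzinfo=timezone.utc).timestamp())
selected = select_older(
    ["index-prod", "index-prod-2024.01", "index-prod-2024.10"], cutoff
)
print(selected)
```

In this sketch, index-prod ends up in the selected list next to index-prod-2024.01, purely because its age was never moved away from 0.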

As an aside, I believe this worked in version 5.8.4 because index settings were initialized quite differently there, and the age property was initialized with {}.

Workaround

For the moment, to work around this, we have to add an additional filter that matches the pattern, like

       - filtertype: pattern
         kind: regex
         value: ^index-prod-\d{4}\.\d{2}\.\d{2}$

for the second test.

Potential fix

I'll propose a fix in a PR (with 2 additional tests). The main idea is to discard any index that does not match the timestring by setting its age/name to sys.maxsize.

However, I'm not sure whether it is the best fix or whether something smarter could be done.
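
As a rough illustration of that idea (a sketch under assumptions, not the actual PR): `name_ages` and `older_than` below are hypothetical helpers, and the epochs are small made-up integers chosen only to keep the example readable.

```python
import sys


def name_ages(parsed_epochs):
    """Map each index to its name-based age; names with no int epoch get sys.maxsize."""
    return {
        name: epoch if isinstance(epoch, int) else sys.maxsize
        for name, epoch in parsed_epochs.items()
    }


def older_than(ages, point_of_reference):
    """direction: older -> select indices whose age is below the point of reference."""
    return sorted(name for name, age in ages.items() if age < point_of_reference)


# None stands for "the timestring did not match anything in the name".
ages = name_ages(
    {"index-prod": None, "index-prod-2024.01": 100, "index-prod-2024.10": 300}
)
print(older_than(ages, point_of_reference=200))
```

With the sentinel, index-prod is no longer selected by the "older" comparison. One possible side effect to check in this sketch: a "younger" comparison (age > point of reference) would now select the sentinel value instead, so that direction may need its own handling.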

untergeek commented 1 month ago

This is a known issue. The workaround of adding a second, excluding filter is documented (see the Warning block).

I understand that you were caught off-guard by this, and it deleted something you did not expect. However, I'm not sure I want a code change that will alter the existing behavior.

I'm comfortable with Curator requiring the end user to explicitly exclude things. Making a change that goes from implicit inclusion (timestring-%Y.%m matching timestring-%Y.%m.%d) to implicit exclusion (timestring-%Y.%m not matching timestring-%Y.%m.%d) would be a breaking change after this many years.

Thoughts?

giom-l commented 1 month ago

I tend to disagree with "would be a breaking change after this many years". I backported my 2 additional tests to the 7.x branch, and both of them passed without any modification of the code. So something changed the behavior as of 8.0.

But I understand your point of view and agree with the workaround for use case 2 (as it is documented). Maybe we were relying on a behavior that was never really intended. Since the match is done anywhere in the index name, it may partially match something more specific, and the user has to exclude that explicitly.

Let's forget about usecase 2 (which is documented as you said) and focus only on usecase 1.

Use case 1 still seems to be a bug to me. I asked Curator to match any index whose name contains a specific timestring. If it matches an index that has this pattern (even if it's more specific), that's ok.

But here, the index name does not match the pattern at all (it does not even have a timestring in its name). The workaround is to add another include pattern that matches the same thing as the timestring, which is really redundant. It gives the feeling that the first filter is not doing its job properly, right?

trungcle commented 3 days ago

@untergeek I have a customer who logged support case 01770219 reporting the same issue as above after they upgraded ES from 7.17.x to 8.14.3 and Curator from 5.8 to 8.0.16.

They are using the same action file:

  2:
    action: delete_indices
    description: >-
      Delete indices older than 4 days
    options:
      disable_action: False
      ignore_empty_list: True
      continue_if_exception: True
    filters:
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 4
    - filtertype: pattern
      kind: regex
      exclude: True
      value: '^((os_metrics|atm_fiserv)-.*|healthcheck-2023.09.06|healthcheck-2023.07.03)'

Could this still be a bug that we should fix, or does the customer need to add a second, excluding filter as documented (see the Warning block)? However, given that more than 300 different indices are now being deleted, compared with only around 20 before the upgrade, I am not sure how we could add an excluding filter here.

Otherwise, could you please advise a workaround for action 2 in this case?

Thank you

untergeek commented 3 days ago

Hi, @trungcle

The OP complaint is of matching YYYY.MM.dd indices when requesting YYYY.MM indices. This should not be a surprise as YYYY.MM is part of YYYY.MM.dd. But your addendum does not appear to have that problem based on the example Action ID 2.

Let's look at what you've shared. From this glance, it does not look the same as the OP complaint at all.

    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 4

As this is the first filter, this will match any indices containing a time string of YYYY.MM.dd, which translates into a regular expression of \d{4}\.\d{2}\.\d{2}: 4 digits, a period, 2 digits, a period, and 2 more digits. This is effectively what the source: name age filter does. Once it does a capture group for that pattern, it will extract the date from it, and then use unit: days and unit_count: 4 to calculate the index age, to see if it should be kept or discarded.
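
The steps described above can be sketched as follows. This is an illustrative approximation, not Curator's implementation: `timestring_regex`, `is_older`, and the `DIRECTIVES` table are hypothetical names, and only the %Y, %m, and %d directives are handled.

```python
import re
from datetime import datetime, timedelta, timezone

# Only the directives used in this thread; a hypothetical subset.
DIRECTIVES = {"Y": r"\d{4}", "m": r"\d{2}", "d": r"\d{2}"}


def timestring_regex(timestring):
    """Translate a strftime-style timestring into a regex with one capture group."""
    parts = []
    i = 0
    while i < len(timestring):
        if timestring[i] == "%" and timestring[i + 1:i + 2] in DIRECTIVES:
            parts.append(DIRECTIVES[timestring[i + 1]])
            i += 2
        else:
            parts.append(re.escape(timestring[i]))
            i += 1
    return re.compile("(" + "".join(parts) + ")")


def is_older(name, timestring, days, now):
    """Return True/False for names carrying the timestring, None when it is absent."""
    match = timestring_regex(timestring).search(name)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), timestring).replace(tzinfo=timezone.utc)
    return stamp < now - timedelta(days=days)


now = datetime(2024, 10, 20, tzinfo=timezone.utc)
for name in ("healthcheck-2024.10.15", "healthcheck-2024.10.18", "mainframe_map"):
    print(name, is_older(name, "%Y.%m.%d", 4, now))
```

Returning None for names without the timestring keeps "no date found" distinct from "date found, but not old enough", so the caller has to decide explicitly what to do with such indices.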

That is followed by this remaining filter for Action ID 2:

    - filtertype: pattern
      kind: regex
      exclude: True
      value: '^((os_metrics|atm_fiserv)-.*|healthcheck-2023.09.06|healthcheck-2023.07.03)'

This will exclude anything starting with os_metrics- or atm_fiserv-, and exact matches healthcheck-2023.09.06 and healthcheck-2023.07.03. Any of these would be excluded from the "running list" piped to the second filter from the first filter.
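
A small sketch of how such an excluding regex behaves on a running list, using the value from Action ID 2. The `apply_pattern_filter` helper is hypothetical and only approximates the idea of a pattern filter; the sample names are a mix of names from the log above and made-up ones that the regex should drop.

```python
import re

# Regex value copied from the pattern filter of Action ID 2.
EXCLUDE = r"^((os_metrics|atm_fiserv)-.*|healthcheck-2023.09.06|healthcheck-2023.07.03)"


def apply_pattern_filter(names, pattern, exclude):
    """Keep matching names, or drop them when exclude is True (hypothetical helper)."""
    regex = re.compile(pattern)
    if exclude:
        return [name for name in names if not regex.search(name)]
    return [name for name in names if regex.search(name)]


running_list = [
    "os_metrics-2024.10.15",
    "atm_fiserv-2024.10.15",
    "healthcheck-2023.09.06",
    "healthcheck-2024.10.15",
    "zabbix_os_metrics_sit-2024.10.15",
]
print(apply_pattern_filter(running_list, EXCLUDE, exclude=True))
```

Because the regex is anchored with ^, zabbix_os_metrics_sit-2024.10.15 is not excluded even though it contains os_metrics, which is consistent with that name showing up in the deletion log.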

Without DEBUG logging, I cannot see what was happening. They have logging set to INFO level. With DEBUG logging, I'd be able to see how it's doing each selection. They could even do a --dry-run to show the output in debug mode.

What I can see for the end behavior of Action 2 is this:

2024-10-20 20:54:33,526 INFO      Deleting 334 selected indices: ['agent_status-2065.06', 'naas_service-2024.10', 'mainframe_map', '.ent-search-actastic-document_types-engine_id-slug-unique-constraint', 'watcher_test_index_setb', '.ent-search-actastic-connectors_jobs_v4', 'naas_service-2024.09', 'logstash_fail_events-2024.09', 'logstash_pipeline_monitoring-2024.10.15', '.kibana_task_manager_7.15.0_001', 'logstash_pipeline_monitoring-2024.10.16', 'db_events-2065.06', '.kibana_7.16.1_001', '.ent-search-actastic-users_v5-email-unique-constraint', '.ent-search-actastic-workplace_search_organizations_v10', '.ent-search-esqueues-me_worker_v1', 'app_support_tickets', '.monitoring-es-7-2024.10.16', 'backup_itas', '.ent-search-actastic-oauth_access_grants', '.monitoring-es-7-2024.10.15', 'invensys_data-2024.09', 'logstash_fail_events-2024.10', '.security-tokens-7', 'invensys_data-2024.10', 'mainframe_cdpz-v2-2024.10.16', 'audit_itas', '.ent-search-actastic-crawler_seed_urls', '.ent-search-actastic-workplace_search_accounts_v13-user_oid-unique-constraint', 'mainframe_cdpz-2024.10.16', 'mainframe_cdpz-2024.10.15', '.ent-search-actastic-app_search_search_settings', '.ent-search-esqueues-me_queue_v1_engine_destroyer', '.ent-search-actastic-oauth_access_tokens-token-unique-constraint', 'ichamp_cr_data', '.ent-search-actastic-reindex_jobs', 'zabbix_os_metrics_sit-2024.10.15', 'zabbix_os_metrics_sit-2024.10.16', '.ent-search-esqueues-me_queue_v1_workplace_search', '.ent-search-actastic-togo_migrations_v1', '.fleet-enrollment-api-keys-7', 'apm-7.12.1-span-2024.10.15', 'logstash_fail_events-2045.05', '.watches-reindexed-for-8', '.ent-search-actastic-app_search_crawler_content_metadata', 'mainframe_cdpz-v2-2024.10.15', 'agent_status-2029.09', 'tws.backup.20042022', '.ent-search-actastic-oauth_access_tokens', '.ent-search-actastic-workplace_search_content_source_jobs_v3', '.ent-search-actastic-workplace_search_search_groups_v3', '.ent-search-esqueues-me_queue_v1_seed_sample_engine', 
'concurrent_users-2023', 'agent_status-2029.12', 'concurrent_users-2020', 'datahub_gateway_users-2022', 'datahub_gateway_users-2023', '.ent-search-actastic-oauth_access_grants-token-unique-constraint', 'datahub_gateway_users-2024', 'network_data_temp-2024', 'ml-notifications-uat', '.ent-search-actastic-app_search_document_position_queries_v3', 'atm-2024.09', 'metrics-system.diskio', 'tws_host_mapping', 'mage_logs-2024.10', '.ent-search-actastic-crawler_robots_txts_v2', '.ent-search-esqueues-me_queue_v1_refresh_document_counts', 'im_billboard', 'mage_logs-2024.09', 'apm-7.12.1-transaction-2024.10.15', 'agent_status-2030.12', 'atm-2024.10', 'recon-2024.10', 'bp_branch-2024.10.16', '.kibana_7.12.0_001', '.kibana_task_manager_7.12.0_001', 'ichamp_incident', '.security-6-reindexed-for-8', 'survey_model_itas', 'sonar-alert-configuration', 'zabbix_template_monitor', 'recon-2024.09', 'metrics-index_pattern_placeholder', 'metrics-system.icmp', '.fleet-policies-7', '.kibana_2-reindexed-for-8', 'mainframe_cdpz_wlm-2024.10', 'mips_hourly-cdpz-jem-2021', 'mips_hourly-cdpz-jem-2022', '.ent-search-esqueues-me_queue_v1_mailer', 'logstash_fail_events-2027.06', 'mips_hourly-cdpz-jem-2020', '.ent-search-esqueues-me_queue_v1_failed', 'logstash_fail_events-2027.05', 'nci', 'mips_report_monthly_cdpz-jem-2022', 'watcher_config', '.transform-notifications-000002', '.reindexed-v7-management-beats', 'apm-7.12.1-error-2024.10.15', 'zabbix_os_events-2024.10.15', '.ent-search-actastic-crawler_domains_v2', 'zabbix_os_events-2024.10.16', '.ent-search-actastic-app_search_role_mapping_engines_v2', 'netcool_service-2024.10', 'db2_netcool_data-2023.09', 'bp_branch-2024.10.15', 'logstash_fail_events-2027.11', 'netcool_service-2024.09', 'cs_services-2024.10.16', 'samplecicidwatcher1', 'mainframe_cdpz_wlm-2024.09', 'watcher_test__zabbix_sample', 'app_threshold_config', '.ent-search-actastic-synonyms', 'metrics-system.memory', 'ichamp_data-2024.09', 'omnibus-2024.10.16', 'db2_netcool_data-2024.09', 
'omnibus-2024.10.15', 'network-2024.09', '.ent-search-actastic-telemetry_status_v2', 'watcher_test_zabbix_sample', '.triggered_watches-reindexed-for-8', '.ml-config-reindexed-for-8', 'network-2024.10', '.ent-search-actastic-engines_v12-key-unique-constraint', '.ent-search-actastic-users_v5', '.ent-search-actastic-app_search_roles_v2', 'mips_monthly-cdpz-jem-2021', 'mips_monthly-cdpz-jem-2022', 'mips_monthly-cdpz-jem-2020', 'mips_monthly-cdpz-jem-2023', 'mips_monthly-cdpz-jem-2024', '.monitoring-alerts-7', '.ent-search-actastic-engines_v12', 'application_performance_prod-2022', 'db2_netcool_data-2024.10', 'backup_itas_version_2', 'ichamp_data-2024.10', 'mage_kib_logs-', 'rollup_itas', '.ent-search-actastic-crawler_domains_v2-engine_oid-name-unique-constraint', '.ent-search-actastic-workplace_search_role_mappings_v3', 'logstash_fail_events-2029.06', 'agent_status-2024.10', 'agent_status-2024.12', '.ent-search-actastic-clusters_v2-name-unique-constraint', 'mage_elasticsearch_logs-2024.10.15', 'mage_elasticsearch_logs-2024.10.16', '.ent-search-actastic-users_v5-auth_source-elasticsearch_username-unique-constraint', 'apm-7.12.1-onboarding-2024.10.15', 'agent_status-2024.09', 'im_incident', '.ent-search-actastic-app_search_role_mapping_engines_v2-engine_oid-loco_togo_role_mapping_id-unique-constraint', 'metrics-system.network', '.ent-search-actastic-app_search_crawler_content_url_metadata', '.ent-search-engine-documents-source-60a697c37e7fd9917aeeb2fb', '.monitoring-kibana-7-2024.10.15', 'tws_status-2024.09', '.monitoring-kibana-7-2024.10.16', 'reindexed-v7-websphere_tp_max', '.kibana_3-reindexed-for-8', '.ent-search-actastic-app_search_api_token_engines', 'itsb_events', 'tws_status-2024.10', 'checksum_check-2024.09', '.ent-search-actastic-engine_document_backends_v2', 'hc-demo', 'im_billboard-2023', 'sonar_alerts-2023', 'im_billboard-2024', 'im_billboard-2021', 'im_billboard-2022', 
'.ent-search-actastic-user_external_identities_v1-external_id-service_type-unique-constraint', 'network_ces-2024.10.15', 'network_ces-2024.10.16', '.transform-internal-004', 'metrics-system.cpu', 'sr_survey_reports', '.kibana_7.15.0_001', 'itsb_announcements-2023', 'itsb_announcements-2024', '.ent-search-actastic-index_pointers_v2', '.kibana_task_manager_7.16.1_001', 'jenkin_payload-2020', 'healthcheck-2024.10.15', 'healthcheck-2024.10.16', '.ent-search-actastic-crawler_seed_urls-domain_oid-url-unique-constraint', '.ent-search-actastic-app_search_crawler_content_metadata-content_hash-engine_oid-unique-constraint', '.ent-search-actastic-oauth_access_tokens-refresh_token-unique-constraint', '.ent-search-db-lock-20200304', '.ent-search-actastic-workplace_search_content_sources_v9', '.ent-search-actastic-app_search_accounts_v9-key-unique-constraint', 'filebeat-7.13.3', '.ent-search-actastic-workplace_search_invitations-code-unique-constraint', '.ent-search-actastic-workplace_search_roles', 'data_center-2024', 'watcher', 'server_uptime-2024.10.16', 'logstash_fail_events-', 'server_uptime-2024.10.15', 'itas', '.ent-search-actastic-workplace_search_content_source_identities', 'logstash_fail_events-2028.05', '.kibana-observability-ai-assistant-conversations-000001', 'agent_status-2025.12', 'logstash_fail_events-2028.04', 'metrics-system.process', 'agent_status-2058.11', 'agent_status-2025.09', '.ent-search-actastic-workplace_search_group_assignments', 'osmetricsrlp', 'logstash_fail_events-jem-2024.10', 'assets-2024.10.15', 'im_billboard-null', '.ent-search-actastic-app_search_api_tokens_v3', 'assets-2024.10.16', 'os_events-2027.05', '.ent-search-actastic-crawler_crawl_requests_v4', 'logstash_fail_events-jem-2024.09', 'tws_job_summary-2024.10', 'problem_tickets-2024.10', 'watcher_test_index1', '.ent-search-actastic-document_types', '.ent-search-engine-documents-mage-appsearch', 'agent_status-2033.12', 'command_bridge-2024', 'tws_job_summary-2024.09', 'network_data-2021', 
'network_data-2020', 'agent_status-2080.12', '.ent-search-actastic-app_search_api_tokens_v3-authentication_token-unique-constraint', 'mips_hourly-cdpz-2020', 'network_data-2023', 'network_data-2022', 'apm-7.12.1-metric-2024.10.15', 'network_data-2024', 'command_bridge-2023', 'command_bridge-2021', 'command_bridge-2020', 'netcool_alert-2020', '.ent-search-actastic-oauth_applications', 'netcool_alert-2021', 'healthcheck_reports-2024.09', 'netcool_alert-2023', 'netcool_alert-2024', '.monitoring-logstash-7-2024.10.16', '.monitoring-logstash-7-2024.10.15', 'ilm-history-1-000010', '.kibana_task_manager_7.17.4_001', '.ent-search-actastic-secret_keeper_secrets', '.ent-search-actastic-crawler_crawl_rules', '.tasks-reindexed-for-8', 'problem_tickets-2024.09', '.kibana-observability-ai-assistant-kb-000001', 'logstash_fail_events-2033.05', 'healthcheck_reports-2024.10', 'reindexed-v7-sq-events', 'mage_application_logs-2024.10', 'sonar_agent_utilization-2024.10', 'watcher_test_index_sample', 'monindextest', 'incidents_soe-2021', 'heartbeat-7.17.4', '.ent-search-actastic-app_search_role_engines_v2', 'shipper_agent_events-2024.10', 'cics_state-2024.10', '.ent-search-actastic-elasticsearch_indices', 'euxi-2024-07', 'osmetricsrup', 'euxi-2024-09', 'os_events-2035.09', 'cloud_metrics-2024.10', 'cics_state-2024.09', 'checksum_check-2024.10', '.ent-search-actastic-engines_v12-account_id-loco_moco_account_id-slug-unique-constraint', 'incidents_soe-2023', 'incidents_soe-2022', 'incidents_soe-2024', 'shipper_agent_events-2024.09', '.ent-search-actastic-workplace_search_invitations', 'mage_application_logs-2024.09', 'cloud_metrics-2024.09', 'sonar_agent_utilization-2024.09', '.kibana_7.13.3_001', '.ent-search-actastic-app_search_invitations_v3', 'logstash_fail_events-2065.06', '.ent-search-actastic-clusters_v2', 'grafana_user-2024', 'euxi-2024.10', 'grafana_user-2023', 'cloud_events-2024.10.15', 'grafana_user-2022', 'db_events-2024.10', 'grafana_user-2021', 'grafana_user-2020', 
'trigger_netcool_alerts-2024.10', 'zabbix_db_events-2024.10', '.ent-search-actastic-app_search_accounts_v9', '.kibana_task_manager_7.13.3_001', '.ent-search-actastic-workplace_search_pre_content_sources-context-workplace_search_account_id-service_type-unique-constraint', 'db_events-2024.09', 'cloud_events-2024.10.16', '.ent-search-actastic-app_search_document_positions', '.ent-search-actastic-workplace_search_roles-workplace_search_account_id-workplace_search_organization_id-unique-constraint', 'trigger_netcool_alerts-2024.09', '.ent-search-actastic-user_external_identities_v1', 'process_metrics-2024.09', '.ent-search-esqueues-me_queue_v1_connectors', '.ent-search-actastic-workplace_search_roles-workplace_search_account_id-unique-constraint', 'euxi-2024.06', 'euxi-2024.07', 'euxi-2024.08', 'euxi-2024.09', '.slm-history-1-000010', '.ent-search-actastic-workplace_search_pre_content_sources', '.kibana_1-reindexed-for-8', 'reindexed-v7-chatbot', 'euxi-2024.02', 'filebeat-7.17.9', 'euxi-2024.03', 'euxi-2024.04', '.kibana_7.17.4_001', 'euxi-2024.05', 'process_metrics-2024.10', 'agent_status-2033.09', '.ent-search-actastic-oauth_applications-uid-unique-constraint', 'watcher_test_index', 'metrics-system.filesystem', '.ent-search-actastic-app_search_role_mappings_v2', '.ent-search-actastic-workplace_search_accounts_v13', 'agent_status-2033.05']

So, two things are immediately apparent in this. One filter seems to have done exactly what it was supposed to do, and the other did not.

  1. The age filter should limit itself to indices with a YYYY.MM.dd regex pattern. It does not appear to have executed at all. There are plenty of non-matching indices.
  2. The given pattern filter should exclude indices starting with os_metrics- or atm_fiserv-, or the two named healthcheck indices. This appears to be working as none of those names is in the list.

Something is very, very wrong if a filter did not filter. I do not wish to idly accuse, but without the full DEBUG output, we have to accept their word that the action is identical, and I find myself rather skeptical of that. No release of Curator goes out without going through rigorous, multi-python-version (3.8 - 3.12) testing across a battery of unit and full integration tests. Feel free to inspect the unit and integration tests that do date calculations from the index name, and what they're verifying.

Long story short:

Without the debug logs, it's impossible to know what actually happened. Those logs reflect a ton of lines verifying that we're actually acting on indices and not aliases with no other indications of what it's doing as far as which filters are being run.

If I get time, I will actually run a special unit test with the original index list at the beginning of the run using their own filters. Since it's all name-based, I won't need an Elasticsearch instance. I would do this to satisfy my own curiosity about what's happening. I won't be able to even think about that until tomorrow afternoon at the earliest.

trungcle commented 3 days ago

Thanks heaps @untergeek for your quick reply on this matter.

As you can see from the end behaviour of Action 2 in the Curator log (es_curator_8_delete_indexes_20102024.log), the deleted indices include the ones below, which have no timestring pattern in their names (same as use case 1):

ml-notifications-uat
metrics-system.diskio
im_billboard
ichamp_incident
metrics-system.icmp
zabbix_template_monitor

That makes me think this issue is similar to "use case 1" from the OP.

Given that those indices have already been deleted, I am not sure whether enabling DEBUG now will help us investigate, but I will try to get it done asap for review.

Best regards,

giom-l commented 2 days ago

Hello,

Maybe I shouldn't have mixed the 2 usecases together. Should we close this issue and reopen a new one with only the usecase 1 ?

In the meantime, I was able to produce a debug log showing the issue.

Here is the actions file used

---
actions:
  monthly:
    action: delete_indices
    description: 'Delete old monthly indices'
    options:
      continue_if_exception: false
      ignore_empty_list: true
    filters:
      - filtertype: age
        source: name
        direction: older
        timestring: '%Y.%m'
        unit: months
        unit_count: 2

And here is curator.log

As you can see, some of the indices remaining at the end would be deleted even though they don't have a timestring in their names at all.

2024-10-23 08:41:13,850 INFO       curator.helpers.utils           show_dry_run:80   DRY-RUN: delete_indices: app-didos_deployments-qa with arguments: {}
2024-10-23 08:41:13,850 INFO       curator.helpers.utils           show_dry_run:80   DRY-RUN: delete_indices: app-didos_dqc_sales_by_month-qa with arguments: {}
2024-10-23 08:41:13,850 INFO       curator.helpers.utils           show_dry_run:80   DRY-RUN: delete_indices: app-didos_dqc_sales_delta_integrity-qa with arguments: {}
2024-10-23 08:41:13,850 INFO       curator.helpers.utils           show_dry_run:80   DRY-RUN: delete_indices: app-didos_dqc_store-qa with arguments: {}
2024-10-23 08:41:13,850 INFO       curator.helpers.utils           show_dry_run:80   DRY-RUN: delete_indices: app-didos_mediator_core_monitoring-qa with arguments: {}

To work around this, we had to add a filter that forces pattern matching on the index name:

      # Enforce the naming to match the timestring provided
      - filtertype: pattern
        kind: regex
        value: ^app-didos.+(qa)-\d{4}\.\d{2}$

untergeek commented 2 days ago

This is becoming more clear, then. It appears that the name-based age filter is not filtering indices properly.

@giom-l, feel free to just edit/amend your initial post to describe what's going on, and even edit the title accordingly. I am booked out through the rest of my morning, but I can tackle this after that, hopefully.

giom-l commented 2 days ago

Updated. I kept use case 2 (collapsed, though) since we discussed it and I prefer not to leave missing references. As for the findings in the OP, I think they remain true, since this really seems to be related to the initial population of self.index_info with zero values.