cloud-gov / logsearch-boshrelease

A BOSH-scalable ELK release
http://logsearch.io
Apache License 2.0

Investigate solutions for managing custom field limit in Elasticsearch #161

Closed markdboyd closed 8 months ago

markdboyd commented 1 year ago

Background

We have a problem with multi-tenancy of customer logs within the same Elasticsearch index. Namely, when customers send JSON logs to the index with custom fields, each new log field counts against the 2000 total fields allowed in the index (this limit is configurable).
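
For reference, the limit in question is the per-index setting index.mapping.total_fields.limit. A minimal sketch of checking and raising it on one index (the index name is illustrative, and $ESHOST is a placeholder for the Elasticsearch host):

# check the current field limit on an index (illustrative index name)
curl -w "\n" $ESHOST/logs-app-2024.02.27/_settings/index.mapping.total_fields.limit
# raise the limit to 2000 for that index; this only postpones the problem, it doesn't solve multi-tenancy
curl -w "\n" -X PUT -H "content-type:application/json" --data '{"index.mapping.total_fields.limit": 2000}' $ESHOST/logs-app-2024.02.27/_settings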

So if customer A writes a bunch of JSON logs with custom fields early in the day, then all 2000 fields in the index may be used up. Then, when customer B tries to index logs later in the day with custom fields, those logs will be rejected and not show up in Elasticsearch/Kibana, which is what is happening to one of our customers now:

https://gsa-tts.slack.com/archives/C09CR1Q9Z/p1695238740977409

The only options to get around this seem to be:

  1. Disable indexing of custom fields under a single field, maybe named custom, and have customers nest all of their custom fields in an object under it. This approach wouldn't allow keyword searching on those custom fields, though, which our current docs say we support: https://cloud.gov/docs/deployment/logs/#structured-logging
  2. Store documents into a separate index per org per day (using the @cf.org field in the index name). With this approach, the custom field count is specific to each customer, which is far more flexible and unlikely to be exceeded on an individual org basis.

Option 2 is by far the better solution if we can make it work. There are some logs that have no @cf.org, so perhaps they can continue to go to an index named just by the datestamp.

Resources

Acceptance criteria

markdboyd commented 1 year ago

One possible solution to this problem is to use a flattened field type and have customers nest any custom fields they want indexed from a JSON log under a key of custom, like so:

{"custom": {"foo": "bar"}}

You can then search for these logs in Kibana using the syntax custom.foo: "bar".
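
For illustration, a rough sketch of the mapping change this would require, applied directly to a single index (the index name is an example; in practice the change would need to live in our index templates rather than be applied per index):

# map a top-level "custom" field as flattened on one index (illustrative only)
curl -w "\n" -X PUT -H "content-type:application/json" --data '{"properties": {"custom": {"type": "flattened"}}}' $ESHOST/logs-app-2024.02.27/_mapping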

One problem is that it's difficult to get logging libraries to nest all of their output under a key like custom.

Also, from my testing, something in our Elasticsearch/Logstash configuration is putting the properties under app.custom.*, not custom, so the flattened field type isn't being used.

markdboyd commented 1 year ago

In general, the approach of adding a custom index per org per day has the drawback of increasing the number of shards. And indexes/shards come with performance overhead, as described here: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/size-your-shards.html

However, that page also suggests a number of strategies we could take using Index Lifecycle Management (ILM) policies or Curator to mitigate the impact of more indices:

Or we could do some combination of all of the above.
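
As one concrete sketch of the ILM route (the policy name, ages, and shard count below are placeholders, not values we have agreed on): a policy could shrink older indices to a single shard and delete them at the end of retention, which keeps the shard count down even with more indices.

# illustrative ILM policy: shrink to one shard after a week, delete after 180 days
curl -w "\n" -X PUT -H "content-type:application/json" --data '{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": { "shrink": { "number_of_shards": 1 } }
      },
      "delete": {
        "min_age": "180d",
        "actions": { "delete": {} }
      }
    }
  }
}' $ESHOST/_ilm/policy/logs-app-retention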

Another idea: we could change our indexing strategy from one index per org per day to one index per org per week, which would reduce the number of additional indices we would be adding, and thus the number of shards.

markdboyd commented 1 year ago

It turns out the flattened field type seems not to be natively supported in Elasticsearch 7.9.3: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/flattened.html. And I ran into issues in development when I tried to change the app field, under which custom JSON logs are expanded, to the flattened field type:

https://github.com/cloud-gov/logsearch-for-cloudfoundry/pull/115

In the October 2 engineering huddle, we discussed whether an indexing strategy of one index per org per day would be a good idea, since it would avoid this problem of field explosion and customers affecting each other: https://docs.google.com/document/d/1OivYiPsQdjcCuqg3sxHcLCD4ajdP-3c8K-M6hXHnXbU/edit#heading=h.2mrfeny32y80.

We decided that while the OpenSearch logs system is in a pre-prod state (ingesting prod logs, but not yet serving as the system customers use to access their logs), we can evaluate whether the performance is problematic and needs mitigation.

markdboyd commented 1 year ago

Ultimately, we have limited or no ability to address this issue in Elasticsearch, given that we are stuck on an older version that doesn't support flattened fields and we aren't going to change our indexing strategy because of the possible performance implications.

We may implement both suggestions in a forthcoming BOSH release for OpenSearch.

jameshochadel commented 8 months ago

After some experimenting, we have determined that it makes sense to implement this in OpenSearch instead of migrating our Elasticsearch indices.

The flattened data type is supported in our running version of Elasticsearch, 7.9.3. The previous deployment rollout failed for unrelated reasons. As of e3d834e, the app field is mapped to type flattened in the index-mappings-app-lfc Component Template in our dev environment. I have successfully generated new indices using the template.
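
For context, the mapping change amounts to roughly the following (a sketch only; the exact template body in e3d834e may differ):

# the app field mapped as flattened in the component template (approximate sketch)
curl -w "\n" -X PUT -H "content-type:application/json" --data '{
  "template": {
    "mappings": {
      "properties": {
        "app": { "type": "flattened" }
      }
    }
  }
}' $ESHOST/_component_template/index-mappings-app-lfc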

However, once an index is created from the template, Kibana starts throwing errors. Landing on the homepage shows shard errors with underlying error Field [app._keyed] of type [flattened] does not support custom formats. The request includes fields that were dynamically mapped in older indices:

"docvalue_fields": [
  {
    "field": "app.@timestamp",
    "format": "date_time"
  },
  ...

Kibana makes queries to ES to populate UI elements like the list of available fields on the left-hand side of the Discover page. I believe these queries are failing because Kibana queries all indices, starting with the older, dynamically mapped indices, and ending with the newer index with the flattened app field. It expects newer indices to contain custom-formatted fields like app.@timestamp, and fails when the type is different.
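
One way to confirm this kind of mapping conflict across old and new indices is the field capabilities API (the field wildcard here is just an example); when the same field has different types in different indices, the response lists the conflicting types and the indices that use each:

# show how app and its subfields are mapped across all logs-app-* indices
curl -w "\n" "$ESHOST/logs-app-*/_field_caps?fields=app*"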

Additionally, running a query against an app subfield throws TypeError: Cannot read properties of undefined (reading 'timed_out').

This suggests that we must reindex all previously-created indices using the new index templates. I migrated one index in dev to estimate how long this would take for prod. The process was not disk or network bound. CPU utilization was in the 60-75% range; MemFree averaged 260 MiB and MemAvailable averaged 1.4 GiB (of 8 GiB total), so one reindex did not saturate CPU or memory of the three data nodes in dev. To estimate the rate of indexing:

2 hr for 3 t3.large machines, each with 2 vCPU, to reindex 9 GB of data = 9 GB / (2 hr * 6 vCPU) = 0.75 GB/vCPU-hr (using burst credits)

Scaling to the prod load and vCPU count:

180 days of retention, 511 GB average index size = 91,980 GB total
11 r5.2xlarge nodes, each with 8 vCPU: 11 * 8 = 88 vCPU
91,980 GB / 88 vCPU = ~1,045 GB/vCPU; 1,045 GB/vCPU / 0.75 GB/vCPU-hr = ~1,393 hrs
~1,393 hrs / 24 hrs/day = ~60 days to reindex prod

The actual time might be shorter since our estimate is based on ~67% CPU utilization, but other factors like system updates might cause delay, so it's a good ballpark.

While a 60-day job is not inherently prohibitive, we are hoping to launch OpenSearch in a few months, so our customers would see only a few extra months' benefit in exchange for not-insignificant overhead on our end.

For posterity, if we were to reindex, the process would be:

  1. Reindex each index to a new name with a pattern Kibana will not query, like reindex-logs-app-*. This is the 60-day operation. (Indexes cannot be reindexed in-place, and in this case we would not want them to be, due to the different-field-types problem described above.)
  2. During scheduled downtime:
    1. Pre-generate the next day's index so new documents start getting indexed with the new mapping.
    2. Wait until after midnight so all prior indices no longer receive new documents.
    3. Make sure the last day's index is reindexed.
    4. Delete all original indices.
    5. Clone each copied index to its respective original name. (Cloning is a fast operation.)
    6. Delete the copied indices.

The curl commands I used to test:

# reindex - see data.json below
curl -w "\n" -X POST --data "@data.json" -H "content-type:application/json" $ESHOST/_reindex
# delete the original
curl -w "\n" -X DELETE $ESHOST/logs-app-2024.02.27
# "rename" the copy
curl -w "\n" -X POST $ESHOST/logs-app-2024.02.27-temp/_clone/logs-app-2024.02.27 # untested
# delete the copy
curl -w "\n" -X DELETE $ESHOST/logs-app-2024.02.27-temp

Contents of data.json:

{
  "source": {
    "index": "logs-app-2024.02.27"
  },
  "dest": {
    "index": "logs-app-2024.02.27-temp"
  }
}

Note that I tested this using the same index pattern, so the new index was created automatically using the updated index template. To migrate as described above, we would need a new index template for the new pattern (like reindex-logs-app-*).
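
A sketch of what that template could look like (the template name and priority are assumptions, and the composed_of list would need to match whatever the updated logs-app template uses):

# illustrative index template so reindex-logs-app-* indices get the updated (flattened) app mapping
curl -w "\n" -X PUT -H "content-type:application/json" --data '{
  "index_patterns": ["reindex-logs-app-*"],
  "priority": 100,
  "composed_of": ["index-mappings-app-lfc"]
}' $ESHOST/_index_template/reindex-logs-app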