One possible solution to this problem is to use a `flattened` field type and have customers nest any custom fields they want indexed from a JSON log under a key of `custom`, like so:

```json
{"custom": {"foo": "bar"}}
```

You can then search for these logs in Kibana using the syntax `custom.foo: "bar"`.
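As a rough sketch (the template name and index pattern here are illustrative, not our actual configuration), the mapping for that approach could look something like:

```sh
# Illustrative composable index template that maps a top-level "custom" key as
# a single flattened field, so arbitrary customer JSON nested under it does not
# add new fields to the index mapping.
curl -X PUT "$ESHOST/_index_template/logs-app-custom-example" \
  -H 'content-type: application/json' \
  -d '{
    "index_patterns": ["logs-app-*"],
    "template": {
      "mappings": {
        "properties": {
          "custom": { "type": "flattened" }
        }
      }
    }
  }'
```

With a mapping like that, `custom.foo: "bar"` stays searchable in Kibana while everything under `custom` counts as a single mapped field.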
One problem is that it's difficult to get logging libraries to nest all of their output under a key like `custom`. Also, from my testing, something in our Elasticsearch/Logstash configuration is putting the properties under `app.custom.*`, not `custom`, so the `flattened` field type isn't being used.
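For what it's worth, one way to confirm where those properties are landing is to ask for the field mappings directly (the index pattern below is an assumption based on our index names):

```sh
# Show how fields under app.custom.* are actually mapped across indices.
curl "$ESHOST/logs-app-*/_mapping/field/app.custom.*?pretty"
```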
In general, the approach of adding an index per org per day has the drawback of increasing the number of shards, and indices/shards come with performance overhead, as described here: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/size-your-shards.html
However, that page also suggests a number of strategies we could take, using Index Lifecycle Management (ILM) policies or Curator, to mitigate the impact of more indices, or we could do some combination of them.
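As one example of the ILM direction (the policy name and values here are illustrative, not something we've decided on), a simple age-based delete policy would keep the larger index count from accumulating indefinitely:

```sh
# Illustrative ILM policy: delete indices once they reach a given age so the
# extra per-org indices (and their shards) don't pile up. The policy name and
# the 180d value are examples only.
curl -X PUT "$ESHOST/_ilm/policy/logs-app-retention-example" \
  -H 'content-type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "delete": {
          "min_age": "180d",
          "actions": { "delete": {} }
        }
      }
    }
  }'
```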
Another idea: we could change our indexing strategy from one index per org per day to one index per org per week, which would reduce the number of additional indices we are adding, and thus the number of shards.
It turns out the `flattened` field type seems not to be natively supported in Elasticsearch 7.9.3: https://www.elastic.co/guide/en/elasticsearch/reference/7.9/flattened.html. And I experienced issues in development when I tried to make the `app` field, where custom JSON logs are expanded, into a field type of `flattened`:
https://github.com/cloud-gov/logsearch-for-cloudfoundry/pull/115
In the October 2 engineering huddle, we discussed whether an indexing strategy of one index per org per day would be a good idea, since it would avoid this problem of field explosion and customers affecting each other: https://docs.google.com/document/d/1OivYiPsQdjcCuqg3sxHcLCD4ajdP-3c8K-M6hXHnXbU/edit#heading=h.2mrfeny32y80.
We decided that while logs on OpenSearch is in a pre-prod state, where it is ingesting prod logs but is not yet the system customers use to access logs, we can evaluate whether the performance is problematic and needs mitigation.
Ultimately, we have limited or no ability to address this issue in Elasticsearch, given that we are stuck on an older version that doesn't support `flattened` fields, and we aren't going to change our indexing strategy given the possible performance implications.
We plan to possibly implement both suggestions in a forthcoming BOSH release for OpenSearch.
After some experimenting, we have determined that it makes sense to implement this in OpenSearch instead of migrating our Elasticsearch indices.
The `flattened` data type is supported in our running version of Elasticsearch, 7.9.3. The previous deployment rollout failed for unrelated reasons. As of e3d834e, the `app` field is mapped to type `flattened` in the `index-mappings-app-lfc` component template in our dev environment. I have successfully generated new indices using the template.
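For reference, the relevant piece of that change looks roughly like the following (the component template name is from above; the body is a simplified sketch, not the exact contents of e3d834e):

```sh
# Simplified sketch of mapping the "app" field as type flattened in the
# index-mappings-app-lfc component template.
curl -X PUT "$ESHOST/_component_template/index-mappings-app-lfc" \
  -H 'content-type: application/json' \
  -d '{
    "template": {
      "mappings": {
        "properties": {
          "app": { "type": "flattened" }
        }
      }
    }
  }'
```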
However, once an index is created from the template, Kibana starts throwing errors. Landing on the homepage shows shard errors with underlying error `Field [app._keyed] of type [flattened] does not support custom formats`. The request includes fields that were dynamically mapped in older indices:
"docvalue_fields": [
{
"field": "app.@timestamp",
"format": "date_time"
},
...
Kibana makes queries to ES to populate UI elements like the list of available fields on the left-hand side of the Discover page. I believe these queries are failing because Kibana queries all indices, starting with the older, dynamically mapped indices and ending with the newer index with the flattened `app` field. It expects newer indices to contain custom-formatted fields like `app.@timestamp`, and fails when the type is different.
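If that explanation is right, the error should also be reproducible by querying the new index directly with the same `docvalue_fields` shape (the index name below is hypothetical):

```sh
# Hypothetical reproduction: requesting a custom date format on a subfield of
# the flattened "app" field should fail with
# "Field [app._keyed] of type [flattened] does not support custom formats".
curl -X POST "$ESHOST/logs-app-2024.03.01/_search" \
  -H 'content-type: application/json' \
  -d '{
    "size": 1,
    "docvalue_fields": [
      { "field": "app.@timestamp", "format": "date_time" }
    ]
  }'
```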
Additionally, running a query against an `app` subfield throws `TypeError: Cannot read properties of undefined (reading 'timed_out')`.
This suggests that we must reindex all previously-created indices using the new index templates. I migrated one index in dev to estimate how long this would take for prod. The process was not disk or network bound. CPU utilization was in the 60-75% range; MemFree averaged 260 MiB and MemAvailable averaged 1.4 GiB (of 8 GiB total), so one reindex did not saturate CPU or memory of the three data nodes in dev. To estimate the rate of indexing:
2 hr for 3 t3.large machines, each with 2 vCPU, to reindex 9 GB of data = 9 GB / (2 hr * 6 vCPU) = 0.75 GB/vCPU-hr (using burst credits).
Scaling to the prod load and vCPU count:

- 180 days of retention × 511 GB avg index size = 91,980 GB total
- 11 r5.2xlarge data nodes, each with 8 vCPU: 11 × 8 = 88 vCPU
- 91,980 GB / 88 vCPU ≈ 1,045 GB/vCPU; 1,045 GB/vCPU ÷ 0.75 GB/vCPU-hr ≈ 1,393 hrs
- 1,393 hrs / 24 hrs/day ≈ 58 days, so roughly 60 days to reindex prod
The actual time might be shorter, since our estimate is based on ~67% CPU utilization, but other factors like system updates might cause delays, so it's a reasonable ballpark.
While a 60-day job is not inherently prohibitive, we are hoping to launch OpenSearch in a few months, so our customers would see only a few extra months of benefit for not-insignificant overhead on our end.
For posterity, if we were to reindex, the process would be:

1. Reindex each existing index into a new index under a new pattern like `reindex-logs-app-*`. This is the 60-day operation. (Indexes cannot be reindexed in-place, and in this case we would not want them to be, due to the different-field-types problem described above.)
2. Delete the original index.
3. "Rename" the copy by cloning it back to the original index name.
4. Delete the copy.

The `curl` commands I used to test:
```sh
# reindex - see data.json below
curl -w "\n" -X POST --data "@data.json" -H "content-type:application/json" $ESHOST/_reindex
# delete the original
curl -w "\n" -X DELETE $ESHOST/logs-app-2024.02.27
# "rename" the copy
curl -w "\n" -X POST $ESHOST/logs-app-2024.02.27-temp/_clone/logs-app-2024.02.27 # untested
# delete the copy
curl -w "\n" -X DELETE $ESHOST/logs-app-2024.02.27-temp
```
Contents of `data.json`:

```json
{
  "source": {
    "index": "logs-app-2024.02.27"
  },
  "dest": {
    "index": "logs-app-2024.02.27-temp"
  }
}
```
Note that I tested this using the same index pattern, so the new index was created automatically using the updated index template. To migrate as described above, we would need a new index template for the new pattern (like `reindex-logs-app-*`).
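A sketch of what that new template could look like, reusing the existing component template (the index pattern comes from above; the template name and the rest are illustrative):

```sh
# Illustrative composable index template so indices matching reindex-logs-app-*
# pick up the flattened "app" mapping from the existing component template.
curl -X PUT "$ESHOST/_index_template/reindex-logs-app" \
  -H 'content-type: application/json' \
  -d '{
    "index_patterns": ["reindex-logs-app-*"],
    "composed_of": ["index-mappings-app-lfc"]
  }'
```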
Background
We have a problem with multi-tenancy of customer logs within the same Elasticsearch index. Namely, when people send JSON logs to the index with custom fields, those new log fields use up some number of the 2000 total fields allowed in the index (this limit is configurable).

So if customer A writes a bunch of JSON logs with custom fields early in the day, then all 2000 fields in the index may be used up. Then, when customer B tries to index logs later in the day with custom fields, those logs will be rejected and will not show up in Elasticsearch/Kibana, which is what is happening to one of our customers now:
https://gsa-tts.slack.com/archives/C09CR1Q9Z/p1695238740977409
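For context, the limit in question is the `index.mapping.total_fields.limit` index setting, which can be raised per index, though that only postpones the problem (the index name and value below are examples):

```sh
# Raise the cap on mapped fields for a single index. This buys headroom but
# does not fix the underlying multi-tenancy problem.
curl -X PUT "$ESHOST/logs-app-2024.02.27/_settings" \
  -H 'content-type: application/json' \
  -d '{ "index.mapping.total_fields.limit": 4000 }'
```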
The only options to get around this seem to be:

1. Raise the per-index field limit (it is configurable, as noted above).
2. Create one index per org per day (including the `@cf.org` field in the index name). With this approach, the custom field count is specific to each customer, which is far more flexible and unlikely to be exceeded on an individual org basis.

Option 2 is by far the better solution if we can make it work. There are some logs that have no `@cf.org`, so perhaps those can continue to go to an index named just by the datestamp.
Resources
Acceptance criteria