elastic / rally-tracks

Track specifications for the Elasticsearch benchmarking tool Rally

sparse fields track #19

Open LucaWintergerst opened 7 years ago

LucaWintergerst commented 7 years ago

We should consider adding a track with very sparse fields, comparing how doc_values behave over time. This is particularly interesting once we move to lucene 7.

[Screenshot: monitoring graph of the indexing runs, 2017-04-20 1:08 PM]

The run from 14:30-16:30 was with `doc_values: false`, the one from 16:30-17:30 with `doc_values: true`, and the very last run had doc_values disabled for all fields.

The data has around 2200 fields in total, split across 30 types. The type with the most documents has 200-300 fields. The decrease in performance between the first two runs is significant, at around 30-40%. Furthermore, the indexing rate keeps slowing down as more data gets indexed, which does not happen as much when doc_values are disabled (see run 1).

This test was run on the following hardware: 14 cores, 4 SSDs (multiple data paths), 30 GB heap size.

The cluster was CPU bound during all runs.

The merges can't keep up in the second run, and "indexing throttled" messages were showing up in the logs.
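For reference, doc_values are toggled per field in the index mapping. A minimal sketch of what such a toggle looks like (the index and field names here are hypothetical, and older Elasticsearch versions additionally require a mapping type level):

```json
PUT sparse-test
{
  "mappings": {
    "properties": {
      "some_sparse_field": {
        "type": "keyword",
        "doc_values": false
      }
    }
  }
}
```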

LucaWintergerst commented 7 years ago

After re-running the test with just one type and 200 fields, the indexing rate did not change significantly. Almost no change was visible in monitoring. Unfortunately I don't have exact numbers.

jpountz commented 7 years ago

@LucaWintergerst FYI Elasticsearch master is on Lucene 7 since Tuesday.

The Nested track has sparse fields by design due to the use of nested fields: fields that exist in the parent do not exist in children and vice-versa. The geonames dataset also has the elevation field which is only present in 26% of documents.
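To judge whether a candidate dataset is sparse enough for such a track, one could measure per-field density directly, the same way the 26% figure for `elevation` is derived. A minimal sketch (the sample documents below are made up for illustration):

```python
from collections import Counter

def field_density(docs):
    """Return, for each field, the fraction of documents in which it appears."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.keys())
    total = len(docs)
    return {field: n / total for field, n in counts.items()}

# Toy sample: 'elevation' appears in 1 of 4 documents (25% dense).
docs = [
    {"name": "a", "population": 10, "elevation": 500},
    {"name": "b", "population": 20},
    {"name": "c", "population": 30},
    {"name": "d"},
]
density = field_density(docs)
```

Running a script like this over a sample of the corpus would show how many fields sit below, say, 50% density, which is the interesting regime for a sparse-fields track.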

I'd be fine with adding a track that has more sparse fields, but only if it is realistic. Our recommendation still is and will always be to model documents in a way that fields are dense whenever possible.

LucaWintergerst commented 7 years ago

I do understand and fully support our stance against sparse fields, and therefore our recommendation for dense documents, but from what we see our customers do, this does not always apply. Oftentimes the source data only contains a subset of the fields that can appear in a document. While it is certainly possible to model the data or indices to counteract sparsity, it creates an additional overhead that a user might not be willing to pay without understanding why they should even care.

It can also be hard to defend this stance without having credible data in the form of benchmarks to convince the user otherwise. How bad is sparsity really? How much does it impact indexing, searching, index size and so on? I'm sure that you (or we) can answer these questions but I would still like to have data that I can show people to convince them otherwise. Most users don't even know about sparsity until we tell them.

cdahlqvist commented 7 years ago

@jpountz @LucaWintergerst @tsg I think this is a very important benchmark due to how Beats currently organises data. Metricbeat stores data related to all types of metrics in a single index, where each metric type has a prefix. As far as I know, the standard Metricbeat index template has well over 1000 fields defined and is probably only going to grow as new types of metrics are introduced. This type of data is likely to be generated at scale and will be sparse by design. The same also applies to Filebeat, which sends logs as well as output from its pre-configured modules to a single index.

jpountz commented 7 years ago

My only ask is to keep the track realistic. :) Also maybe there are things to reconsider on the beats side to create fewer sparse fields.

tsg commented 7 years ago

What would you think about adding a track with the data created by Metricbeat in its default configuration? We're working on improving the default configuration for 6.0, so I'd wait for that before doing it, but otherwise it seems to me like a pretty logical choice?

> Also maybe there are things to reconsider on the beats side to create fewer sparse fields.

It is possible and fairly easy to configure Metricbeat to create one index per module, in which case the data should be a lot more dense. But we thought that the drawbacks of doing that (more complicated index management for the user, potential shards explosion) are too big to make it the default. I'd be curious to know your thoughts on that.
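One way to get one index per module is conditional index routing in the Elasticsearch output. A hedged sketch of such a Metricbeat configuration (the module names and index patterns are illustrative, and the exact condition field may differ between Beats versions):

```yaml
output.elasticsearch:
  hosts: ["localhost:9200"]
  indices:
    - index: "metricbeat-system-%{+yyyy.MM.dd}"
      when.equals:
        metricset.module: "system"
    - index: "metricbeat-docker-%{+yyyy.MM.dd}"
      when.equals:
        metricset.module: "docker"
```

Each module's documents then share a mapping, making them dense, at the cost of the shard-count and index-management drawbacks mentioned above.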

I guess the same happens in many Logstash deployments as well, since multiple log types typically go into the same logstash-* index pattern.

pcsanwald commented 6 years ago

@tsg I'd be up for doing the work on the Rally side to add this track: I'm benchmarking a new aggregation and looking around for a dataset that contains sparse values, so this kind of data would potentially be quite useful. The thing I'd need is a substantial amount of Metricbeat data to use for the track: any thoughts here?