elastic/rally

Macrobenchmarking framework for Elasticsearch
Apache License 2.0

Document an example of how to extract HTTP responses during a Rally benchmark #791

Closed: jtibshirani closed this issue 4 years ago

jtibshirani commented 5 years ago

As part of developing a new search feature or optimization, I often run benchmarks by adding a new search operation to an existing rally track. I've found that it can be easy to make mistakes in setting up the benchmark. A few times I accidentally had an error in the search operation definition: for example, I misspelled the name of a field or structured the query incorrectly. Both the contender and baseline returned no results, but I simply thought my optimization wasn't working.

For this use case, I wonder if it'd be useful for rally to return the number of hits from a search operation, or even the whole search response. Seeing that the baseline and contender return the same number of results (and return more than 0 results) would give confidence that I've set up the benchmark correctly and help avoid some manual testing.
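
For illustration, the kind of schedule entry I mean looks roughly like this; the operation name and query body are made up, loosely modeled on the geonames track:

{
  "operation": {
    "name": "my-new-query",
    "operation-type": "search",
    "index": "geonames",
    "body": {
      "query": {
        "match": {
          "name": "font blanca"
        }
      }
    }
  },
  "clients": 1,
  "warmup-iterations": 500,
  "iterations": 1000
}

A typo anywhere in the body (a misspelled field, a misplaced clause) still produces a syntactically valid query, which is exactly why it is easy to benchmark the wrong thing.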

danielmitterdorfer commented 5 years ago

> A few times I accidentally had an error in the search operation definition, for example I misspelled the name of the field or structured the query incorrectly. Both the contender and baseline were returning no results, but I simply thought my optimization wasn't working.

By default, Rally will just continue on errors and report an error rate at the end of the benchmark, but I think you can already achieve what you're after by specifying --on-error=abort. It aborts the benchmark as soon as Elasticsearch returns an error and provides more context (i.e. the response body from Elasticsearch). Would that help you?

> I wonder if it'd be useful for rally to return the number of hits from a search operation, or even the whole search response

Rally already returns the number of hits in the metrics metadata but you need to configure a (dedicated) Elasticsearch metrics store for that. I think it would cover most of your needs. Here is an example from our nightly benchmarks:

{
  "environment": "nightly",
  "race-id": "6cac8c34-93b4-416e-a1ea-6c4e1b87f9fd",
  "race-timestamp": "20191005T200042Z",
  "@timestamp": 1570306375355,
  "relative-time": 41959,
  "track": "geonames",
  "challenge": "append-no-conflicts",
  "car": "defaults",
  "operation": "phrase",
  "task": "phrase",
  "name": "service_time",
  "operation-type": "Search",
  "sample-type": "warmup",
  "value": 3.7241727113723755,
  "unit": "ms",
  "meta": {
    "success": true,
    "distribution_flavor": "oss",
    "source_revision": "d598083051d46e7d44f577473dde52c4a32f6776",
    "distribution_version": "8.0.0-SNAPSHOT",
    "tag_license": "oss",
    "timed_out": false,
    "took": 1,
    "hits": 58,
    "hits_relation": "eq"
  }
}

In the meta.hits property we already track the number of hits, and the meta object probably contains a few more things you're interested in.
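
If you want to inspect those documents directly, a query along these lines against the metrics store should work; this is a sketch that assumes Rally's rally-metrics-* index naming and reuses the race-id from the example above:

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "race-id": "6cac8c34-93b4-416e-a1ea-6c4e1b87f9fd" } },
        { "term": { "operation": "phrase" } },
        { "term": { "name": "service_time" } }
      ]
    }
  },
  "_source": ["meta.hits", "meta.hits_relation", "meta.success"]
}

Comparing meta.hits between the baseline and contender races then shows whether both runs returned the same (non-zero) number of results.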

jtibshirani commented 4 years ago

> By default, Rally will just continue on errors and report an error rate at the end of the benchmark, but I think you can already achieve what you're after by specifying --on-error=abort.

In my use case (benchmarking different search strategies) this wouldn't help, since we usually don't error on unmapped fields. For example, if you perform a term query on a non-existent field, then we return a successful response with no hits. This is also very common behavior for aggregations. It also doesn't cover the case where I've structured the query slightly incorrectly and am not performing the operation I intended. This can happen surprisingly often given the complexity of the search DSL.
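
For example, a term query on a misspelled (and therefore unmapped) field, say feature_clas instead of the geonames field feature_class, comes back as HTTP 200 with zero hits rather than an error:

{
  "query": {
    "term": {
      "feature_clas": "T"
    }
  }
}

Because the response is "successful", --on-error=abort never triggers, and the only signal that something is wrong is the hit count.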

> Rally already returns the number of hits in the metrics metadata but you need to configure a (dedicated) Elasticsearch metrics store for that

Thanks, I wasn't aware we already tracked that information in the metadata. The next time I'm benchmarking, I will try out my workflow using a metrics store and loop back if I have further suggestions.

talevy commented 4 years ago

I, too, would like additional information about the response. When benchmarking new aggregations, I've sometimes accidentally benchmarked buggy implementations that returned incorrect results, sometimes no results at all, and showed really fast performance. Since these runs are black boxes, it is tough to know that I am benchmarking what I intended, whether the problem is a bug in the implementation or a bug in the rally track.

What if this search-response information came in the form of simply outputting a sample response from a single iteration, maybe with a flag like --print-sample-search-responses? This way it would be straightforward to check that the benchmark results actually measure what you intended.

P.S. I had no idea about --on-error=abort! Should this be the default?

danielmitterdorfer commented 4 years ago

Thanks for the feedback, both. You can already log full requests and responses with Rally today: set the log level to "DEBUG" for the logger "elasticsearch" in ~/.rally/logging.json. Beware that this will very likely cause a client-side bottleneck, so double-check that you reset it again when you're interested in the performance metrics.
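
The relevant logger entry looks roughly like this; this is an excerpt only, and the handler name is taken from Rally's default logging configuration, so it may differ between versions:

{
  "loggers": {
    "elasticsearch": {
      "handlers": ["rally_log_handler"],
      "level": "DEBUG",
      "propagate": false
    }
  }
}

If you then run your benchmark, you'll see output like the following: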

2019-12-16 14:56:08,389 -not-actor-/PID:9790 elasticsearch DEBUG > {"sort":[{"geonameid":"asc"}],"query":{"match_all":{}}}
2019-12-16 14:56:08,389 -not-actor-/PID:9790 elasticsearch DEBUG < {"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1000,"relation":"eq"},"max_score":null,"hits":[{"_index":"geonames","_type":"_doc","_id":"Lb81D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 2986043, "name": "Pic de Font Blanca", "asciiname": "Pic de Font Blanca", "alternatenames": "Pic de Font Blanca,Pic du Port", "feature_class": "T", "feature_code": "PK", "country_code": "AD", "admin1_code": "00", "population": 0, "dem": "2860", "timezone": "Europe/Andorra", "location": [1.53335, 42.64991]},"sort":[2986043]},{"_index":"geonames","_type":"_doc","_id":"Lr81D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 2993838, "name": "Pic de Mil-Menut", "asciiname": "Pic de Mil-Menut", "alternatenames": "Pic de Mil-Menut", "feature_class": "T", "feature_code": "PK", "country_code": "AD", "cc2": "AD,FR", "admin1_code": "B3", "admin2_code": "09", "admin3_code": "091", "admin4_code": "09024", "population": 0, "dem": "2138", "timezone": "Europe/Andorra", "location": [1.65, 42.63333]},"sort":[2993838]},{"_index":"geonames","_type":"_doc","_id":"L781D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 2994701, "name": "Roc Mélé", "asciiname": "Roc Mele", "alternatenames": "Roc Mele,Roc Meler,Roc Mélé", "feature_class": "T", "feature_code": "MT", "country_code": "AD", "cc2": "AD,FR", "admin1_code": "00", "population": 0, "dem": "2803", "timezone": "Europe/Andorra", "location": [1.74028, 42.58765]},"sort":[2994701]},{"_index":"geonames","_type":"_doc","_id":"ML81D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 3007683, "name": "Pic des Langounelles", "asciiname": "Pic des Langounelles", "alternatenames": "Pic des Langounelles", "feature_class": "T", "feature_code": "PK", "country_code": "AD", "cc2": "AD,FR", "admin1_code": "00", "population": 0, "dem": "2685", "timezone": "Europe/Andorra", "location": [1.47364, 42.61203]},"sort":[3007683]},{"_index":"geonames","_type":"_doc","_id":"Mb81D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 3017832, "name": "Pic de les Abelletes", "asciiname": "Pic de les Abelletes", "alternatenames": "Pic de la Font-Negre,Pic de la Font-Nègre,Pic de les Abelletes", "feature_class": "T", "feature_code": "PK", "country_code": "AD", "cc2": "FR", "admin1_code": "A9", "admin2_code": "66", "admin3_code": "663", "admin4_code": "66146", "population": 0, "dem": "2411", "timezone": "Europe/Andorra", "location": [1.73343, 42.52535]},"sort":[3017832]},{"_index":"geonames","_type":"_doc","_id":"Mr81D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 3017833, "name": "Estany de les Abelletes", "asciiname": "Estany de les Abelletes", "alternatenames": "Estany de les Abelletes,Etang de Font-Negre,Étang de Font-Nègre", "feature_class": "H", "feature_code": "LK", "country_code": "AD", "cc2": "FR", "admin1_code": "A9", "population": 0, "dem": "2260", "timezone": "Europe/Andorra", "location": [1.73362, 42.52915]},"sort":[3017833]},{"_index":"geonames","_type":"_doc","_id":"M781D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 3023203, "name": "Port Vieux de la Coume d’Ose", "asciiname": "Port Vieux de la Coume d'Ose", "alternatenames": "Port Vieux de Coume d'Ose,Port Vieux de Coume d’Ose,Port Vieux de la Coume d'Ose,Port Vieux de la Coume d’Ose", "feature_class": "T", "feature_code": "PASS", "country_code": "AD", "admin1_code": "00", "population": 0, "dem": "2687", "timezone": "Europe/Andorra", "location": [1.61823, 42.62568]},"sort":[3023203]},{"_index":"geonames","_type":"_doc","_id":"NL81D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 3029315, "name": "Port de la Cabanette", "asciiname": "Port de la Cabanette", "alternatenames": "Port de la Cabanette,Porteille de la Cabanette", "feature_class": "T", "feature_code": "PASS", "country_code": "AD", "cc2": "AD,FR", "admin1_code": "B3", "admin2_code": "09", "admin3_code": "091", "admin4_code": "09139", "population": 0, "dem": "2379", "timezone": "Europe/Andorra", "location": [1.73333, 42.6]},"sort":[3029315]},{"_index":"geonames","_type":"_doc","_id":"Nb81D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 3034945, "name": "Port Dret", "asciiname": "Port Dret", "alternatenames": "Port Dret,Port de Bareites,Port de las Bareytes,Port des Bareytes", "feature_class": "T", "feature_code": "PASS", "country_code": "AD", "admin1_code": "00", "population": 0, "dem": "2660", "timezone": "Europe/Andorra", "location": [1.45562, 42.60172]},"sort":[3034945]},{"_index":"geonames","_type":"_doc","_id":"Nr81D28Bu7VEEZ3mXFGw","_score":null,"_source":{"geonameid": 3038814, "name": "Costa de Xurius", "asciiname": "Costa de Xurius", "feature_class": "T", "feature_code": "SLP", "country_code": "AD", "admin1_code": "07", "population": 0, "dem": "1839", "timezone": "Europe/Andorra", "location": [1.47569, 42.50692]},"sort":[3038814]}]}}

> P.S. I had no idea about --on-error=abort! Should this be the default?

It can happen that we intentionally push Elasticsearch to provoke e.g. bulk index rejections, and in that case we just want to record that we went too far, not necessarily abort the whole benchmark.

In any case, I sense we should improve the docs to include examples for such scenarios?

talevy commented 4 years ago

> In any case, I sense we should improve the docs to include examples for such scenarios?

Having this in the docs would have prevented my usage mistakes. +1 to examples in the docs. Thanks for the info!

talevy commented 4 years ago

> It can happen that we intentionally push Elasticsearch to provoke e.g. bulk index rejections

Is this the norm rather than the exception?

danielmitterdorfer commented 4 years ago

> It can happen that we intentionally push Elasticsearch to provoke e.g. bulk index rejections
>
> Is this the norm rather than the exception?

Honestly, I am hesitant to change the default behavior for this flag. It is common among load testing tools to continue on errors (JMeter does this by default as well), and in our experiments we usually want a benchmark to continue even when an error is indicated by an error rate > 0 in the report. For example, if one out of many benchmarked queries fails, e.g. due to changed syntax or tighter restrictions, all of the other benchmark results are usually still fine and we would not want to lose them. (This is, by the way, why we run Rally's integration tests in CI with the stricter --on-error=abort: to spot such issues before they occur in the nightly benchmarks.) However, we then also investigate and try to eliminate or understand the cause.

danielmitterdorfer commented 4 years ago

"Implementation" notes: We should add a new page to our documentation that describes tips & tricks when benchmarking with specific examples (in a "recipe" style). The purpose of this example is to explain how to get request and response when running a benchmark with Rally (see my comment above how it's done). We should also mention the caveats (debug logging should not be enabled in a "real" benchmark).

gingerwizard commented 4 years ago

I propose adding the following recipes:

  1. Identifying when errors have been encountered
  2. How to capture query responses and hit counts