benwtrent opened this issue 3 years ago (status: Open)
Pinging @elastic/es-data-management (Team:Data Management)
This looks REALLY suspicious: https://github.com/elastic/elasticsearch/blob/e801035acb293c7d2c6f2a8d38b5396dce3312bc/server/src/main/java/org/elasticsearch/ingest/IngestService.java#L507
If the index request is the same object as the one in the retried bulk request, no wonder the pipeline goes away: we set it to NOOP after the first execution.
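For illustration, here is a minimal, self-contained sketch of the suspected interaction, with made-up class and method names rather than the real Elasticsearch ones: ingest mutates the index request in place (resetting its pipeline to a NOOP marker), and a retried bulk request that reuses the same object therefore never re-runs the pipeline.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the suspected behavior; all names are made up.
class SketchOfPipelineLoss {
    static final String NOOP_PIPELINE = "_none";

    static class IndexRequest {
        String pipeline;
        IndexRequest(String pipeline) { this.pipeline = pipeline; }
    }

    static class BulkRequest {
        final List<IndexRequest> requests = new ArrayList<>();
    }

    // Stand-in for what the ingest service appears to do around the linked line:
    // after executing the pipeline, the request is mutated in place.
    static void executeIngest(IndexRequest request) {
        // ... run the "classification" pipeline here ...
        request.pipeline = NOOP_PIPELINE; // mutation survives into any retry that reuses this object
    }

    public static void main(String[] args) {
        BulkRequest original = new BulkRequest();
        original.requests.add(new IndexRequest("classification"));

        executeIngest(original.requests.get(0)); // first attempt fails with a 429 after ingest ran

        // Stand-in for the retry path: the new bulk request reuses the same IndexRequest object.
        BulkRequest retried = new BulkRequest();
        retried.requests.add(original.requests.get(0));

        // Prints "_none": the retry no longer knows about the original pipeline.
        System.out.println(retried.requests.get(0).pipeline);
    }
}
```

If this matches the real flow, either copying the pipeline-relevant fields or not reusing the mutated request object on retry would seem to be the shape of a fix.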
@benwtrent - can you retry your test with a default pipeline and/or a final pipeline? I am curious whether that could be a workaround.
I definitely can tomorrow. Silently ignoring something is pretty bad.
> Silently ignoring something is pretty bad
Agreed. However, I think the only way an individual bulk item can return a 429 is if an ingest processor does this, and I think enrich is the only processor that can do this, when the ingest rate dramatically outpaces the ability to execute the underlying search for enrichment. In that case it would have to happen via the reindex workflow (I cannot think of other workflows that would re-use the request like this via an internal client). Normally a 429 would apply to the entire bulk request, and I am unsure if this bug presents in that case (if so, that is worse).
@jakelandis Both `final_pipeline` and `default_pipeline` are also silently ignored. I tried the same scenario (a processor that always throws a 429) with both, and the document only hits the pipeline once and is then indexed, bypassing the pipelines.
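For reference, a sketch of how those two settings would be wired onto the destination index using Elasticsearch's Settings builder; the pipeline names here are placeholders, not necessarily the ones used in the actual test:

```java
import org.elasticsearch.common.settings.Settings;

// Placeholder index settings; "classification" / "classification-final" are illustrative names.
public class WorkaroundSettingsSketch {
    public static void main(String[] args) {
        Settings destIndexSettings = Settings.builder()
            .put("index.default_pipeline", "classification")
            .put("index.final_pipeline", "classification-final")
            .build();
        System.out.println(destIndexSettings);
    }
}
```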
> I think the only way an individual bulk item can return a 429 is if the ingest processor does this
I would also expect us to return 429 on individual items if ~~the coordinating node or~~ the primary had to push back because its write threadpool queue was full or it hit `indexing_pressure.memory.limit`.

(edit: actually not sure if the coordinating node can reject individual items, but the primary definitely can)
@jakelandis enrich isn't the only processor (or soon won't be) that throws a 429 on individual requests: https://github.com/elastic/elasticsearch/pull/78757
Thanks for the additional context and testing. This really helps to evaluate the impact of the bug.
@jakelandis Any idea if this bug can be prioritised for 8.0? This becomes a pretty big failure for reindexing with inference ingest processors, which can sometimes time out due to resource (CPU) constraints. It's a scenario we'd like to have a good story around for GA with the new NLP/PyTorch model inference feature.
**Elasticsearch version** (`bin/elasticsearch --version`): 8.0.0 (probably earlier)

**Description of the problem including expected versus actual behavior**: When an individual bulk action fails with a 429, the retry should still use the original indexing pipeline. Instead, the pipeline (at least during reindex) is dropped from the re-created bulk request.

**Steps to reproduce**: I have a local WIP processor that throws a 429 manually to recreate this; a minimal sketch of such a processor is below.
**Provide logs (if relevant)**:
I added a bunch of logging lines (including the pipeline attached to the index request in the output), and here is what I got:
You can see from the first line, `[V_JiV3wBwot2XBnG8O0d][classification]`, that the pipeline `classification` was used. Then in the second line (in this method I log the `currentBulkRequest`), the pipeline from the request is gone: `[V_JiV3wBwot2XBnG8O0d][_none]`.
https://github.com/elastic/elasticsearch/blob/68817d7ca29c264b3ea3f766737d81e2ebb4028c/server/src/main/java/org/elasticsearch/action/bulk/Retry.java#L130
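Around that line, the retry bulk request appears to be rebuilt from the original request objects. A simplified, hypothetical rendering of that logic (not the actual source; the names are illustrative) shows why the `_none` pipeline set during the first attempt carries over:

```java
import org.elasticsearch.action.bulk.BulkItemResponse;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.rest.RestStatus;

// Hypothetical simplification of the retry path near the linked line; the real
// implementation differs, but the key point is that the original request objects
// are reused rather than copied.
public class RetrySketch {

    static BulkRequest createBulkRequestForRetry(BulkResponse response, BulkRequest currentBulkRequest) {
        BulkRequest requestToReissue = new BulkRequest();
        int index = 0;
        for (BulkItemResponse item : response.getItems()) {
            if (item.isFailed() && item.status() == RestStatus.TOO_MANY_REQUESTS) {
                // The same request instance from the first attempt is re-added, so any
                // in-place mutation (e.g. the pipeline reset to "_none" after ingest ran)
                // is exactly what the retry will see.
                requestToReissue.add(currentBulkRequest.requests().get(index));
            }
            index++;
        }
        return requestToReissue;
    }
}
```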