Closed knocte closed 7 years ago
@knocte Would your use-case be solved if the response of the Bulk API contained a field which showed the number of failed insertions? This way you will not need to parse the whole response if all are successful, you can just check the value of this field is 0. If the value is not 0 then you can parse the rest of the response to find the cause.
@colings86 we already have an error flag in the bulk response, but there is something to be said for returning a minimal body if all requests succeeded.
I wouldn't be in favour of not reporting failures though, and given that requests are ordered, that implies the need to return all responses.
Sorry, I no longer use ElasticSearch
So maybe we should have an option in the request to send back a minimal response if there are no errors and a full response if there is even 1 error.
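The client-side pattern already possible with the existing error flag (read the flag first, walk the per-item results only when it is set) could look roughly like this. It is a minimal sketch that assumes the response has already been parsed into a dict with the top-level `errors` and `items` fields; the helper name `failed_items` and the sample responses are made up for illustration.

```python
from __future__ import annotations

from typing import Any


def failed_items(response: dict[str, Any]) -> list[tuple[int, str, dict[str, Any]]]:
    """Return (position, action, result) for every failed bulk item.

    The per-item walk is skipped entirely when the top-level flag
    reports that nothing failed.
    """
    if not response.get("errors"):
        return []
    failures = []
    for position, item in enumerate(response.get("items", [])):
        # Each item is a one-key dict such as {"index": {...}} or {"delete": {...}}.
        for action, result in item.items():
            if "error" in result:
                failures.append((position, action, result))
    return failures


# Hypothetical sample responses, trimmed to the fields used here.
ok = {"took": 3, "errors": False, "items": [{"index": {"_id": "1", "status": 201}}]}
bad = {
    "took": 5,
    "errors": True,
    "items": [
        {"index": {"_id": "1", "status": 201}},
        {"index": {"_id": "2", "status": 400,
                   "error": {"type": "mapper_parsing_exception"}}},
    ],
}

print(failed_items(ok))
print([(pos, action) for pos, action, _ in failed_items(bad)])
```

This only saves the walk over `items`, not the transfer or JSON decoding of the full body, which is the part the requested option would address.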
I would be in favor of such a feature. For example, if I use the search API and I execute a query and I'm only interested in the number of hits, not the actual hits as well, I add ?search_type=count and I don't get back the possibly long list of hits which don't actually interest me.
Maybe add an attribute for _bulk - something like "response_type". With three values: "full" - to get back what we get today for _bulk, "partial_with_failures" - minimal response if there are no errors and a full response if there is at least one error, "minimal" - minimal response no matter if there were errors or not.
I don't like the "minimal" response in case of errors, but I do like an optional short response if there were no errors.
Very nice, this is really overdue. Have you thought about returning only failed docs in the response? Or even just ordinal numbers of failed docs. I assume people often want to further treat unsuccessful docs in some way, e.g. retry after modifications, store unstructured, ...
Also, instead of returning 200 or 201, there are perhaps more suitable response codes in this situation, which would make the flag check inside the response obsolete.
Perhaps 207 or 422, depending on whether you want to hint success or failure, respectively.
The 207 (Multi-Status) status code provides status for multiple independent operations (see Section 13 for more information). A Multi-Status response conveys information about multiple resources in situations where multiple status codes might be appropriate. The default Multi-Status response body is a text/xml or application/xml HTTP entity with a 'multistatus' root element. Further elements contain 200, 300, 400, and 500 series status codes generated during the method invocation. 100 series status codes SHOULD NOT be recorded in a 'response' XML element. Although '207' is used as the overall response status code, the recipient needs to consult the contents of the multistatus response body for further information about the success or failure of the method execution. The response MAY be used in success, partial success and also in failure situations. The 'multistatus' root element holds zero or more 'response' elements in any order, each with information about an individual resource. Each 'response' element MUST have an 'href' element to identify the resource. MULTI_STATUS(207)
The 422 (Unprocessable Entity) status code means the server understands the content type of the request entity (hence a 415(Unsupported Media Type) status code is inappropriate), and the syntax of the request entity is correct (thus a 400 (Bad Request) status code is inappropriate) but was unable to process the contained instructions. For example, this error condition may occur if an XML request body contains well-formed (i.e., syntactically correct), but semantically erroneous, XML instructions.
+1
It has been over 3 years, guys. Is there any hope of having this implemented? The Bulk API is quite memory hungry because of this issue.
The adoptme and low hanging fruit labels mean the repository owners would love to review and merge a pull request that implements the feature, but it isn't anyone's priority. If anyone wants to write a PR for this I'll certainly review it and, once we get to the other side of the review process, merge it.
I'll just go do it.
I'll make an option that will just count successful operations rather than returning the whole result for it. Any failures will still come back because you can act on them.
Thanks Nik!
Sorry for chiming in late, especially in the context of #17932. I am not a fan of this change: a short response format breaks the structure of how we return responses for bulk requests, and no longer relies on position within an array to match response to request. It means special handling code, which differs depending on a request parameter, and that is not user friendly.
I also challenge the claim that parsing the response of a bulk request is "heavy". Compared to setting up the requests to be indexed (which is the "other" client side work), I am very surprised that parsing the response takes so long relative to the whole bulk execution. This is definitely not the case in Java.
I would be ok with a flag that will simply not return anything except for top level summary fields if everything is successful. This is probably the most common case, and if something failed, just use the current way to correlate request/response.
For what it is worth I don't have a strong opinion on how this should come out or even if we merge #17932. I just implemented my first instinct and my first instinct is rarely right. My only strong opinion is "if Elasticsearch doesn't want any sort of short format on the bulk response then we need to close this issue".
> I would be ok with a flag that will simply not return anything except for top level summary fields if everything is successful. This is probably the most common case, and if something failed, just use the current way to correlate request/response.
+1
> I would be ok with a flag that will simply not return anything except for top level summary fields if everything is successful. This is probably the most common case, and if something failed, just use the current way to correlate request/response.
I think the summary should be the default, and the current response should be behind a verbose flag. Otherwise +1.
> I think the summary should be the default, and the current response should be behind a verbose flag. Otherwise +1.
This would be a breaking change, so best to keep the default as it is today.
The default should always be the most common case. In addition, the current default behavior requires more CPU, memory, and network resources, which seems like a bad thing. Besides, aren't breaking API changes the norm for this project? It seems like every major version, and sometimes even a minor version, has major changes to the API.
@rpedela We think long and hard about every breaking change that we make, and try to find ways to smooth the migration. This is a good example of something that we shouldn't break.
Rediscussed this in FixItFriday. We're loath to change the API (which pushes complexity towards the clients and runs the risk of bugs) without any benchmarks demonstrating how much performance gain there is from making this change. The bulk response is very compressible, so we could well find that leaving out the bulk items makes a negligible difference.
I'm going to close this for now, but feel free to reopen if you can show a significant difference in performance when compression is enabled.
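One rough way to get a feel for the compressibility claim is to gzip a synthetic response and compare sizes. The sketch below fabricates 10,000 near-identical item entries; the index name, field values, and item count are assumptions rather than measurements from a real cluster, so it only illustrates the effect and is not the benchmark asked for above.

```python
import gzip
import json


def synthetic_bulk_response(n_items: int) -> dict:
    """Build a made-up bulk response with n_items successful index results."""
    items = [
        {
            "index": {
                "_index": "logs",  # hypothetical index name
                "_id": str(i),
                "_version": 1,
                "result": "created",
                "_shards": {"total": 2, "successful": 2, "failed": 0},
                "status": 201,
            }
        }
        for i in range(n_items)
    ]
    return {"took": 30, "errors": False, "items": items}


raw = json.dumps(synthetic_bulk_response(10_000)).encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw bytes:  {len(raw)}")
print(f"gzip bytes: {len(compressed)}")
print(f"ratio:      {len(raw) / len(compressed):.1f}x")
```

Real responses carry auto-generated IDs and more varied fields, so the ratio on production traffic may well be lower than what this synthetic payload shows.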
@kimchy @nik9000 I came across this thread while researching bulk api requests format. It's worth mentioning here that bulk api documentation doesn't mention that request/response are correlated by array order. Would you mind adding that as an API guarantee?
I use ES in a production environment which indexes hundreds of billions of documents, and we retry individual documents that fail via the bulk API. Since the documentation doesn't guarantee ordering of the response, we had been matching request/response items via index/type/id, but there is a subtle problem with that approach. We use index aliases, and when you do that, the request will have the alias but the response has the physical index name, so a mismatch occurs.
@garakelian I believe that's covered here:
When the bulk API returns, it will provide a status for each action (in the same order it was sent in) so that you can check if a specific action failed or not.
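Because results come back in the order the actions were sent, a client can pair them by position instead of by index/type/id, which sidesteps the alias mismatch described above. A minimal sketch, assuming the caller keeps its actions in a list and has the parsed response as a dict; `actions_to_retry`, the shape of the action records, and the sample data are illustrative and not part of any client library.

```python
from __future__ import annotations


def actions_to_retry(actions: list[dict], response: dict) -> list[dict]:
    """Pair each sent action with its result by position and keep the failures."""
    items = response.get("items", [])
    if len(items) != len(actions):
        # Positional matching only works on an unfiltered, complete response.
        raise ValueError("response does not contain one item per action")
    retry = []
    for action, item in zip(actions, items):
        # Each item is a one-key dict keyed by the action type.
        result = next(iter(item.values()))
        if "error" in result:
            retry.append(action)
    return retry


# Hypothetical request bookkeeping: the request names an alias ...
actions = [
    {"index": {"_index": "logs-alias", "_id": "a"}, "doc": {"msg": "one"}},
    {"index": {"_index": "logs-alias", "_id": "b"}, "doc": {"msg": "two"}},
]
# ... while the response reports the physical index name.
response = {
    "errors": True,
    "items": [
        {"index": {"_index": "logs-000001", "_id": "a", "status": 201}},
        {"index": {"_index": "logs-000001", "_id": "b", "status": 429,
                   "error": {"type": "es_rejected_execution_exception"}}},
    ],
}

print([a["index"]["_id"] for a in actions_to_retry(actions, response)])  # prints ['b']
```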
@jasontedor Thanks for the quick reply. I missed that page, thank you. Do you know if a PR would be welcome to update the documentation on the other page I referenced (to point to the page you linked), and/or to add the note about the index name being different in the reply when aliases are being used?
An update to the bulk API documentation that you referenced would be welcomed (I would prefer explicitly mentioning the API guarantee that we provide, rather than linking elsewhere in our docs).
As for the response when indexing using an alias, I think that is covered in the alias docs (if it's not, it should be here), but I don't think that the bulk API docs are the right place (should be more general).
Does that make sense?
@jasontedor Sure. I'll submit a PR based on your suggestion... Thanks.
Thank you @garakelian.
@clintongormley, I am not sure whether performance would improve much with a short return message, but I am sure it would save a lot of money on network traffic in a cloud environment. Most cloud platforms do not charge for inbound traffic and only charge for outbound traffic. When we deploy a cloud-based ES system and clients call the bulk API, the request from the client is inbound traffic for the cloud ES, which is free, but the response to the client is outbound traffic, which is billed at some rate.
I came across this thread while looking for a way to reduce the response from Bulk API. A few days later, I discovered the filter_path parameter that drastically reduced the response size for my large bulk requests. This seems to have existed since 2015.
Is there any reason using filter_path wouldn't solve the problems discussed in this thread?
@wnojopra I stumbled upon this issue because I have a timeout problem using the Bulk API that appeared after I enabled SSL on ElasticSearch 8.5.
At some point ES does not send the last few bytes of the huge response (~50k docs per request), so the request hangs until it reaches the timeout, retries, and creates duplicates, since I relied on ElasticSearch to generate the IDs.
Long story short, I stumbled upon ?filter_path=took,errors,items.*.error to return only the items with errors. It "solved" the issue in a way, as it heavily reduces the response size, but it introduces new challenges:
1) The BulkProcessor does NOT expect missing values when building BulkResponseItem (e.g. a missing shard) and raises a runtime error (for example, a null value where an int was expected).
2) The per-doc retry logic of the BulkProcessor relies on the position of the failed item in the response to find the corresponding request, so the retry logic breaks.
An easy workaround for 2) would have been to include, in the response, the position of the item within the request.
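For reference, the filtered request mentioned in this comment can be assembled as below, together with a tolerant reader for the trimmed response. This is a sketch under assumptions: the host URL is a placeholder, `bulk_url` and `error_types` are hypothetical helper names, and the reader assumes that `items` may be missing or shortened once the filter is applied, so it does not use positions in `items` to locate the original request.

```python
from __future__ import annotations

from urllib.parse import urlencode


def bulk_url(base: str, filter_path: str) -> str:
    """Build a _bulk URL, keeping ',' and '*' readable in the query string."""
    query = urlencode({"filter_path": filter_path}, safe=",*")
    return f"{base}/_bulk?{query}"


def error_types(response: dict) -> list[str | None]:
    """Collect error types from a filtered response without assuming
    that any other per-item field (status, _shards, ...) is present."""
    found = []
    for item in response.get("items", []):
        for result in item.values():
            error = result.get("error")
            if error is not None:
                found.append(error.get("type"))
    return found


url = bulk_url("http://localhost:9200", "took,errors,items.*.error")
print(url)

# Hypothetical filtered responses: one clean, one with a single failure.
clean = {"took": 4, "errors": False}
failed = {
    "took": 12,
    "errors": True,
    "items": [{"index": {"error": {"type": "mapper_parsing_exception"}}}],
}
print(error_types(clean))
print(error_types(failed))
```

Reading with `.get` avoids the null-value failures described in 1), but it does nothing for 2): without a position in each returned item, the client still cannot tell which request a failure belongs to.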
When using the bulk API with a lot of documents (say 100K), it's kind of pointless to get a response for each of those documents because the response is huge! Can't we have a way to call this API and only receive a response that is either:
I know that I could probably use Bulk UDP (and I'll probably give it a go soon), but UDP doesn't receive any response whatsoever, so it's like having black&white, but not any shade of grey.
Thanks!