elastic / elasticsearch-rs

Official Elasticsearch Rust Client
https://www.elastic.co/guide/en/elasticsearch/client/rust-api/current/index.html
Apache License 2.0

[ENHANCEMENT] Bulk API helper functions #62

Open russcam opened 4 years ago

russcam commented 4 years ago

The bulk API can be used to index multiple documents into Elasticsearch by constructing a bulk request containing the documents and executing it against Elasticsearch. When the number of documents is large, however, a consumer needs to construct multiple bulk requests, each containing a slice of the documents to be indexed, and execute these against Elasticsearch.

Many of the existing Elasticsearch clients provide a "bulk helper" for this purpose. The helper can:

  1. be provided a large collection of documents: this could be a stream, a lazy iterable, etc.
  2. slice the collection into "chunks": this could be by number of documents or by request byte size
  3. execute multiple concurrent requests against Elasticsearch to index documents
  4. optionally back off and retry indexing documents that fail to be indexed, signalled by a 429 Too Many Requests HTTP response.

An example helper is the BulkAllObservable from the C#/.NET client.

The Rust client should provide a similar, idiomatic way of helping consumers bulk index a large collection of documents.
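As a very rough sketch of the shape such a helper could take, here is chunking by document count (steps 1 and 2 from the list above) built on the crate's existing bulk API. The function name, error handling, and the decision to let Elasticsearch generate ids are hypothetical, not part of the crate:

use elasticsearch::{http::request::JsonBody, BulkParts, Elasticsearch, Error};
use serde_json::{json, Value};

// Hypothetical helper: index `docs` into `index` in chunks of `chunk_size` documents.
async fn bulk_index_chunked(
    client: &Elasticsearch,
    index: &str,
    docs: Vec<Value>,
    chunk_size: usize,
) -> Result<(), Error> {
    for chunk in docs.chunks(chunk_size) {
        // Two body lines per document: the operation header and the source.
        let mut body: Vec<JsonBody<Value>> = Vec::with_capacity(chunk.len() * 2);
        for doc in chunk {
            body.push(json!({"index": {}}).into()); // no "_id": let Elasticsearch generate one
            body.push(doc.clone().into());
        }
        let response = client
            .bulk(BulkParts::Index(index))
            .body(body)
            .send()
            .await?;
        if !response.status_code().is_success() {
            // A full helper would inspect per-item results here and retry
            // items that failed with 429 Too Many Requests, with backoff.
        }
    }
    Ok(())
}

Steps 3 (concurrency) and 4 (backoff and retry) are the parts that deserve the most design discussion, since they determine how the helper reports partial failures back to the consumer.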

Stargateur commented 3 years ago

I don't even understand how to do a bulk...

body.push(json!({"index": {"_id": "1"}}).into());
body.push(json!({
    "id": 1,
    "user": "kimchy",
    "post_date": "2009-11-15T00:00:00Z",
    "message": "Trying out Elasticsearch, so far so good?"
}).into());

// add the second operation and document
body.push(json!({"index": {"_id": "2"}}).into());
body.push(json!({
    "id": 2,
    "user": "forloop",
    "post_date": "2020-01-08T00:00:00Z",
    "message": "Bulk indexing with the rust client, yeah!"
}).into());

let response = client
    .bulk(BulkParts::Index("tweets"))
    .body(body)
    .send()
    .await?;

so "_id" is set to "1" (not zero.... anyway I past over non sense now), then "_id" because "id" in the document, why repeat the information twice ? why it's take two line to do one operation ? Then Index is "tweets". so no id or _id ? I don't understand a thing of all of this, the doc is unreadable.

Reading, https://www.elastic.co/fr/blog/what-is-an-elasticsearch-index:

What exactly is an index in Elasticsearch? Despite being a very basic question, the answer is surprisingly nuanced.

nuanced...

An index is like a ‘database’ in a relational database. It has a mapping which defines multiple types. An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.

"like"

This line doesn't make any sense to me. There is no mention of any "id" or "_id", and no clear link to where to find the information.

So what does this mean? Do I need to create an id? Does it need to be unique? Do I really need to put it in the actual document? Can I use an integer? Is there a limit on operations? So many questions. If I need to write 99% of the HTTP myself, I don't see what this crate is for.

I will find the information somewhere for sure (I already found some), but this should be clear; for now I have to play Cluedo every time I deal with anything related to ES.

Working with Elasticsearch is painful!

russcam commented 3 years ago

@Stargateur In answer to your questions

so "_id" is set to "1" (not zero.... anyway I past over non sense now)

You can set "_id" to whatever string value you want, including "0". You can also pass a numeric value and it'll be coerced to a string value.

then "_id" because "id" in the document, why repeat the information twice ?

"_id" is the id for the document and will form part of the document metadata. When you want to retrieve a document by id, this is the id. "id" in the document is simply another field in the document. It isn't necessary to have "id" in the document as "_id" is part of the document metadata, but if you're modelling documents with structs, it might be convenient to also persist the id value in a field so that it is deserialized onto the struct id field when deserializing documents, because the source JSON document is returned in the "_source" field and document metadata is returned in other fields.

Why does it take two lines to do one operation?

The bulk API expects newline delimited JSON, where the operation to perform and the optional document involved in the operation are on consecutive lines. For example, a bulk index operation:

{"index": {"_id": "1"}}
{"id": 1, "user": "kimchy", "post_date": "2009-11-15T00:00:00Z", "message": "Trying out Elasticsearch, so far so good?"}

The first line contains information about the operation, the second line is the document. Take a look at the bulk API documentation for more details.

Reading, https://www.elastic.co/fr/blog/what-is-an-elasticsearch-index:

What exactly is an index in Elasticsearch? Despite being a very basic question, the answer is surprisingly nuanced.

nuanced...

An index is like a ‘database’ in a relational database. It has a mapping which defines multiple types. An index is a logical namespace which maps to one or more primary shards and can have zero or more replica shards.

"like"

The blog post is from 7 years ago and the database analogy has been retired as it isn't a great one. There's a better overview on the website.

An index in Elasticsearch is a collection of documents. An index has a mapping, either explicitly defined or implicitly inferred from the documents indexed into it, that indicates how fields in documents are mapped in Elasticsearch. For example, the "user" field in a tweet document might be mapped as a keyword field for structured search, sorting and aggregating.
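For example, an explicit mapping can be set when creating an index through this crate. A sketch, with illustrative index and field definitions:

use elasticsearch::{indices::IndicesCreateParts, Elasticsearch, Error};
use serde_json::json;

async fn create_tweets_index(client: &Elasticsearch) -> Result<(), Error> {
    // Map "user" as a keyword field for structured search, sorting and
    // aggregating, and "message" as analyzed full text.
    client
        .indices()
        .create(IndicesCreateParts::Index("tweets"))
        .body(json!({
            "mappings": {
                "properties": {
                    "user":      { "type": "keyword" },
                    "message":   { "type": "text" },
                    "post_date": { "type": "date" }
                }
            }
        }))
        .send()
        .await?;
    Ok(())
}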

A running instance of Elasticsearch is typically referred to as an Elasticsearch cluster. A cluster is made up of one or more nodes. An index may persist data across more than one node for high availability, fault tolerance, etc. An index is made up of one or more shards, and it is these shards that may be spread across more than one node.

So what does this mean? Do I need to create an id? Does it need to be unique? Do I really need to put it in the actual document? Can I use an integer?

You don't need to create an id for a document; if you don't supply one, Elasticsearch will generate one when indexing the document. This can be a suitable approach for many types of data. You typically want to assign an id, however, in cases where you need to identify and operate on a specific document later. The id must be unique within the index: if you index a document with an id that already exists, it overwrites the previous document with that id. You can use an integer for the id.
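A sketch of both cases with the single-document index API (index name and documents are illustrative):

use elasticsearch::{Elasticsearch, Error, IndexParts};
use serde_json::json;

async fn id_examples(client: &Elasticsearch) -> Result<(), Error> {
    // No id given: Elasticsearch generates one and returns it as "_id".
    client
        .index(IndexParts::Index("tweets"))
        .body(json!({ "user": "kimchy", "message": "auto-generated id" }))
        .send()
        .await?;

    // Explicit id: indexing again with the same id overwrites the document.
    client
        .index(IndexParts::IndexId("tweets", "1"))
        .body(json!({ "user": "kimchy", "message": "overwrites any existing doc 1" }))
        .send()
        .await?;
    Ok(())
}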

Is there a limit on operations? So many questions.

The bulk API sends multiple documents in one HTTP request, so you typically want to keep the overall size of one bulk request reasonable, around 5 MB. If you have lots of documents to index, you may want to send multiple concurrent bulk requests, which is what this issue is about.
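As a sketch of the concurrent part, using the futures crate; both the chunk size of 500 and the concurrency limit of 4 are arbitrary choices for the example, and the function itself is hypothetical:

use elasticsearch::{http::request::JsonBody, BulkParts, Elasticsearch, Error};
use futures::stream::{self, StreamExt, TryStreamExt};
use serde_json::{json, Value};

// Hypothetical: send bulk requests concurrently, at most 4 in flight at a time.
async fn bulk_index_concurrent(
    client: &Elasticsearch,
    index: &str,
    docs: Vec<Value>,
) -> Result<(), Error> {
    stream::iter(docs.chunks(500).map(|chunk| chunk.to_vec()))
        .map(|chunk| async move {
            let mut body: Vec<JsonBody<Value>> = Vec::with_capacity(chunk.len() * 2);
            for doc in chunk {
                body.push(json!({"index": {}}).into());
                body.push(doc.into());
            }
            client.bulk(BulkParts::Index(index)).body(body).send().await
        })
        .buffer_unordered(4) // up to 4 concurrent HTTP requests
        // a real helper would inspect each response for per-item failures here
        .try_for_each(|_response| async { Ok::<(), Error>(()) })
        .await
}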

If I need to write 99% of the HTTP myself, I don't see what this crate is for.

You're free to choose whether to use this crate or not. There are many reasons why you might want to though, one of which is that it provides functions for all stable Elasticsearch APIs.

I will find the information somewhere for sure (I already found some), but this should be clear; for now I have to play Cluedo every time I deal with anything related to ES.

The best place to ask questions is the discuss forums, which has a lot of active community members willing to share knowledge and help with issues. There's also the webinars on elastic.co and the reference documentation.

Stargateur commented 3 years ago

Thanks a lot, everything is much clearer now; I think your answer will help a lot of people.

The link I found was the top result in my search and looked official; I didn't expect the article to be obsolete. That's totally my bad.

That said, if I understand correctly, this means the body is not valid JSON. Why make life complicated? Just:

[{
    "op": {
        "index": {
            "_id": "1"
        }
    },
    "content": {
        "id": 1,
        "user": "kimchy",
        "post_date": "2009-11-15T00:00:00Z",
        "message": "Trying out Elasticsearch, so far so good?"
    }
}]

You're free to choose whether to use this crate or not. There are many reasons why you might want to though, one of which is that it provides functions for all stable Elasticsearch APIs.

No, I'm forced to. If it were my choice I wouldn't work with ES: rs-es is full of bugs and not async, this crate is still on tokio 0.2, and I didn't find anything else decent in Rust. And I'm not even talking about the problems we have with the server. The only reason we use ES is that our R&D team wants to, for reasons I don't agree with at all.

Sorry, this has to come out. ES making breaking changes in minor versions (AFAIK, https://github.com/benashford/rs-es/issues/148), forcing me to make random patches to a crate I know nothing about, for a tech I don't know much about, where the docs are hard to find; so many problems make me angry. So, after my first try 6-7 months ago, I'm now trying the official client for the second time. And again, for the second time, I want to break everything in my home.

That said, https://docs.rs/elasticsearch/7.10.0-alpha.1/elasticsearch/struct.BulkOperation.html#method.create is probably what I need.
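For reference, a sketch of what that might look like, based on the linked 7.10.0-alpha.1 docs; treat the exact method signatures as an approximation and check them against your version of the crate:

use elasticsearch::{BulkOperation, BulkOperations, BulkParts, Elasticsearch, Error};
use serde_json::json;

async fn bulk_create(client: &Elasticsearch) -> Result<(), Error> {
    let mut ops = BulkOperations::new();
    // create, unlike index, fails if a document with this id already exists
    ops.push(BulkOperation::create("1", json!({
        "id": 1,
        "user": "kimchy",
        "message": "Trying out Elasticsearch, so far so good?"
    })))?;

    client
        .bulk(BulkParts::Index("tweets"))
        .body(vec![ops])
        .send()
        .await?;
    Ok(())
}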

The bulk API sends multiple documents in one HTTP request, so you typically want to keep the overall size of one bulk request reasonable, around 5 MB. If you have lots of documents to index, you may want to send multiple concurrent bulk requests, which is what this issue is about.

How can I know how many bytes my data will take? If I understand correctly, the crate handles the JSON conversion, so "reasonable" is not an acceptable answer for an API. Putting a magic number in my code, like 5000 items per bulk request, without knowing whether my bulk will one day be too big, is sure to blow up eventually. I could maybe look up the average size of my documents in Mongo and estimate my maximum item count from that, but that could also blow up.

russcam commented 3 years ago

Take a look at Tune for indexing speed in the Elasticsearch docs for how to size bulk requests.
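For the byte-size question above, one option is to measure each document's serialized size and cut a chunk when a byte budget is reached, instead of guessing an item count. A sketch; the function is hypothetical and the budget value would just be the earlier rule of thumb, not an API limit:

use serde_json::Value;

// Hypothetical: split documents into chunks of at most `max_bytes` of
// serialized JSON source (each chunk keeps at least one document).
fn chunk_by_bytes(docs: &[Value], max_bytes: usize) -> Vec<Vec<Value>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0;

    for doc in docs {
        // to_vec().len() measures the serialized size of the source line;
        // the operation header line adds a few dozen bytes on top.
        let size = serde_json::to_vec(doc).map(|b| b.len()).unwrap_or(0);
        if !current.is_empty() && current_bytes + size > max_bytes {
            chunks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current_bytes += size;
        current.push(doc.clone());
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}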

Stargateur commented 3 years ago

Seems to work quite nicely:

hackermondev commented 1 year ago

Is there any progress on the original issue? My objective is to stream a File for bulk indexing in Elasticsearch, where the file already contains all the correctly formatted documents to be indexed. It's worth noting that reqwest, which elasticsearch-rs uses internally, supports streaming request bodies (e.g. From<File> for its Body type). However, the current implementation of Body in elasticsearch-rs reads the entire request body into memory before performing the POST request, which is not the desired behavior in my case.
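For comparison, plain reqwest can stream a file as a request body without buffering it all in memory. A sketch that posts NDJSON directly to the _bulk endpoint, bypassing this crate; it assumes reqwest with the "stream" feature plus the tokio and tokio-util crates:

use tokio_util::io::ReaderStream;

async fn stream_bulk_file(url: &str, path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = tokio::fs::File::open(path).await?;
    // Wrap the file in a stream so reqwest sends it chunk by chunk
    // instead of reading the whole body into memory first.
    let body = reqwest::Body::wrap_stream(ReaderStream::new(file));

    let response = reqwest::Client::new()
        .post(format!("{}/_bulk", url))
        .header("content-type", "application/x-ndjson")
        .body(body)
        .send()
        .await?;
    response.error_for_status()?;
    Ok(())
}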

VimCommando commented 1 month ago

Landed here with the same use-case: I want to stream a Vec<serde_json::Value> that has all my pre-formatted lines for the bulk API.