elastic / go-elasticsearch

The official Go client for Elasticsearch
https://github.com/elastic/go-elasticsearch#go-elasticsearch
Apache License 2.0
5.66k stars 616 forks source link

Cannot use forward slash in document ID #52

Closed bpmoneill closed 11 months ago

bpmoneill commented 5 years ago

Issue

When adding a document to Elasticsearch (v7.0.1) using go-elasticsearch on master branch, it is not possible to use the '/' as a separator within the document ID field e.g. title/test.txt.

Sample code:

func example() {
    client, _ := es.NewDefaultClient()
    docID := "title/test.txt"
    req := esapi.IndexRequest{
        Index:      "test",
        DocumentID: docID,
        Body:       strings.NewReader(`{"title" : "test.txt"}"`),
        Refresh:    "true",
    }

    res, _ := req.Do(context.Background(), client)
    defer res.Body.Close()

    if res.IsError() {
        log.Printf("[%s] Error indexing document ID=%s", res.String(), docID)
    }
}

Using curl I can insert the document by manually encoding the '/' as below:

curl -X PUT "localhost:9200/test/_doc/title%2Ftest" -H 'Content-Type: application/json' -d' 
{ 
    "title" : "test.txt" 
} 
' 

Expected

200 OK with document added to Elasticsearch.

Actual

Error produced as below:

[[400 Bad Request] {"error":"no handler found for uri [/test/_doc/title/test.txt?refresh=true] and method [PUT]"}] Error indexing document ID=title/test.txt

Root Cause?

It would seem from looking at the code that the URL is not constructed correctly. Within esapi.request.go it constructs the request as below:

func newRequest(method, path string, body io.Reader) (*http.Request, error) {
    r := http.Request{
        Method:     method,
        URL:        &url.URL{Path: path},
        Proto:      "HTTP/1.1",
        ProtoMajor: 1,
        ProtoMinor: 1,
        Header:     make(http.Header),
    }
  ...

The path, specifically the document ID part, does not get encoded and the URL structure is not initialised correctly, meaning the necessary RawPath field is not set. In order to resolve this issue, the document ID would need to be encoded prior to use and the URL should be initialised using the url.Parse(path) correctly setting Path and RawPath fields.

karmi commented 5 years ago

Hi, thanks for the report!

I wouldn't like to use url.Parse() because I'm worried about performance implications in such a hot path, and, moreover, the newRequest() function is probably too late to do the escaping: that would escape the whole URL, including eg. the slashes which shouldn't be escaped.

I think the DocumentID — along with other string parameters which might contain reserved characters — should be escaped when the path is being constructed in the API functions, using the url.PathEscape method, like in this example. I'll have a look into fixing the code generator.

bpmoneill commented 5 years ago

Yes, definitely, only escape the DocumentID part, sorry my bug report kept referring to the full path. It was in the api.index.go where i initially added some code to escape the DocumentID but then noticed that i also needed to add it to api.delete.go. So maybe there is a common utility function required to be called from each of these api functions to perform the necessary escaping.

karmi commented 5 years ago

Right! The API implementations in the esapi package are generated by the internal/cmd/generate/commands utilities from the Elasticsearch REST specification, so it will need to handle specific path parts — such as the document ID — in a custom way, performing the URL encoding. I was thinking if it makes sense to wrap the url.PathEscape() in a helper function (unexported), but maybe it's better to be fully transparent here, for easier code reading.

bpmoneill commented 5 years ago

Hadn't realised these were generated, nice approach. It would seem possible to add this to the generator and then it will get applied to all and it would be transparent as you say. That said, I am only looking at this from the perspective of escaping DocumentID, other use cases may need to be considered making this a more complex issue to resolve.

karmi commented 5 years ago

Sorry for the long delay here — my hands are a bit tied with internal projects this month.

I was thinking about the core issue a bit more, did some code sketches, and had some discussions around it.

At this point, I'd rather not make a decision either way: I can see value both in the current implementation, where the escaping is offloaded to the underlying Go packages, as well as in changing the code to perform manual escaping and changing the way the request object is constructed in newRequest().

The tricky part of the latter approach seems to be that for it to work, both Path and RawPath fields would have to be set, and although url.Parse() can be used there, I'm worried about the performance implications and would have to carefully benchmark it, which is something I cannot jump into at the moment. Also, I'm questioning whether it's worth the effort, when the current implementation seems to correctly escape path parts, with the exception of the slash. It's a rather heavy decision to go the latter way, since it would affect backwards compatibility: as soon as document IDs are being manually urlencoded, there's no going back, and code written for a previous version would work differently with a new one.

Which brings me to the underlying question about your use case: using a value with a slash (file path, URL, ...) as a document ID. While an argument can be made that the value should be completely opaque to the client, ie. it should encode and decode anything in transport, I'm not sure if, in practice, using such a value for an ID is a good idea.

Consider eg. the "crawler" case. You crawl a website, you store web pages as documents in Elasticsearch. It's tempting to use the URL as the ID — after all, it's guaranteed to be unique, and it has semantics about it, so you can easily eg. cross-reference it from another index. But I think the temptation is misguided: it would be much better to compute eg. a SHA256 checksum of the URL, and use it as the document ID for retrieval and cross-reference. The full URL can be then a property of the document, possibly with different analyzers for different features (search vs aggregations, etc). While Elasticsearch doesn't really care about the value for ID, using something like a checksum has a nice side-effect eg. for all the various utilities which dump an index into a set of plain-text files, where the values would have to be encoded/decoded again otherwise.

So all in all, I'd keep this issue open, so we can revisit it in future, and I would suggest to use a checksum for the file path in your particular use case no matter how the implementation will end up.

engvik commented 5 years ago

Hi,

I've ran into the same issue I guess. I can index with IDs containing a / using .Bulk, but are not able to retrieve them with .Get.

Previously we've built the queries ourselves and used the ElasticSearch's REST API, but in the process of rewriting an app to Go we decided to test the client.

Are there any plans to fix this or should we go back to using the REST API directly?

engvik commented 5 years ago

Actually, after I've slept on it, I think we would have no issues with changing the IDs to not have a slash in them.

karmi commented 5 years ago

[... writing a comment as your last one arrived ... :)]

Hi, it's definitely something to be solved one way or another — I just haven't yet decided in which particular way (see my reasoning above...)

I think that the clients provides a lot of features, compared to using a "bare bones" HTTP approach, so I'm not entirely sure if handling the IDs should be the reason to use one over the other?

(...) I think we would have no issues with changing the IDs to not have a slash in them

Glad to hear that! To be honest, when I was thinking about the possible use cases, I've always came to a conclusion that using some kind of fingerprint of the value (SHA1, ...) is a much more robust approach. But it's entirely possible I'm missing some use case where that isn't true, so very curious about different approaches!

DeadlySurgeon commented 4 years ago

I know that I'm pretty late, but Is this an issue for the other ElasticSearch SDKs? If it isn't and the other SDKs escape the slash (or other nasties), then this should be fixed to keep consistency between languages.

zakk4223 commented 4 years ago

I know it's not an issue with Ruby SDK. I just ran into this issue myself and was surprised I was using the same ids in ruby vs go and getting different results.

I probably should convert to using a not-path as the id though.

technige commented 11 months ago

This is an old issue that has been investigated multiple times, but there exists no simple solution. Suggestion is to use the workaround of escaping the URL yourself.