elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

uri_parts ingest processor should not URLdecode path and query #112904

Open julthomas opened 1 month ago

julthomas commented 1 month ago

Elasticsearch Version

8.15.0

Installed Plugins

No response

Java Version

bundled

OS Version

RockyLinux 9

Problem Description

URL-encoded characters in the query and path properties should not be decoded, because decoding alters the original URL.

One could argue that decoding the path is more open to debate. However, as a rule, it should be possible to recreate the original URL by reassembling the parts the processor produces. For consistency, it is also a bad idea to apply different reasoning to different parts of the URL. Currently the processor mixes parsing and URL-decoding, whereas its goal is to parse (extract the parts). Decoding percent-encoded sequences in the query may yield characters that are illegal there unescaped (for example & or =), which makes proper decoding of the query parameters impossible afterwards.

I believe the getRaw* methods of java.net.URI, such as getRawPath() and getRawQuery() (https://docs.oracle.com/javase/8/docs/api/java/net/URI.html), should be used in main/java/org/elasticsearch/ingest/common/UriPartsProcessor.java.
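To illustrate the difference, here is a minimal standalone sketch using only java.net.URI. The class name UriRawPartsDemo and its helper methods are hypothetical and are not taken from UriPartsProcessor.java; the sketch only compares the decoding getters with their getRaw* counterparts on the two sample URLs from this report.

```java
import java.net.URI;

// Hypothetical demo class, not part of Elasticsearch.
class UriRawPartsDemo {
    // Decoding getters: percent-encoded octets are decoded.
    static String decodedQuery(String url) { return URI.create(url).getQuery(); }
    static String decodedPath(String url)  { return URI.create(url).getPath(); }

    // Raw getters: the component is returned exactly as written in the URL.
    static String rawQuery(String url) { return URI.create(url).getRawQuery(); }
    static String rawPath(String url)  { return URI.create(url).getRawPath(); }

    public static void main(String[] args) {
        String sample1 = "http://www.acme.com/some/thing?a=123&b=x%26c%3dy";
        String sample2 = "http://www.acme.com/some%2fthing";

        System.out.println(decodedQuery(sample1)); // a=123&b=x&c=y
        System.out.println(rawQuery(sample1));     // a=123&b=x%26c%3dy
        System.out.println(decodedPath(sample2));  // /some/thing
        System.out.println(rawPath(sample2));      // /some%2fthing
    }
}
```

Whether the processor should switch every component to its raw form, or expose both forms, is a design decision for the maintainers; the sketch only shows that the raw values are available from the same URI object.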

Sample 1

Input URL: http://www.acme.com/some/thing?a=123&b=x%26c%3dy
Should give query: a=123&b=x%26c%3dy (which decodes to a = 123, b = x&c=y)
But currently gives query: a=123&b=x&c=y (which would decode to a = 123, b = x, c = y)
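The effect on downstream parameter parsing can be sketched as follows. This is an illustrative parser, not code from Elasticsearch: the class QueryParamsDemo and its parse method are hypothetical. It splits on & and = first and only then percent-decodes each key and value with java.net.URLDecoder (which also turns + into a space).

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, not part of Elasticsearch.
class QueryParamsDemo {
    // Split the query on '&' and '=' first, then decode each key and value.
    static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(key, StandardCharsets.UTF_8),
                       URLDecoder.decode(value, StandardCharsets.UTF_8));
        }
        return params;
    }

    public static void main(String[] args) {
        // Raw query, as it appears in the URL: two parameters.
        System.out.println(parse("a=123&b=x%26c%3dy")); // {a=123, b=x&c=y}
        // Query as currently emitted by the processor: three parameters.
        System.out.println(parse("a=123&b=x&c=y"));     // {a=123, b=x, c=y}
    }
}
```

Once the query has been decoded as a whole, the boundary between parameters is lost and no later step can recover the original value of b.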

Sample 2

Input URL: http://www.acme.com/some%2fthing
Should give path: /some%2fthing
Currently gives path: /some/thing
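The round-trip rule from the problem description (reassembling the parts should give back the original URL) can be checked with a small sketch. The class UriRoundTripDemo and its methods are hypothetical, and the reassembly is simplified: it assumes a URL with scheme and authority and ignores the fragment.

```java
import java.net.URI;

// Hypothetical demo class, not part of Elasticsearch.
class UriRoundTripDemo {
    // Simplified reassembly: scheme://authority + path + optional ?query.
    static String reassemble(String scheme, String authority, String path, String query) {
        StringBuilder sb = new StringBuilder();
        sb.append(scheme).append("://").append(authority).append(path);
        if (query != null) {
            sb.append('?').append(query);
        }
        return sb.toString();
    }

    // Rebuild the URL from the raw (still percent-encoded) components.
    static String fromRawParts(String url) {
        URI u = URI.create(url);
        return reassemble(u.getScheme(), u.getRawAuthority(), u.getRawPath(), u.getRawQuery());
    }

    // Rebuild the URL from the decoded components.
    static String fromDecodedParts(String url) {
        URI u = URI.create(url);
        return reassemble(u.getScheme(), u.getAuthority(), u.getPath(), u.getQuery());
    }

    public static void main(String[] args) {
        String sample2 = "http://www.acme.com/some%2fthing";
        System.out.println(fromRawParts(sample2));     // http://www.acme.com/some%2fthing
        System.out.println(fromDecodedParts(sample2)); // http://www.acme.com/some/thing
    }
}
```

Only the raw components reproduce the input; the decoded components yield a different URL for both samples.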

Steps to Reproduce

- uri_parts:
    if: 'ctx.url?.original != null'
    field: url.original
    target_field: url
    keep_original: true

With inputs:

{"url":{"original":"http://www.acme.com/some/thing?a=123&b=x%26c%3dy"}}
{"url":{"original":"http://www.acme.com/some%2fthing"}}

Logs (if relevant)

No response

elasticsearchmachine commented 1 month ago

Pinging @elastic/es-data-management (Team:Data Management)