elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Painless - Request for native String split function #20952

Closed gingerwizard closed 5 years ago

gingerwizard commented 7 years ago

Describe the feature: String.split is not exposed in Painless for performance reasons. The same result can be achieved with a regex, but this requires the user to enable regexes for what is a rather common use case.

Can we have a native split that only supports a char sequence and not a regex?

clintongormley commented 7 years ago

@jdconrad what do you think?

jdconrad commented 7 years ago

I think it's fine to add a native method for this.

eliranmoyal commented 7 years ago

Hey, any news on this? And for now, how can I achieve something similar to "doc['myField'].value.split(' ')[1]" using the regex class? Thanks!

jdconrad commented 7 years ago

@eliranmoyal Apologies, I haven't had time to work on this, yet. @rjernst What do you think the best way to add a String split method would be for this?

rjernst commented 7 years ago

I'm apprehensive about this because I think it will cause confusion if we simply add a split method available for String with different semantics than javadocs state.

nik9000 commented 7 years ago

Quoting @rjernst: "I'm apprehensive about this because I think it will cause confusion if we simply add a split method available for String with different semantics than javadocs state."

Indeed. If we make one I think it'd be a poor choice to call it split because of the one in java. Too confusing.

jdconrad commented 7 years ago

@rjernst @nik9000 Good point about the naming being confusing. Would either of you be in support of having some kind of utils method still? It might be nice to be able to split on a char if regexes are disabled.

nik9000 commented 7 years ago

I'd be fine with something like "foo bar".splitOn(" ")


gingerwizard commented 7 years ago

What about a tokenize method?

jdconrad commented 7 years ago

@gingerwizard Already whitelisted --

class StringTokenizer -> java.util.StringTokenizer extends Enumeration,Object {
  StringTokenizer <init>(String)
  StringTokenizer <init>(String,String)
  StringTokenizer <init>(String,String,boolean)
  int countTokens()
  boolean hasMoreTokens()
  String nextToken()
  String nextToken(String)
}

I know it's not a method, but it is a way to do this without using regexes. Maybe this is the most practical solution to split?

eliranmoyal commented 7 years ago

The tokenizer doesn't work like split when the delimiter has more than one character: the string argument it accepts is a set of single-character delimiters, not one multi-character token. That means "he*y*+bye".split("*+") (treating "*+" as a literal delimiter) results in ["he*y","bye"], while iterating over StringTokenizer("he*y*+bye","*+") results in ["he","y","bye"].
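The difference described above can be reproduced in plain Java. This is a minimal sketch: the class and helper names (`TokenizerVsSplit`, `tokenize`, `splitLiteral`) are made up for illustration, and `Pattern.quote` is used so that `*+` is matched literally rather than parsed as a regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.regex.Pattern;

public class TokenizerVsSplit {
    // Collect every token produced by StringTokenizer. Each character of
    // `delims` is treated as a separate single-character delimiter.
    static List<String> tokenize(String input, String delims) {
        List<String> out = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delims);
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    // Split on the whole delimiter string. Pattern.quote makes regex
    // metacharacters such as '*' and '+' match literally.
    static String[] splitLiteral(String input, String delim) {
        return input.split(Pattern.quote(delim));
    }

    public static void main(String[] args) {
        // Tokenizer: splits on '*' and on '+' individually.
        System.out.println(tokenize("he*y*+bye", "*+"));
        // Split: only the two-character sequence "*+" is a delimiter.
        System.out.println(String.join(" | ", splitLiteral("he*y*+bye", "*+")));
    }
}
```

The tokenizer call yields the three tokens he, y, bye, while the literal split yields he*y and bye.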

caioflores commented 6 years ago

Any update on it? I was trying to use the method today and found this issue.

jdconrad commented 6 years ago

We are currently working on creating a whitelist per script context and will re-address this issue when that work is completed. For now the easiest way to do this is to use regexes.
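For reference, the regex route looks roughly like the following in plain Java. This is only a sketch: the class name `RegexSplitExample` and the helper `secondWord` are hypothetical, and it assumes the goal is the `split(' ')[1]` case asked about earlier in the thread. In a Painless script the same `Pattern.split` call should be usable once regexes are enabled, but that part is an assumption and is not shown here.

```java
import java.util.regex.Pattern;

public class RegexSplitExample {
    // Compile once and reuse; compiling a pattern per call is wasteful.
    private static final Pattern SPACE = Pattern.compile(" ");

    // Return the second space-separated word, or null if there is none.
    static String secondWord(String value) {
        String[] parts = SPACE.split(value);
        return parts.length > 1 ? parts[1] : null;
    }

    public static void main(String[] args) {
        System.out.println(secondWord("foo bar baz"));
        System.out.println(secondWord("single"));
    }
}
```

The length check matters because indexing `[1]` directly throws when the value contains no delimiter.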

karmi commented 6 years ago

Voting for having some kind of split implementation, for something like ...hostname.split('.')[0], to get a subdomain from a full URL, e.g. in aggregations or when transforming payloads in Watcher.

tony0918 commented 6 years ago

Is it possible to apply a regex pattern to a field value to extract a substring?

jdconrad commented 6 years ago

We could still possibly add this as a helper method, as an augmentation.

jdconrad commented 6 years ago

Discussed with @rjernst. We will add an augmented method for String with a signature like String[] splitOnToken(String). This will be usable without requiring regexes to be enabled.
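To make the proposed semantics concrete, here is a minimal Java sketch of a literal (non-regex) split with that signature. It is not the actual Elasticsearch implementation. The class name `SplitOnTokenSketch` is hypothetical, and keeping trailing empty strings and returning the input unchanged for an empty token are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitOnTokenSketch {
    // Split `input` on every literal occurrence of `token`, without regexes.
    static String[] splitOnToken(String input, String token) {
        // Assumption: an empty token cannot split anything.
        if (token.isEmpty()) {
            return new String[] { input };
        }
        List<String> parts = new ArrayList<>();
        int off = 0;
        int next;
        while ((next = input.indexOf(token, off)) != -1) {
            parts.add(input.substring(off, next));
            off = next + token.length(); // skip past the whole token
        }
        // Add the remainder; this is "" when the input ends with the token.
        parts.add(input.substring(off));
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", splitOnToken("he*y*+bye", "*+")));
        System.out.println(splitOnToken("www.elastic.co", ".")[0]);
    }
}
```

Because the token is matched with `indexOf`, characters such as `.` or `*` need no escaping, which covers the hostname use case mentioned above.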

ypid-geberit commented 5 years ago

I stumbled across this and wrote a Painless version of splitOnToken which is based on the String split implementation of openjdk. This can be used in Painless scripts until splitOnToken is released. Hope this helps. I tested it on Elasticsearch 6.2.4.

List splitOnToken(String input, char ch, int limit) {
  /* Painless does not have String split so we reimplement it here.
   * Ref: https://github.com/elastic/elasticsearch/issues/20952
   * Based on: https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/lang/String.java
   * License: GPL-2.0-only
   * It returns an ArrayList instead of a String[] because list.toArray does not seem to be whitelisted in Painless.
   * Example: splitOnToken("gnu.example.org", (char)'.')[0]
   *
   * Note that this code part has been removed to make it behave like Python:
   *
   *  int resultSize = list.size();
   *  if (limit == 0) {
   *    while (resultSize > 0 && list.get(resultSize-1).length() == 0) {
   *      resultSize--;
   *    }
   *  }
   *  return list.subList(0, resultSize);
   *
   * Without this code part at the end, "gnu.example." will return ["gnu", "example", ""] as you would expect.
   * In Java it seems to be more common to do: "gnu.example." -> ["gnu", "example"].
   */
  int off = 0;
  int next = 0;
  boolean limited = limit > 0;
  ArrayList list = new ArrayList();

  while ((next = input.indexOf(ch, off)) != -1) {
    if (!limited || list.size() < limit - 1) {
      list.add(input.substring(off, next));
      off = next + 1;
    } else {
      list.add(input.substring(off, input.length()));
      off = input.length();
      break;
    }
  }

  if (off == 0) {
    list.add(input);
    return list;
  }

  if (!limited || list.size() < limit) {
    list.add(input.substring(off, input.length()));
  }

  return list;
}

List splitOnToken(String input, char ch) {
  return splitOnToken(input, ch, 0);
}

Because the function ends with a plain return list; instead of the trimming step quoted in the code comment, "gnu.example." returns ["gnu", "example", ""], as you would expect from languages like Python. Java's String.split drops trailing empty strings by default: "gnu.example." -> ["gnu", "example"].

Edit: I modified the function to return a more reasonable ["gnu", "example", ""].
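The trailing-empty-string behaviour can be checked in plain Java. This is a small sketch with a hypothetical class name, `TrailingEmptyStrings`. `String.split(regex)` uses a limit of 0, which discards trailing empty strings, while a negative limit keeps them.

```java
public class TrailingEmptyStrings {
    // Default Java behaviour (limit 0): trailing empty strings are removed.
    static String[] splitDefault(String s) {
        return s.split("\\.");
    }

    // Negative limit: trailing empty strings are kept, as in Python.
    static String[] splitKeepTrailing(String s) {
        return s.split("\\.", -1);
    }

    public static void main(String[] args) {
        System.out.println(splitDefault("gnu.example.").length);      // 2
        System.out.println(splitKeepTrailing("gnu.example.").length); // 3
    }
}
```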

Also note that I don't claim any copyright on this modified work, so only the original copyright combined with the license (GPL-2.0-only) is relevant if you want to use the code.