dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License
10.29k stars 1.59k forks source link

Allow some `String` operations to work on substrings. #38539

Open lrhn opened 5 years ago

lrhn commented 5 years ago

The current indexOf and lastIndexOf operations search forward and backwards from a given start position, but there is no way to limit the end position.

If, for example, you want to find two delimiters, then look for something between them:

var tagStart = string.indexOf("<");
var tagEnd = string.indexOf(">", tagStart + 1);
var tagNameEnd = string.indexOf(" ", tagStart + 1);
if (tagNameEnd > tagEnd) tagNameEnd = -1;

Here, I want to parse an HTML tag (bad idea, I know) and look for the space after the tag name, if any. However, a tag like <br> won't have that space, and I'll just search into the potentially very long part of the string after the tag end marker.

If I had a way to delimit the search further, perhaps and optional endIndex parameter for indexOf, I could just write var tagNameEnd = string.indexOf(" ", tagStart, tagEnd) and be sure that I don't get any match of of " " that isn't in the substring string.substring(tagStart, tagEnd).

The lastIndexOf is a little more complicated because the existing optional "start" position is actually the end of the searched area, and not event that, it is the top limit on the start of the string that is found, so the match can extend beyond that index. It might be easier to add a second method lastIndexInRange(start, end, pattern), and we could have indexInRange(start, end, pattern) as well.

Operations that would be nice to have on substring without having to allocate the substring object are:

Many these operations create a new string based on an existing string. If the existing string happens to be a substring of another string, you need to create the intermediate substring in order to do the operation only on the part that you care about. Then the intermediate string is thrown away. It would be more efficient if you could do the operation directly on a substring of an existing string.

Alternatively, and more ambitiously, we could add a slice method to String which returns an object representing a substring, with all the same methods as String. It should probably even implement String. The important part is that it is guaranteed efficient. It does not copy the contents of the original string (unless that's cheaper than allocating an object containing a string pointer and two integers - which can likely be 30 bit positive integers without any issue, so storable in two 32-bit smis or one 64-bit smi).

Then you could do string.slice(0, tagEnd).indexOf(" ", tagStart) to limit your search by tagEnd.

I remember Josh Bloch stating that making the substring operation in Java cheap gave a better API. It meant that you didn't need all these extra indices on operations. You made the API to apply to the entire string, and if users wanted to work on a substring, they simply created a substring, which didn't require copying the original string.

Such a string slice is also nice if you are modifying a string more than once. A user-implemented String.replaceAll function would include a number of substrings of the base string between the replacements. Those intermediate strings are not needed, and using cheap slices would likely be more efficient (this is what the library version does internally, but it can only do that because it can write directly to the string being created). A library String.concat which takes a list of strings, which might be slices, and does an efficient concatenation without unnecessary intermediate allocations, would make such thing achievable by to users.

Compilers should generally be able to inline the object allocation of the slice. Strings are sealed, so it's easily recognizable when you know the type.

The slice operation here would be equivalent to that. JavaScript is simple because most JavaScript implementations already do substring lazily.

eernstg commented 5 years ago

Having an efficient substring version of String sounds really attractive. How much more ambitious would it be than adding all those "end index" elements to the current API? ;-)

rakudrama commented 5 years ago

There are two design points that could work with JavaScript:

  1. The slice type implements String but is otherwise private. In JavaScript it is a String. The String type therefore still works with JS-interop. However, JavaScript's substring is not guaranteed to be constant-cost, and even with heroic effort on behalf of the JavaScript implementation, there might be some operations on some implementations that cause copying (e.g. a substring of a cons-string might be linearized for indexing, leaving the original full string in cons-string form, susceptible to repeated linearization of aliased views).

  2. The slice type is a separate type that does not implement String. Since String is not a supertype, slices cannot be passed to JS-interop. This would be detected statically only when the slice is passed to an API parameter with String type. Often the types are dynamic to capture that the JS code accepts numbers or Strings. (The slice type could have a JavaScript toString method for coercion, but the contexts where this happens are not universal, e.g. as a key to a Map).

The Span prototype for extended grapheme clusters tries to be index-free and addresses some of these issues by creating the 'substring view' as the result of queries. This is potentially a cleaner API since it combines several operations.

The Java implementation that you describe has long been abandoned - a quick look at openjdk8-11 confirms that Java now has a copying substring operation. I think one of the problems is that you can leak huge strings by having only substring references to parts of the string, making effective GC even more difficult.