Open lrhn opened 5 years ago
Having an efficient substring version of String
sounds really attractive. How much more ambitious would it be than adding all those "end index" elements to the current API? ;-)
There are two design points that could work with JavaScript:
The slice type implements String
but is otherwise private. In JavaScript it is a String. The String type therefore still works with JS-interop. However, JavaScript's substring
is not guaranteed to be constant-cost, and even with heroic effort on behalf of the JavaScript implementation, there might be some operations on some implementations that cause copying (e.g. a substring of a cons-string might be linearized for indexing, leaving the original full string in cons-string form, susceptible to repeated linearization of aliased views).
The slice type is a separate type that does not implement String
. Since String is not a supertype, slices cannot be passed to JS-interop. This would be detected statically only when the slice is passed to an API parameter with String type. Often the types are dynamic
to capture that the JS code accepts numbers or Strings. (The slice type could have a JavaScript toString method for coercion, but the contexts where this happens are not universal, e.g. as a key to a Map
).
The Span
prototype for extended grapheme clusters tries to be index-free and addresses some of these issues by creating the 'substring view' as the result of queries. This is potentially a cleaner API since it combines several operations.
The Java implementation that you describe has long been abandoned - a quick look at openjdk8-11 confirms that Java now has a copying substring operation. I think one of the problems is that you can leak huge strings by having only substring references to parts of the string, making effective GC even more difficult.
The current
indexOf
andlastIndexOf
operations search forward and backwards from a given start position, but there is no way to limit the end position.If, for example, you want to find two delimiters, then look for something between them:
Here, I want to parse an HTML tag (bad idea, I know) and look for the space after the tag name, if any. However, a tag like
<br>
won't have that space, and I'll just search into the potentially very long part of the string after the tag end marker.If I had a way to delimit the search further, perhaps and optional
endIndex
parameter forindexOf
, I could just writevar tagNameEnd = string.indexOf(" ", tagStart, tagEnd)
and be sure that I don't get any match of of" "
that isn't in the substringstring.substring(tagStart, tagEnd)
.The
lastIndexOf
is a little more complicated because the existing optional "start" position is actually the end of the searched area, and not event that, it is the top limit on the start of the string that is found, so the match can extend beyond that index. It might be easier to add a second methodlastIndexInRange(start, end, pattern)
, and we could haveindexInRange(start, end, pattern)
as well.Operations that would be nice to have on substring without having to allocate the substring object are:
indexOf
,lastIndexOf
startsWith
,endsWith
(easily worked around, but would be convenient)replaceAll
,replaceAllMapped
split
,splitMapJoin
toUpperCase
,toLowerCase
trim
,trimLeft
,trimRight
,padLeft
,padRight
.int.parse
,double.parse
, ... any other.parse
Many these operations create a new string based on an existing string. If the existing string happens to be a substring of another string, you need to create the intermediate substring in order to do the operation only on the part that you care about. Then the intermediate string is thrown away. It would be more efficient if you could do the operation directly on a substring of an existing string.
Alternatively, and more ambitiously, we could add a
slice
method toString
which returns an object representing a substring, with all the same methods asString
. It should probably even implementString
. The important part is that it is guaranteed efficient. It does not copy the contents of the original string (unless that's cheaper than allocating an object containing a string pointer and two integers - which can likely be 30 bit positive integers without any issue, so storable in two 32-bit smis or one 64-bit smi).Then you could do
string.slice(0, tagEnd).indexOf(" ", tagStart)
to limit your search bytagEnd
.I remember Josh Bloch stating that making the
substring
operation in Java cheap gave a better API. It meant that you didn't need all these extra indices on operations. You made the API to apply to the entire string, and if users wanted to work on a substring, they simply created a substring, which didn't require copying the original string.Such a string slice is also nice if you are modifying a string more than once. A user-implemented
String.replaceAll
function would include a number of substrings of the base string between the replacements. Those intermediate strings are not needed, and using cheap slices would likely be more efficient (this is what the library version does internally, but it can only do that because it can write directly to the string being created). A libraryString.concat
which takes a list of strings, which might be slices, and does an efficient concatenation without unnecessary intermediate allocations, would make such thing achievable by to users.Compilers should generally be able to inline the object allocation of the slice. Strings are sealed, so it's easily recognizable when you know the type.
The
slice
operation here would be equivalent to that. JavaScript is simple because most JavaScript implementations already do substring lazily.