chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.78k stars 418 forks source link

borrowing strings #11222

Open mppf opened 5 years ago

mppf commented 5 years ago

In the D programming language, the file I/O mechanism has a very fast capability to return strings that point to the I/O system's buffer. Additionally the same technique can apply when getting substrings of existing strings.

This issue asks about the feasibility of such a technique in Chapel.

The General Idea

The idea is that some strings refer to some portion of already-allocated data. Some examples for where this could happen: strings read from a channel's I/O buffer, strings pointing to a memory mapped file, strings that are substrings of existing strings.

Such strings point to some portion of the existing buffer and don't "own" their pointer - that is, they don't free anything when they go out of scope.

For example, let's suppose we have

var buffer = "x" * 1000000;
var first = buffer[1..1];

buffer will store a pointer to some string data consisting of "xxxxxx...". first will store a pointer to that first "x" and have a length of 1.

Null Terminating

The first problem with this approach is that the strings won't be null-terminated. To date, Chapel strings are always null-terminated, to ease debugging and C integration. However this property is not necessary for C integration.

See https://chapel-lang.org/docs/builtins/String.html#String.string.c_str -- here it points out that to correctly get a C string referring to an arbitrary string, one must first call .localize() on that string. What does .localize() do? It creates a copy of the string on the current locale. Additionally note that the c_string type in Chapel refers to const char* in C. These two facts indicate that in the current implementation, string.c_str() can but need not return a pointer that aliases the original string. Instead it is acceptable (for correctness) for it to return a copy of that pointer (as long as that copy is freed appropriately).

As a result, we could modify string.c_str() to:

It's a little bit unclear how to implement this in Chapel without complicating the string type. Some options:

Lifetime Issues

The second problem with this approach is that it might be surprising in some cases that a string might refer to invalid memory. Cases where this might come up include returning substrings or strings read from a temporary channel.

Here the Chapel languages already has two techniques that could be applied:

  1. For arrays, it is possible to have arrays that alias other arrays temporarily, but when they are saved into a variable, a copy is made if necessary to remove this aliasing.
  2. For borrowed class instances, the Chapel compiler includes compile-time checking that makes it an error to return something that shares storage with a local variable.

Either or both of these techniques could be applied here to reduce the chances of introducing memory errors.

mppf commented 3 years ago

In #11158 in spent some time looking at the knucleotide-strings.chpl benchmark. It makes a lot of substrings that could be "borrowed strings" or "string views".

Note also that we could consider having a "borrowed string"/"string view" be a different type. In that event, we might want the substring calls to return a different type, which would be a breaking change. (Of course, we could also create different methods, at a risk of making the API harder to use).

bradcray commented 3 years ago

I believe that users have, at times, expressed desire to borrow from types other than classes, like arrays, which seems thematically similar to borrowing from strings.

mppf commented 3 years ago

Arguably array views already are a form of "array borrow" but strings don't have an equivalent.

bradcray commented 3 years ago

True, though I think users have also wanted other / more general forms of borrowing arrays as well. E.g., declaring a new array that borrows its data from another (e.g., https://github.com/chapel-lang/chapel/issues/18163, https://github.com/chapel-lang/chapel/issues/18164, and other recent user conversations); having an array field within a class/record that borrows its data rather than getting new data (where ref fields are another technique that we sometimes discuss for that).