ashinn / chibi-scheme

Official chibi-scheme repository

Add two new immutable-string procedures #860

Closed johnwcowan closed 2 years ago

johnwcowan commented 2 years ago

Add a procedure immutable-string that creates a new immutable string object from a mutable one. Given (define is (immutable-string s)), the effect is that is gets the immutability bit set (currently used only for literals), and s gets a new copy-on-write bit set, so that if s is mutated, its bytes are copied and replaced first. This allows efficient O(1) creation of immutable strings. The GC would only have to be trivially modified to keep track of multiple string objects pointing at the same byte sequence.

The other useful procedure is, of course, immutable-string?.
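The copy-on-write scheme described above can be sketched language-neutrally. The following Python is a minimal illustration; the names (CowString, freeze, set_byte) are invented for the sketch and are not chibi-scheme API:

```python
# Hypothetical sketch of the proposal: the immutable wrapper and the
# original string share one byte buffer, and the mutable side copies
# its bytes only on the first write after freezing.

class CowString:
    def __init__(self, data: bytearray):
        self.data = data          # shared byte buffer
        self.immutable = False    # the "immutability bit"
        self.cow = False          # the "copy-on-write bit"

    def freeze(self):
        """O(1): return an immutable view sharing the same bytes."""
        frozen = CowString(self.data)   # no copy here
        frozen.immutable = True
        self.cow = True                 # future writes must copy first
        return frozen

    def set_byte(self, i: int, b: int):
        if self.immutable:
            raise TypeError("string is immutable")
        if self.cow:
            self.data = bytearray(self.data)  # copy on first write
            self.cow = False
        self.data[i] = b

s = CowString(bytearray(b"hello"))
frozen = s.freeze()                # O(1), shares bytes with s
s.set_byte(0, ord("j"))            # first write copies the buffer
print(bytes(s.data), bytes(frozen.data))  # the frozen view is untouched
```

The point of the sketch is that freeze itself never copies; the cost is deferred to the first mutation, and strings that are never mutated after freezing never pay it.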

ashinn commented 2 years ago

Strings already use a level of indirection wrapping a normal bytevector, so no GC changes are required (if they were I probably wouldn't consider this). And actually, we already do copy-on-write every time you do a string-set! that changes the utf8 size.

Is this for R7RS large?

johnwcowan commented 2 years ago

Yes. With immutable literals, O(1) string-ref, O(1) conversion, and a way to check what's mutable and what's immutable, we have a good way to escape the String Mess once and for all. All we need is a SRFI 152 lookalike that returns immutable strings, with the exception of the R7RS procedures and string-concatenate(-reverse), which should be compatible with string-append.

I note that you still have cursors despite the O(1) string-ref optional feature; why not just make it the standard thing? 64 characters might be a bit short; I've been pushing the idea to Felix with n=1024.

ashinn commented 2 years ago

I don't really have any intention of making string-ref O(1) by default because it slows down string construction.

lassik commented 2 years ago

Where does the requirement for O(1) string-ref come from? Doesn't it require a vector of integers internally? Assuming strings up to 4 GiB, hence 32-bit indexes, that's four bytes of index per character, i.e. a vector four times the string length.

IMHO string-ref is something of an anti-pattern and shouldn't be guaranteed to be fast. (Instead of random access, most algorithms just walk up or down the string.) I assume string-ref will never be called on the vast majority of Scheme strings.

One trick (that I don't know whether or not anyone is using) is to store the index and byte position of the last looked-up character internally. If the next lookup is near that index, it will be very fast to compute the byte position by walking UTF-8 up or down from the saved byte position.
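A minimal sketch of this trick, assuming a UTF-8 byte buffer. The names (CachedString, byte_pos) are illustrative, not any implementation's actual API:

```python
# Remember the (char index, byte offset) of the previous lookup and
# walk UTF-8 forward or backward from there, so that sequential or
# near-sequential lookups cost O(distance) instead of O(index).

def utf8_char_len(first_byte: int) -> int:
    """Length in bytes of a UTF-8 sequence, given its lead byte."""
    if first_byte < 0x80: return 1
    if first_byte < 0xE0: return 2
    if first_byte < 0xF0: return 3
    return 4

class CachedString:
    def __init__(self, data: bytes):
        self.data = data
        self.last_index = 0   # char index of the last lookup
        self.last_byte = 0    # its byte offset

    def byte_pos(self, index: int) -> int:
        i, b = self.last_index, self.last_byte
        while i < index:                        # walk forward
            b += utf8_char_len(self.data[b])
            i += 1
        while i > index:                        # walk backward
            b -= 1
            while self.data[b] & 0xC0 == 0x80:  # skip continuation bytes
                b -= 1
            i -= 1
        self.last_index, self.last_byte = i, b
        return b

# "a" is 1 byte, "é" is 2, "漢" is 3, "z" is 1:
s = CachedString("aé漢z".encode("utf-8"))
print(s.byte_pos(3))   # jump forward from the start
print(s.byte_pos(1))   # cheap: walks backward from the cache
```

As noted, lookups far from the cached position still degrade to a linear walk; the cache only helps the common sequential access patterns.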

lassik commented 2 years ago

One more point: I assume (but haven't benchmarked) that mandatory lookup tables are not very CPU-cache-friendly and may slow down GC.

mnieper commented 2 years ago

> I don't really have any intention of making string-ref O(1) by default because it slows down string construction.

By what factor is string construction being slowed down? Is there a fundamental reason why the slowdown is noticeable?

> Where does the requirement for O(1) string-ref come from? Doesn't it require a vector of integers internally? Assuming max 4 GiB strings, hence 32-bit indexes, that's a vector 4 times the string length.

You wouldn't store every position, just every nth (with n being 64, 1024, whatever).

Alternatively, if the requirement were lowered to O(log n), you could do a binary search over the string that would have to have some synchronizing marker embedded at every nth position.
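The every-nth-position table can be sketched as follows; k and the names (IndexedString, byte_pos) are illustrative, and the construction-time pass is exactly the cost being debated:

```python
# Record the byte offset of every k-th character at construction time,
# then answer a character-index lookup by jumping to the nearest
# recorded offset and walking at most k-1 characters of UTF-8.

def utf8_char_len(first_byte: int) -> int:
    """Length in bytes of a UTF-8 sequence, given its lead byte."""
    if first_byte < 0x80: return 1
    if first_byte < 0xE0: return 2
    if first_byte < 0xF0: return 3
    return 4

class IndexedString:
    def __init__(self, data: bytes, k: int = 64):
        self.data, self.k = data, k
        self.offsets = []          # byte offsets of chars 0, k, 2k, ...
        b = i = 0
        while b < len(data):       # this pass is the construction cost
            if i % k == 0:
                self.offsets.append(b)
            b += utf8_char_len(data[b])
            i += 1

    def byte_pos(self, index: int) -> int:
        b = self.offsets[index // self.k]   # O(1) jump to the block
        for _ in range(index % self.k):     # at most k-1 decode steps
            b += utf8_char_len(self.data[b])
        return b
```

With this layout a lookup is O(k) worst case, and the table adds roughly one integer per k characters, so k trades construction time and space against lookup speed.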

> IMHO string-ref is something of an anti-pattern and shouldn't be guaranteed to be fast. (Instead of random access, most algorithms just walk up or down the string.) I assume string-ref will never be called on the vast majority of Scheme strings.

Using string-ref instead of cursors means one API less (and makes cursors implementable in terms of more primitive procedures). It helps to reduce the number of accidentally quadratic algorithms. With cursors, one also has to make sure that using invalid cursors can be cheaply detected; with indexes, this is just one comparison against the string length.

> One trick (that I don't know whether or not anyone is using) is to store the index and byte position of the last looked-up character internally. If the next lookup is near that index, it will be very fast to compute the byte position by walking UTF-8 up or down from the saved byte position.

This is fine for algorithms that only need one finger into the string. This approach also needs some locking on multithreaded systems, even when otherwise only (usually lock-free) atomic reads are required.

> One more point - I assume (but haven't benchmarked) that mandatory lookup tables are not very CPU cache friendly, and may slow down GC.

I don't understand how this is connected to the GC. And can you explain your assumption about cache-friendliness?

ashinn commented 2 years ago

There is already a compile-time option for storing every nth position. There is also an open PR from @dpk for caching the last index. The problem with the former is string construction time. The problem with the latter is a false sense of security - you think string-ref is O(1) but in pathological cases it isn't.

The construction overhead bothers me a lot because one of the few things Chibi does reasonably fast is I/O. I'm trying to shave time off of read-line, not add to it.

I don't see any "mess" with strings because both cursor APIs and index APIs are equivalent yet low-level. There's no reason to go out of our way to make the low-level string-ref fast. We should be providing good high-level APIs. SRFI 13 (minus indexes) functional string programming, string ports, and expandable buffer/text data structures are all good candidates which don't need to use string-ref under the hood.

lassik commented 2 years ago

> You wouldn't store every position, just every nth (with n being 64, 1024, whatever).

Right, that makes much more sense.

Good points on the rest.

> I don't understand how it is connected to GC. And can you explain your assumption about cache-friendliness?

I mistakenly thought the intent was to cache every single byte position.

> With cursors, one also has to make sure that using invalid cursors can be cheaply detected

String ports are a kind of cursor that is mandated by the standard. Though we don't have a read-char-backward procedure.

> The problem with the latter is a false sense of security - you think string-ref is O(1) but in pathological cases it isn't.

+1. O(1) random-access Unicode algorithms seem like an over-specialized thing to support in a language standard.

I skimmed R6RS and R7RS-small for efficiency guarantees, and the only thing I found is:

> A vector typically occupies less space than a list of the same length, and the average time needed to access a randomly chosen element is typically less for the vector than for the list.

That seems reasonable.

ashinn commented 2 years ago

This is trivial and barely touches the core, so I added it to (chibi ast). Note that immutable? was already available.