Roadmap to 1.0 - GitHub Issues

cessen commented 6 years ago

I plan for Ropey to reach 1.0 at some point. I don't want it to be a library that is perpetually stuck in beta. This is the list of things that need to happen before I declare 1.0.

- [x] Remove grapheme support. After working with Ropey for a bit, it seems clear that this should be handled on top of Ropey, not inside it.
- [x] Verify that grapheme support (iterators, finding prev/next grapheme boundary, etc.) can be implemented sanely and efficiently on top of Ropey. Probably do this as an example in the repo.
- [x] Decide on conventions for converting between line endings and chars/bytes (see issue #11).
- [x] Implement non-trivial text editor functionality using Ropey as the backing text buffer, to validate Ropey's design and APIs:
  - [x] Text search (done: see the search-and-replace example in the repo)
  - [x] Syntax highlighting (done: see Smith editor)
  - [x] Word wrapping (done: see Led editor)
  - [x] Basic undo/redo (done: see Led editor)
  - [x] Loading/saving text files with non-utf8 text encodings (done: see the latin1 example in the repo)
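For readers wondering what the non-utf8 item involves: Latin-1 is the simplest case, because every byte value is also the Unicode code point of the same character. The sketch below is an illustration only, not the code from the repo's latin1 example; the function names `latin1_to_string` and `string_to_latin1` are made up here. The decoded `String` could then be handed to a text buffer such as a `Rope`.

```rust
/// Decode Latin-1 (ISO-8859-1) bytes into a UTF-8 `String`.
/// Every Latin-1 byte maps to the Unicode scalar value with the same number,
/// so a plain `u8 as char` cast is a correct decoder.
fn latin1_to_string(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Encode text back to Latin-1.
/// Returns `None` if any char lies outside U+0000..=U+00FF and therefore
/// cannot be represented in Latin-1.
fn string_to_latin1(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let code = c as u32;
            if code <= 0xFF {
                Some(code as u8)
            } else {
                None
            }
        })
        .collect()
}
```

Saving is the lossy direction: the caller has to decide what to do when the buffer contains a character Latin-1 cannot hold, which is why the encoder returns an `Option` here.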

Lokathor commented 6 years ago

So, I tried to do the editor thing on the weekend. Things went okay for a first attempt, and I filed one issue that quickly came up as you saw.

My use of Ropey for that, and thinking about how I might want to use it in another project, quickly revealed another thing: how do you split by words to do word-wrapping?

In std you can use split_whitespace() and split(char) to split up a &str into an iterator over the splits. Ropey doesn't seem to have a similar ability shown in the docs of 0.6.3, despite the check box above being filled in.

There's stuff in the linked issue about a grapheme cluster deal, but I am not trained in Unicode so I do not know what that means. If the API for Ropey wants to use them as the main splitting concept that's fine, but the readme should say in big bold letters what that is and how it relates to the char data type that we already use. For that matter, how it relates to a C char, a u8, and a u16, all of which I'm sure are vital to know about for FFI passing on Linux and Windows.

Basically, you gotta make the case for using this data type over Vec<Vec<char>> or something else that's "stupid but easy to understand".

cessen commented 6 years ago

Ah, apologies for the confusion. The "custom text segmentation" check box is referring specifically to custom grapheme clusters, and has nothing to do with what you're trying to accomplish. I'll fix the label to avoid similar confusion in the future. You can probably ignore graphemes for now, though you'll need to learn about them if you want your editor to reliably behave correctly outside of the ASCII range of characters (or handle Windows-style line endings correctly).

Splitting by white space, word wrapping, etc. are higher-level than Ropey is intended to be. The expectation is for Ropey to be a relatively low-level component of an editor, and for things like word splitting, text searching, syntax highlighting, etc. to be implemented on top of it. As an example, my toy editor Led uses Ropey, and implements word wrapping on top of it.

The best way to do word splitting with Ropey right now is to use one of the iterators (probably Chars) to find the breaks. You can build your own iterator on top of Chars that yields a word at a time, to make things more convenient for the rest of your code. That's essentially what I do in Led.
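A minimal sketch of that idea, not the code Led actually uses. The names `Words` and `words` are hypothetical. The adapter is generic over any iterator that yields `char`, so with Ropey it could be fed `rope.chars()`, while here it is kept std-only so it runs on its own. The separator test is a closure, since what counts as a word break depends on the application.

```rust
/// Adapter that groups a stream of chars into words.
/// A "word" is a maximal run of chars for which `is_separator` is false.
struct Words<I, F> {
    chars: I,
    is_separator: F,
}

impl<I, F> Iterator for Words<I, F>
where
    I: Iterator<Item = char>,
    F: Fn(char) -> bool,
{
    type Item = String;

    fn next(&mut self) -> Option<String> {
        let mut word = String::new();
        while let Some(c) = self.chars.next() {
            if (self.is_separator)(c) {
                // A separator ends the current word, if one has started.
                // Otherwise it is a leading separator and is skipped.
                if !word.is_empty() {
                    return Some(word);
                }
            } else {
                word.push(c);
            }
        }
        // Input exhausted: yield the trailing word, if any.
        if word.is_empty() {
            None
        } else {
            Some(word)
        }
    }
}

/// Convenience constructor that splits on Unicode whitespace.
fn words<I: Iterator<Item = char>>(chars: I) -> Words<I, fn(char) -> bool> {
    Words {
        chars,
        is_separator: char::is_whitespace as fn(char) -> bool,
    }
}
```

This version allocates a `String` per word to stay short. For word wrapping, yielding char index ranges instead would avoid the allocation and map more directly onto slicing the underlying buffer.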

In the future, using a regex search iterator could also be a reasonable way to go about it. Unfortunately, Rust's regex crate currently only supports searching text that is completely contiguous in memory, so it can't be used with Ropey. I'm currently working on a fork of the regex crate to address that issue, because this is a common problem for pretty much all editable text buffer representations, not just Ropey.

> Basically, you gotta make the case for using this data type over Vec<Vec<char>> or something else that's "stupid but easy to understand".

I think if someone can get away with Vec<Vec<char>>, Vec<String>, or something like that, then they probably aren't in the target use-case for Ropey anyway. Nevertheless, I'm sympathetic to your frustrations: text handling is an unexpectedly deep rabbit hole to dive down. Explaining Unicode is beyond the scope of Ropey's documentation, but I do at least try to link to appropriate external documentation when relevant. Are there particular parts of the documentation that are tripping you up, where I could do a better job of leading the reader to appropriate outside documentation?

Lokathor commented 6 years ago

So a char is like a thing you can (probably?) show, and a grapheme is 1 or more char values chained into a single thing you show, usually 1 "normal" character and 0 or more "accent" characters. Except that also some char values are the whole thing (base + accent) as a single character (like 'ĝ' and 'ŝ', or whatever, I guess). -- This is my understanding of things. I guess add to that if there's something I'm missing. If that's true I'd say throw that explanation somewhere in the docs and then link to the "full explanation" with a "see more" link.

> I think if someone can get away with Vec<Vec<char>>, Vec<String>, or something like that, then they probably aren't in the target use-case for Ropey anyway.

What is the use case? Because you lose access to a lot of the Rust ecosystem by using it instead of the normal types. I don't want to be a jerk or anything, but ya really gotta make that case for why someone should do that and lose use of all the crates that do things to and with String, &str, and Cow<str> and all those "normal" string types. The README.md says "Ropey also ensures that grapheme clusters are never split in its internal representation, and thus can always be accessed as &str slices.", so at first it might seem that you're not really losing out, but I don't actually see a Deref impl or other way to get a &str out of a Rope or a RopeSlice, so how do you access it as a &str slice?

Like, if an editor is using Vec as their buffer type, or even just one huge String value, what are they "getting away with" by using that format? What's worse about it? Right now all I know are that I wanted to make an editor like nano using curses+ropey, and curses did fine, but ropey fell over.

cessen commented 6 years ago

First off:

> I don't want to be a jerk or anything

No worries! I'm eager for feedback on this crate, so I really appreciate you taking the time to elaborate on your experience and perspective.

> So a char is like a thing you can (probably?) show, and a grapheme is 1 or more char values chained into a single thing you show, usually 1 "normal" character and 0 or more "accent" characters.

That's pretty much correct. Basically, a grapheme is what a human would think of as a single character. Very often graphemes are a single char, but not always.

The simplest example of a multi-char grapheme is probably Windows-style line endings. In Windows, a single line ending is typically represented with two chars placed next to each other: CR and LF. If you treat them as two individual characters, your editor will have some strange behaviors such as users having to hit backspace twice to remove a line break.
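To make the backspace case concrete, here is a small sketch that treats CRLF as one unit. It handles only that single case; real grapheme segmentation follows the Unicode rules and needs far more than this. The function name `backspace_len` and the slice-of-chars input are assumptions made for the illustration.

```rust
/// How many chars one press of backspace should delete, given the chars
/// that sit before the cursor. Only the CRLF pair is treated as a
/// multi-char unit; everything else is deleted one char at a time.
fn backspace_len(before_cursor: &[char]) -> usize {
    match before_cursor {
        // Nothing to delete at the start of the buffer.
        [] => 0,
        // A Windows-style line ending is removed as a whole.
        [.., '\r', '\n'] => 2,
        // Any other char, including a lone '\n' or '\r'.
        _ => 1,
    }
}
```

An editor that always deleted one char would leave a stray CR behind after the first backspace, which is the odd behavior described above.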

You mention accented characters, which are another great example, but they're only one of several. (Also as you note, the most common accented characters also have single-char representations.)
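The two representations of an accented character can be compared directly with the standard library. The sketch below uses 'ĝ' from earlier in the thread; the constant and function names are made up for the example. It also reports UTF-8 and UTF-16 lengths, since those are the units that matter when passing text across an FFI boundary as u8 or u16 buffers.

```rust
/// 'ĝ' as one precomposed char: U+011D.
const PRECOMPOSED: &str = "\u{011D}";
/// The same visible character as two chars: 'g' followed by
/// U+0302 COMBINING CIRCUMFLEX ACCENT.
const DECOMPOSED: &str = "g\u{0302}";

/// Returns (number of chars, UTF-8 bytes, UTF-16 code units) for `s`.
fn sizes(s: &str) -> (usize, usize, usize) {
    (s.chars().count(), s.len(), s.encode_utf16().count())
}
```

`sizes(PRECOMPOSED)` is `(1, 2, 1)` and `sizes(DECOMPOSED)` is `(2, 3, 2)`, yet both are a single grapheme to a reader. The two strings also compare as unequal unless they are normalized first.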

I think what I might do is put a separate markdown file in the repo with a quick introduction to some of these concepts, and direct people to it in the readme. I'm also thinking I'll remove mentions of graphemes from the readme, since it seems like it may just confuse people without providing more context.

Nevertheless, Ropey is targeted at people who mostly are already familiar with these concepts. And I don't mean that in an elitist way, I mean that in a "You naturally end up stumbling into these things when trying to build a unicode-capable text editor anyway," kind of way.

> What is the use case?

It's meant to be the low-level text buffer representation for things like text editors.

I'll approach this from two different angles.

First: I think part of your frustration may come from a marketing problem on my side. I probably need to make it clearer where Ropey "slots in", so to speak. I really think your idea of what Ropey is intended to be is higher level than Ropey actually is. What you're looking for (I think) is something that would be built on top of something like Ropey.

If you take a look at other text buffer crates in the Rust ecosystem such as xi-rope, you'll see that they also have a similar level of API (give or take a little). And that's because this actually is a level at which it makes sense to solve a problem. Having said that, the scribe crate takes it further and builds more functionality on top of its buffer, so that may be worth looking into if you want something a little higher-level.

Second: I'm definitely open to adding a customizable iterator for iterating over split text, but I don't want to add anything like a Words or SplitWhitespace iterator, because in Ropey's target use-case those would actually be fairly niche. The way you define "word" really depends on context (e.g. Rust vs English, or syntax highlighting vs word wrapping), so I don't want to bake such a decision into Ropey since it would largely be ignored by most users of the library anyway. It also might be a little misleading ("Oh, I can use this Words iterator and everything will just work.").

This is in contrast to e.g. line breaks which are both very well defined and extremely useful to be able to directly index into, hence their being featured prominently in Ropey's APIs.
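As an illustration of what indexing by line means, here is a flat line-start table with lookups in both directions. The method names mirror Ropey's `line_to_char` and `char_to_line`, but the sketch says nothing about how Ropey implements them. It assumes '\n' is the only line break, and the `LineIndex` type is hypothetical.

```rust
/// Char offsets at which each line starts.
/// Assumption for this sketch: only '\n' ends a line.
struct LineIndex {
    starts: Vec<usize>,
}

impl LineIndex {
    fn new(text: &str) -> Self {
        // Line 0 always starts at char offset 0.
        let mut starts = vec![0];
        for (i, c) in text.chars().enumerate() {
            if c == '\n' {
                // The next line begins right after the line break.
                starts.push(i + 1);
            }
        }
        LineIndex { starts }
    }

    /// Char offset of the first char of `line` (0-based), if it exists.
    fn line_to_char(&self, line: usize) -> Option<usize> {
        self.starts.get(line).copied()
    }

    /// 0-based line that contains char offset `pos`.
    fn char_to_line(&self, pos: usize) -> usize {
        // Binary search: count the line starts that are <= pos.
        // starts[0] == 0, so the count is at least 1.
        self.starts.partition_point(|&s| s <= pos) - 1
    }
}
```

A flat table like this has to be rebuilt or shifted after every edit, which costs time proportional to the document length. Keeping that bookkeeping cheap under edits is part of what a rope-style buffer is for.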

> so how do you access it as a &str slice?

I think I'm over-hyping that in the readme. It's actually not that big a deal, but I was excited about it because it's a problem I had run into before and wanted to solve. In any case, the answer is: through the Graphemes iterator.

The motivation is that since a single printable character may actually be made up of multiple chars, you want to print a grapheme at a time rather than a char at a time. So the lack of grapheme splitting allows the Graphemes iterator (and Chunks iterator, for that matter) to "just work", returning copy-less/allocation-less slices of utf8 text, which is easier for things on the display side to process than a RopeSlice.

cessen commented 6 years ago

> Like, if an editor is using Vec as their buffer type, or even just one huge String value, what are they "getting away with" by using that format? What's worse about it?

Performance, by a very large margin. Especially as a text document grows, using something like a String or even Vec<String> will start to choke on large documents. For simple toy editors working on small documents, those can be fine, but for anything that's meant to be a robust editor it would be a very poor choice.

Ropey (and other text buffers like it), on the other hand, can scale to text files in the gigabytes and beyond without breaking a sweat. Not that anyone typically edits files that large, but just to illustrate extremes.
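A rough sketch of where the cost comes from with a single `String` buffer. The helper name `insert_at_char` is made up. Both steps do work proportional to the size of the document, and they run on every keystroke.

```rust
/// Insert `text` at char index `char_idx` in a flat `String` buffer.
/// An index past the end appends.
fn insert_at_char(buf: &mut String, char_idx: usize, text: &str) {
    // Step 1: translate the char index to a byte offset. UTF-8 is
    // variable-width, so this scans from the start of the buffer.
    let byte_idx = buf
        .char_indices()
        .nth(char_idx)
        .map(|(i, _)| i)
        .unwrap_or(buf.len());
    // Step 2: `insert_str` shifts every byte after `byte_idx` to make room,
    // and may reallocate the whole buffer.
    buf.insert_str(byte_idx, text);
}
```

On a small file neither step is noticeable. On a document of hundreds of megabytes, typing near the top moves nearly the whole buffer for each character.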

Lokathor commented 6 years ago

So, once you have a &str, you just iterate the characters in that and throw each one into the output buffer, and then the zero-width parts will all stack up with the normal parts and combine into a single display update?

cessen commented 6 years ago

That's not quite the intent. You can do that just as easily with the Chars iterator.

An example is using the draw() function of Led's screen API. I can pass it a full grapheme as a &str slice, and it will draw the resulting character at the given coordinates on screen.

It mostly just makes things more convenient when rendering text. There's no deep truth in it or anything, and you can accomplish the same things with a char iterator in theory, but with some more leg work.

Honestly, this discussion is making me wonder if I shouldn't just remove the grapheme related stuff entirely, and make it the responsibility of the client code to handle it. It would simplify the API quite a bit and leave things a bit more flexible for the client code, even if it would be a bit more work on that side. And adding a custom split iterator to Ropey could make the iterating-over-graphemes use-case more convenient again. Hmm...

Lokathor commented 6 years ago

Well, so, I keep drilling down on it because the editor that I intended to make was going to be curses based so that it could run cleanly on my rpi3 in both X mode or over SSH. Obviously Curses (and thus anything based on it) is a character based API. You can output a character or not, and you can read a key_typed event or not (which might produce 1 character, or it might be some function key with no associated character). You can also move the cursor and set colors and crap, but basically that's your whole IO flow, read characters (technically wchar_t) and write them 1 at a time. There's printw available which might seem like "the way to print a whole string at once", but all that really does internally is allocate a CString to pass to curses, and then curses iterates over the CString it got with the normal printing, so printw basically only exists to cause an extra intermediate allocation.

So if the &str produced as a grapheme cluster is 1 char long that's cool and simple. If it's more than 1 char long... what is curses expected to even do at that point? You can iterate the chars of the string, but that seems to defeat the point in the first place.

As you say, the answer might actually just be "the grapheme part of the API is useless to curses-based programs".

That seems to kinda maybe be the answer based on the actual definition of draw that you're using. The difference being that you're handling all the unicode interpretation stuff there and curses would basically be throwing bytes at the terminal driver and relying on the user's local system encoding to do the heavy lifting.

cessen commented 6 years ago

I'm not really familiar with how curses works, but if it prints one wchar_t at a time, then that suggests to me that it doesn't know how to handle graphemes at all. But I'm not really sure.

> [...] throwing bytes at the terminal driver and relying on the user's local system encoding to do the heavy lifting.

Yeah, I'm relying on the underlying terminal being utf8-based and knowing how to render graphemes properly. For a GUI-based editor, I would need to handle grapheme-based glyph lookups myself and render those, which is also a bit more straightforward if I can just get a &str slice for each grapheme.

But the more I'm thinking about this, the more I'm nearing the conclusion that the low-level text buffer is probably the wrong level at which to handle grapheme segmentation. The hoops I've jumped through to make it customizable kind of suggest that as well, and I don't like how it complicates the API. I'd like Ropey to have a fairly small API surface area if possible, one that is as easy to understand as possible.

The important thing is to make sure that handling graphemes efficiently is still possible for higher-level code that uses Ropey. And that's the sort of thing that motivates the "Implement a non-trivial text editor" check box on this issue: make sure things work well in practice.

Lokathor commented 6 years ago

Oh, it absolutely doesn't. Curses started in 1980, far before any of the more interesting text handling concepts began to be codified. These days there's upgrades, such as versions using wchar_t at least.

However, it's still how a lot of the world thinks of text unfortunately. Using pancurses/easycurses is also about the only way (in any language ever) to get a terminal-ish editor that works across both win32 and *nix systems. I mean you can use termion or something to "bypass" the normal curses limits, but then you don't work on win32, which is basically a bad plan. So curses, or maybe something from scratch using OpenGL, is about the only way to go.

cessen commented 6 years ago

Ah, that makes sense. I feel like another way to go about it would be to use some kind of terminal detection, and build your drawing calls to print in the appropriate way based on that. But at that point you're basically building your own terminal library. (Which isn't the worst idea, but is also a lot of work.)

I actually really like Termion, because it lets me interact with things at a lower level. When I was using e.g. RustBox (a termbox wrapper) before that, I ran into weird issues trying to manage unicode printing, and it was hard to fix because the library abstracted too many things away. But, indeed, it doesn't properly support windows terminals (yet).

It would be nice to have something more on the level of Termion, but that tries to be cross-platform. I don't really mind handling my own off-screen buffer etc., and in fact like having the control to manage that kind of stuff myself. Maybe I'm weird that way. :-)

cessen commented 6 years ago

Removed bidirectional iterators from the 1.0 checklist. That can always be added later in a backwards-compatible way, and isn't necessary for Ropey to be stable and useful.

cessen commented 6 years ago

There are now examples in the examples directory showing how to efficiently implement:

- An iterator over graphemes of a Rope/RopeSlice.
- Functions to find the prev/next grapheme boundary from a char position in a Rope/RopeSlice.
- A function to determine if a given char position in a Rope/RopeSlice is a grapheme boundary.

cessen commented 5 years ago

I think Ropey is pretty much ready for a 1.0 release now. I would like to do a final documentation + examples pass before publishing, but other than that I believe this is ready to go!

cessen commented 5 years ago

1.0 is released!

cessen / ropey

Roadmap to 1.0 #8